How to corrupt a data file!

  • How to corrupt a data file!

    This is *NOT* a complaint about PowerBasic. It describes a very subtle
    kind of data file corruption that can happen with PB and other file
    tools, one I had no idea could happen in all these years, until today.

    For a number of purposes at a few places I use string data in which
    the end of line is not <CR LF> but just a single ASC(10), followed
    by more characters, then another ASC(10), and so on. I write that
    string with the WRITE statement in PB 3.5, and read it back with
    PB 3.5 as well. The data in the case where I observed this interesting
    way to corrupt that string is never changed by normal client actions
    in my suite. It, and the CRC checksums for things related to it, are
    purely administrative matters. Imagine my surprise when yesterday I
    saw the whole thing come apart!

    Here is a sample of the data, before the corruption, that later got corrupted:

    000050 45 20 45 58 43 4C 55 53 49 56 45 20 55 53 45 20 E EXCLUSIVE USE
    000060 4F 46 3A 0A 0A 0D 5A 69 70 6C 6F 67 2C 20 49 6E OF: Ziplog, In
    Note the adjacent <0A 0A 0D> characters in the above string data.

    Suddenly I see the following:

    000050 48 45 20 45 58 43 4C 55 53 49 56 45 20 55 53 45 HE EXCLUSIVE USE
    000060 20 4F 46 3A 0D 0A 0D 0A 0D 5A 69 70 6C 6F 67 2C OF: Ziplog,
    Note the corruption into the sequence <0D 0A 0D 0A 0D> in the above!

    In years and years of use this has never been a problem. PB 3.5 has
    read and written this data countless hundreds of thousands of times
    without EVER showing the above.

    Well, until yesterday. And it is *NOT* a PB 3.5 issue either. Here
    is how it can be provoked.

    I started using WordStar back in the CP/M-86 days. Oh yes. I know
    very well one of the issues with it, which persists even in Version 7
    for DOS and in WordStar for Windows. Even in TEXT mode, it pads the
    file from the end of the text out to the next sector boundary with
    additional EOF <1A> characters, as in:

    001500 6C 20 72 65 63 6F 72 64 2E 0D 0A 1A 1A 1A 1A 1A l record.
    And yes, I'm fully aware that this can cause serious problems for
    some programs and data files. A prime example is the early CONFIG.SYS
    file in OS/2. Until IBM fixed that issue, a whole host of troubles
    could be caused in OS/2 setup operations, particularly in regard to
    PEER LAN work. So if you use WS or any other program which does
    this, and there are others, the only way to be sure that you don't
    corrupt data files is to strip the surplus EOF characters from the
    end of the file whenever you use an editor which produces them.
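
For what it's worth, stripping the surplus EOF characters can be sketched in a few lines. This is a minimal illustration in Python rather than PB, and it assumes the file is handled as raw bytes; the name `strip_eof_padding` is made up for the example.

```python
EOF_PAD = b"\x1a"  # Ctrl-Z, the CP/M-style end-of-file marker


def strip_eof_padding(data: bytes) -> bytes:
    """Return data without any run of 0x1A bytes at the very end."""
    # rstrip only touches the tail, so a 1A byte inside the data is kept.
    return data.rstrip(EOF_PAD)


# The tail of the dump above: "l record." + CR LF + five 1A pad bytes.
padded = b"l record.\r\n" + EOF_PAD * 5
cleaned = strip_eof_padding(padded)
print(cleaned.hex(" "))
```

Reading and writing the file in binary mode matters here, since a text mode read may stop at the first 1A or translate line ends on the way through.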

    And for all source code work with the PB 3.5 compiler, Bob has been
    very careful, in his better-than-WS compatibility work, to ensure
    that if you do use an editor which leaves them in the source, it
    makes no difference to the PB toolset. So what.

    But .. that turns out to be NOT TRUE in relation to data files with
    embedded <0A 0A> constructs further up in a file that has the EOF
    characters at the end of it!

    What I can now prove is that if, by accident, the EOF characters are
    present at the end of a DATA file, and I then EDIT that data file
    with, say QEDIT for DOS, or TED for DOS, which does *NOT* add the
    trailing EOF characters, the following will happen.

    1.) The trailing EOF characters will be eliminated when you simply
    open and save the data file.

    2.) Each of the CHR$(&H0A) bytes in the data file ABOVE the end of the
    file will get a PRECEDING CHR$(&H0D) inserted in front of the correct
    CHR$(&H0A) data for that byte in the file!!
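
The two effects above can be modelled on raw bytes. The sketch below is in Python and is only an assumption about what the resave does, not code taken from QEDIT or TED; `editor_resave` is a hypothetical name. Fed the <0A 0A 0D> bytes from the first dump, it produces the <0D 0A 0D 0A 0D> sequence seen in the second.

```python
import re


def editor_resave(data: bytes) -> bytes:
    """Model the two effects listed above on a block of raw bytes."""
    # 1. Drop the trailing run of 0x1A end-of-file pad bytes.
    data = data.rstrip(b"\x1a")
    # 2. Put a 0D in front of every 0A that does not already follow a 0D.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


# "OF:" + 0A 0A 0D + "Zip" as in the first dump, plus three pad bytes.
before = bytes.fromhex("4F 46 3A 0A 0A 0D 5A 69 70") + b"\x1a" * 3
after = editor_resave(before)
print(after.hex(" ").upper())
```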


    I can absolutely prove this happens when a data file that was saved
    by a simple text mode open and close in WS7 is then put through a
    simple open and save in QEDIT for DOS (or QEDIT for OS/2!), and worse ..

    If a data file that has the trailing EOF marks is READ with PB 3.5
    for DOS, and it absolutely *DOES* have the proper construct in the
    data above them, then when that file is written again with *NO*
    change in the string data for the afflicted string, here in my world
    it is written back to the disk with the corrupted data in it.


    I have not checked this on any development system other than DOS-VDM
    work in OS/2, so I can't tell whether it is also an issue in FreeDOS
    or MS-DOS 6.2+, for example. It is *NOT* an issue for any normal use
    of such data files with PB 3.5, whether in DOS-VDM, FreeDOS or
    MS-DOS 6.2+. It's not a user issue, as I would view this.

    But as a developer, I find this a very interesting glitch, and one
    that could produce a very hard to find error, which is why I thought
    I would describe it in an effort to help others here. Just a casual
    look at the data with any tool that treats the EOF characters
    differently, and even an inadvertent save of that file, can really
    create havoc...
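
One way to catch this before it bites is to scan a data file for the combination described in this post: trailing EOF padding together with 0A bytes that have no 0D in front of them. The following is a rough sketch in Python, working on raw bytes; the function names and the "at risk" rule are assumptions made for illustration only.

```python
import re


def eol_risk_report(data: bytes) -> dict:
    """Count the byte patterns that matter for the glitch described above."""
    stripped = data.rstrip(b"\x1a")
    return {
        "trailing_eof": len(data) - len(stripped),        # 1A pad bytes at the end
        "crlf": data.count(b"\r\n"),                       # normal DOS line ends
        "bare_lf": len(re.findall(rb"(?<!\r)\n", data)),   # 0A with no 0D before it
    }


def at_risk(data: bytes) -> bool:
    """True when the data has both trailing 1A padding and bare 0A bytes."""
    report = eol_risk_report(data)
    return report["trailing_eof"] > 0 and report["bare_lf"] > 0


sample = b"OF:\n\n\rZiplog\r\n" + b"\x1a" * 5
print(eol_risk_report(sample), at_risk(sample))
```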

    FWIW ..

    Mike Luther
    [email protected]

  • #2
    Let's cover a few basics here. In the DOS world, each end of line
    is represented by TWO characters: a $CR carriage return, which is
    decimal 13 and appears as <0D> in hex, and a $LF line feed character,
    which is decimal 10 and shows up as <0A>. Remember that working
    together, these two mean the end of one line <0D 0A>.

    In the Unix/Linux/Mac world a single character, the $LF <0A>, is used
    to represent the end of a single line. There are utilities and
    even PowerBasic programs to convert files between the two formats.
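
As a rough idea of what such a conversion utility does, here is a minimal sketch in Python (not PowerBasic) that works on raw bytes. The function names are made up for the example.

```python
import re


def to_lf(data: bytes) -> bytes:
    """Convert DOS line ends (0D 0A) to single 0A line ends."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert bare 0A line ends to DOS 0D 0A, leaving existing pairs alone."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


dos_text = b"NOW\r\nIS\r\nTHE\r\nTIME.\r\n"
unix_text = to_lf(dos_text)
print(unix_text.hex(" "))
print(to_crlf(unix_text) == dos_text)
```

Note that a converter like `to_crlf` cannot tell a line end from a 0A that is part of the data, so it is only safe on files where every 0A really is a line end.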

    Someone once posted a notice on these forums that apparently, PB
    recognizes the end of a line based solely on the $CR <0D> character.
    I haven't tested this, but that in itself is not really an issue,
    unless you are not using the $CRLF <0D 0A> format. But if you use
    the PRINT or PRINT# commands, each print will by default end with
    a $CRLF. For PRINT, though, you can suppress this natural
    behavior with a final semi-colon (;). If you only want it to put
    out a $LF instead of a $CRLF, then you could do a PRINT $LF;.
    On the other hand, the WRITE command does not support the
    use of a final semi-colon to suppress its normal behavior, which
    is also to put out a $CRLF. So the reason you are getting a
    $CRLF when your original file only used $LF is because you are
    using the wrong command for the job. You should be using
    PRINT $LF; to ensure you only get a line feed character.
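
The same idea, writing the line feed yourself so that nothing else gets appended, can be illustrated outside PB. This is a small sketch in Python, offered only as an analogy to PRINT $LF; and not as a description of PB's behavior: the file is opened in binary mode so the runtime cannot translate the 0A. The function name and the file name are hypothetical.

```python
import os
import tempfile


def write_lf_lines(path: str, lines: list) -> None:
    """Write each string followed by a single 0A, with no newline translation."""
    # Binary mode ("wb") keeps the runtime from turning 0A into 0D 0A.
    with open(path, "wb") as f:
        for line in lines:
            f.write(line.encode("ascii") + b"\n")


# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "lf_only.dat")
write_lf_lines(path, ["NOW", "IS", "THE", "TIME."])
with open(path, "rb") as f:
    written = f.read()
print(written.hex(" "))
```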

    You may like the INPUT# and WRITE combination, but based on the
    information you provided, this is not the optimum combination
    for your application. You can of course consider the BINARY
    file mode, where you can use GET$ and its complement, PUT$.

    This is a sample program showing what happens when you use the
    WRITE command. Note in this case I used GET$ to read the whole
    file back into a single string, then the code shows you the
    exact contents of that string. You should have no problem
    using PARSECOUNT and PARSE$ to extract any delimited items in
    the file. However, since the normal INPUT and WRITE operators
    recognize commas as field separators, and string text inside
    of doublequotes that may also contain commas, you may have some
    problems if you have files that allow the doublequote delimiters.
    But you can work around this. Fact is, you can create an
    intelligent version of a PRINT# operation that writes hardcoded
    comma and tab codes, surrounds text with doublequotes, and knows
    to only write a $LF at the end of each line.
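    #DIM ALL
    #INCLUDE "c:\win32api\"
    GLOBAL a, b, c, d, e, f, g, h, i, j, k, l AS LONG
    GLOBAL aa, bb, cc, dd, ee, ff, gg, hh, ii AS STRING
    GLOBAL m, n, o, p, q, r, s, t AS DWORD
    GLOBAL u, v, w, x, y, z AS DOUBLE

    FUNCTION PBMAIN () AS LONG
      COLOR 15,1
      bb = "TEST.TXT"               ' hypothetical file name; bb must be set before OPEN
      ' Write four fields, showing file position and length after each WRITE
      OPEN bb FOR OUTPUT AS #1
      WRITE #1, "NOW"
      PRINT SEEK(1),LOF(1)
      WRITE #1,"IS"
      PRINT SEEK(1),LOF(1)
      WRITE #1,"THE"
      PRINT SEEK(1),LOF(1)
      WRITE #1,"TIME."
      PRINT SEEK(1),LOF(1)
      CLOSE #1
      ' Read the whole file back into one string in BINARY mode
      OPEN bb FOR BINARY AS #1
      PRINT LOF(1)
      GET$ 1,LOF(1),aa
      CLOSE 1
      ' Show every byte in hex, then the string itself
      FOR a=1 TO LEN(aa)
        PRINT HEX$(ASC(aa,a),2);" ";
      NEXT a
      PRINT
      PRINT aa
    END FUNCTION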
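
The "intelligent" writer mentioned before the listing, one that puts doublequotes around text, separates fields with commas, and ends each line with only a $LF, might look roughly like this. The sketch is in Python and leans on its standard csv module instead of PB; `format_record` and `parse_record` are names invented for the example, and the quoting rules are those of the csv module, not of PB's WRITE.

```python
import csv
import io


def format_record(fields) -> bytes:
    """Build one record: quoted text, bare numbers, commas, and a single 0A."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue().encode("ascii")


def parse_record(line: bytes) -> list:
    """Split one record back into fields; commas inside quotes are kept."""
    reader = csv.reader(io.StringIO(line.decode("ascii"), newline=""))
    return next(reader)


record = format_record(["Ziplog, Inc.", 42])
print(record.hex(" "))
print(parse_record(record))
```

Every field comes back from `parse_record` as text, so numeric fields have to be converted by the caller.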
    Old Navy Chief, Systems Engineer, Systems Analyst, now semi-retired

    [This message has been edited by Donald Darden (edited September 19, 2005).]