Shrunken Sequential Files


  • Shrunken Sequential Files

    There’s a lot of air in sequential files. Suppose you have data in two arrays
    a() as string, b() as long
    (with no double quotes inside the strings) and save it all in a sequential file:

    For i = 0 to N
    Write #1, a(i), b(i)
    Next


    In an editor the resulting file will look something like:

    "asdasd",23
    "fdfasf,affdsa",9
    "",85
    "asdfsadf",0
    "ffddsa",4


    etc., with a CRLF (carriage return, linefeed) after each line, including the last. Either a comma or a CRLF (never both together) marks the end of a field.

    Though it may not be in the BASIC language specification, the quotes around a string are unnecessary when the program later inputs this sequential file, provided the string contains no comma or CR. Both an empty string and a numeric zero can be omitted (as long as the comma is kept); on input the empty field is read as "" or 0 depending on the data type of the input variable. Finally, every CRLF can be replaced with a comma, including the last one.

    Thus the above data can be saved to a less airy file, one that in an editor looks like:

    asdasd,23,"fdfasf,affdsa",9,,85,asdfsadf,,ffddsa,4,

    with no CRLF at the end. And this file can be read using exactly the same code that reads the first file:

    For i = 0 to N
    Input #1, a(i), b(i)
    Next


    The problem is how to save the data the less airy way. The solution is to consider the file as a binary file simulating a sequential file. (Of course in the last analysis a file is just bytes and is called this or that type of file only because it is written or read a certain way.) The following code (PBCC) illustrates this idea.
    Code:
    $comma = ","
    '========================================
    'This puts quotes around a string only if it contains a comma or
    'carriage return, then adds a comma.
    Function Format1(a As String) As String
     If InStr(a, $comma)  > 0 Or  InStr(a, $Cr)  > 0 Then
      Function = $Dq + a + $Dq + $comma
     Else
      Function = a + $comma            'empty strings go here
     End If
    End Function
    '========================================
    
    'This converts numbers to strings, but if zero
    'makes it nothing, then adds a comma.
    Function Format2(n As Long) As String
     If n Then
      Function = Format$(n) + $comma
     Else
      Function = $comma                'zeros go here
     End If
    End Function
    '========================================
    
    'Create data and a standard sequential file.
    'Then create a shrunken sequential file as a 
    'binary file, then read it as a sequential file.
    'Are the printouts the same?
    Function PBMain
     Dim i As Long, k As Long, p As String
     Dim a(10) As String, b(10) As Long
     Dim aa As String, bb As Long
     Dim OurFileNew As String, OurFileOld As String
    
     OurFileNew = "fooNew.txt"
     OurFileOld = "fooOld.txt"
    
     'create data a(), b() and save to a sequential file
     Open OurFileOld For Output As #1
     For i = 1 To 10
      k = Rnd(1,7)
      a(i) = Mid$("abcd,efg", k, Rnd(2, 9-k))   'random string
      b(i) = Rnd(0,10)                         'random number
      Write #1, a(i), b(i)
      Print a(i), b(i)
     Next
     Close #1
     Print "-----------------"
    
     'Must kill, rename, or make zero length
     'any existing OurFileNew because binary output
     'doesn't truncate.
     If Len(Dir$(OurFileNew)) Then Kill OurFileNew
    
     'open/create another file as a binary file,
     'write data a new way
     Open OurFileNew For Binary As #1
     For i = 1 To 10
      p = Format1(a(i)) + Format2(b(i))
      Put #1, , p
     Next
     Close #1
    
     'open as a sequential file, see if data is the same
     Open OurFileNew For Input As #1
     Do Until Eof(1)
      Input #1, aa, bb
      Print aa, bb
     Loop
     Close #1
     Print
    
     WaitKey$
    End Function
    Note the Kill instruction. Unlike when you open a file as sequential (“For Output”), opening it for binary, then writing to it, then closing it again doesn’t truncate the file to the length of what was written. If the written data contains fewer bytes than the original file, the old bytes lying beyond the written data will remain there. When the program later reads the file sequentially, you’re in trouble when it gets to the old data.

    So before writing the binary file you must first either kill the original, or rename it, or open it for sequential output and immediately close it, which last sets its length to zero. (Another solution, if the file grows in the long run, is to have a special ending record that simply marks where the sequential input must stop.)
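
    A minimal sketch of the two alternatives to Kill, for PBCC. The file name and the "~END~" marker are made up for illustration; any string that can never occur in the real data would do as the marker.

    ```powerbasic
    Function PBMain
     Dim aa As String, bb As Long

     'Alternative 1: truncate by opening for sequential output
     'and closing at once; the file is now zero length.
     Open "fooNew.txt" For Output As #1
     Close #1

     'Alternative 2: leave old bytes in place but end the new data
     'with a marker record. First simulate an older, longer file ...
     Open "fooNew.txt" For Binary As #1
     Put$ #1, "old1,1,old2,2,old3,3,"
     Close #1
     '... then overwrite its start with shorter data plus the marker.
     Open "fooNew.txt" For Binary As #1
     Put$ #1, "new,9," + "~END~,,"
     Close #1

     'Read sequentially, stopping at the marker so the stale
     'bytes after it are never taken for data.
     Open "fooNew.txt" For Input As #1
     Do Until Eof(1)
      Input #1, aa, bb
      If aa = "~END~" Then Exit Do
      Print aa, bb
     Loop
     Close #1

     WaitKey$
    End Function
    ```

    The marker costs a few bytes per file, so it only pays off when killing or truncating the file is inconvenient.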

    I shrank a “real world” sequential data file by 21% using this simple technique, and since only my program reads it, the format doesn’t matter.

    (Of course going binary all the way would be the most compact because numbers could be represented by the number of bytes they take up in use, instead of by strings of the decimal version, and strings with commas or carriage returns could be delimited by a one byte special character. But it’s a lot of trouble for little improvement if most of the numbers are single digits and most of the strings are made of letters.)

    Now I didn’t want the customer to feel cheated when he saw his shrunken data file, so I have my program pad it back up to its original size using a custom made bloat procedure.
    Last edited by Mark Hunter; 18 Dec 2009, 10:14 PM.
    Algorithms - interesting mathematical techniques with program code included.

  • #2
    So before writing the binary file you must first either kill the original, or rename it, or open it for sequential output and immediately close it, which last sets its length to zero. (Another solution, if the file grows in the long run, is to have a special ending record that simply marks where the sequential input must stop.)
    Code:
    SEEK     hFile, FILEATTR(hFile, -2&)   ' logical first byte based on BASE= statement in OPEN
    SETEOF  hFile   ' truncate here
    Michael Mattias
    Tal Systems (retired)
    Port Washington WI USA
    [email protected]
    http://www.talsystems.com



    • #3
      [-withdrawn-]
      Last edited by Mark Hunter; 19 Dec 2009, 08:28 PM.


      • #4
        If the aim is to "take the wind out of a sequential file by writing it as binary while still reading it as sequential," then the easy way is to save each record as an element of a string array and use the PUT hFile, , stringArray() statement.

        That automatically writes the length words and the data (something many of us used to do by hand, because this is a fairly recent addition to the compiler's command set, recent being relative), and you can read it back with a simple "GET hFile, , StringArray()" statement.
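
        A rough PBCC sketch of that idea, assuming PUT and GET handle dynamic string arrays as described above. The file name "fooArr.dat" and the sample records are made up, and the receiving array is assumed to need dimensioning to the right size before the GET.

        ```powerbasic
        Function PBMain
         Dim i As Long
         Dim recs(1 To 3) As String   'one whole record per element
         Dim back(1 To 3) As String   'same bounds, to be filled by GET

         recs(1) = "asdasd,23"
         recs(2) = "fdfasf,affdsa,9"
         recs(3) = ",85"

         'binary output doesn't truncate, so remove any old copy first
         If Len(Dir$("fooArr.dat")) Then Kill "fooArr.dat"

         Open "fooArr.dat" For Binary As #1
         Put #1, , recs()             'writes every element of the array
         Close #1

         Open "fooArr.dat" For Binary As #1
         Get #1, , back()             'reads the elements back in order
         Close #1

         For i = 1 To 3
          Print back(i)
         Next
         WaitKey$
        End Function
        ```

        Commas inside a record need no quoting here, because each element is stored with its length rather than delimited.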

        These are just some other things to look at..things you, too, may have missed. Me, I am pretty good at picking up the "NEW!" stuff in upgrades, but I frequently miss the new features added in the "IMPROVED!" statements and functions.

        MCM
        Michael Mattias


        • #5
          The above shrunken file method uses a binary write and a sequential read. We can use a binary read as well by getting the entire file into one string and parsing.

          Items are assumed not to have commas. That is, any commas in the input have been replaced by a symbol (not allowed in the input) standing for a comma before being put in the file. When read, the symbol is replaced by a comma before being shown to the user. (The same could be done for quote-marks, in which case the first do-loop within the master do-loop below can be omitted. Or you could use PB’s Parse instruction.)

          If the data is divided into records you would need to count off the items in each record.

          For PBCC:
          Code:
          Function PBMain
           Dim a As String, p As Long, n As Long, i As Long

           'test input
           a = "xxx,yyy,""uuu,vvv"",zzz,"
           ' a = """uuu,vvv"",xxx,yyy,zzz,"
           ' a = "xxx,yyy,zzz,""uuu,vvv"","
           ' a = ",,,,"
           ' a = ""
           ' a = "xxx"
           ' a = "xxx,yyy"
           ' a = "xxx,yyy"",zzz,"
           ' a = "xxx,""yyy,zzz," '<-- hangs unless check for over-running string length

           Print a : Print
           n = Len(a)
           p = 1
           '------------------------
           Do
            'quotemark?
            If Asc(a, p) = 34 Then
             i = p + 1
             'scan for concluding quotemark
             Do
              Incr p
              If Asc(a, p) = 34 Then Exit Do
             Loop Until p > n
             If p > n Then Print "corrupted string" : WaitKey$ : Exit Function
             Print Mid$(a, i, p - i)
             Incr p
            Else
             i = p
             'scan for concluding comma
             Do
              If Asc(a, p) = 44 Then Exit Do
              Incr p
             Loop Until p > n
             Print Mid$(a, i, p - i)
            End If
            Incr p
           Loop Until p > n
           '-----------------------
           WaitKey$
          End Function
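
          On counting off the items in each record, here is a small sketch for PBCC. It assumes two items per record (a string, then a number), no quoted items, and a trailing comma as in the shrunken file. The variable names and sample data are made up for illustration.

          ```powerbasic
          Function PBMain
           Dim a As String, fld As String, s As String
           Dim p As Long, q As Long, k As Long

           a = "xxx,1,yyy,,zzz,3,"        'three records of two items each
           p = 1
           Do While p <= Len(a)
            q = InStr(p, a, ",")          'next comma at or after p
            If q = 0 Then q = Len(a) + 1  'no final comma: item runs to the end
            fld = Mid$(a, p, q - p)
            Incr k                        'count off the items
            If k Mod 2 = 1 Then
             s = fld                      'first item of the record
            Else
             Print s, Val(fld)            'second item completes the record
            End If
            p = q + 1
           Loop
           WaitKey$
          End Function
          ```

          The omitted zero in the second record comes back as 0 because Val of an empty string is zero.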


          Algorithms - interesting mathematical techniques with program code included.



          • #6
            The above shrunken file method uses a binary write and a sequential read. We can use a binary read as well by getting the entire file into one string and parsing.
            Not officially supported, but works:

            By keeping track of where you are in a file (SEEK() function), you can "go back" or "jump around" even when a file is opened in sequential mode (OPEN .. FOR INPUT). You place the file pointer where desired (SEEK statement) and use sequential access commands (LINE INPUT, including the LINE INPUT of PB arrays) at that point.
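
            A sketch of that trick for PBCC, with the same caveat that it relies on behavior described here as working but not officially supported. It assumes a text file "fooOld.txt" with at least one line already exists; the variable names are made up.

            ```powerbasic
            Function PBMain
             Dim here As Quad, s As String, t As String

             Open "fooOld.txt" For Input As #1
             here = Seek(1)        'remember where the next line starts
             Line Input #1, s      'read it once
             Seek #1, here         'jump back to the remembered spot
             Line Input #1, t      'read the same line again
             Close #1

             Print s
             Print t               'should match s if the jump worked
             WaitKey$
            End Function
            ```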

            Back to supported stuff:

            On the underlying issue here: assuming you control the design of the file, comma-separated values and numerics expanded into display decimals put a lot of air in the file, too. Here you could write delimited data, or even explicitly specify the length of the data that follows.

            In this example..
            Code:
            FOR I = 1 TO numRec&
                Write #1, a(i), b(i)   ' a() is a string, b() is a LONG
            NEXT
            You could write this as..
            Code:
            OPEN filename FOR BINARY AS hFile
            FOR Z = 1 TO NumRec&              ' number of records to be written
               charlen& = LEN(A(Z))           ' length of the string, held in a LONG variable
               PUT hFile, , charlen&          ' write length of the string as a LONG
               PUT$ hFile, A(Z)               ' write characters of string data element
               PUT hFile, , B(Z)              ' write the second value (a LONG)
            NEXT
            And retrieve with ..

            Code:
            OPEN filename FOR BINARY ..as hFile
            DO
               GET hFile, , charlen&       ' read (LONG) character length for first element
               GET$ hFile, charlen&, A$    ' read the string character value
               GET hFile, , Bvalue&        ' read the LONG value
            LOOP UNTIL SEEK(hFile) >= LOF(hFile) + 1   ' Exit when you have reached (but not exceeded) filesize
            No, you can't call up the file in Notepad to check your work.. but you get the air out of the file.
            Michael Mattias


            • #7
              Originally posted by Michael Mattias
              Not officially supported, but works:
              By keeping track of where you are in a file (SEEK() function), you can "go back" or "jump around" even when a file is opened in sequential mode (OPEN .. FOR INPUT).
              What makes you say "not officially supported"?

              The Help section on SEEK imposes no such restriction and indeed it makes specific reference to using SEEK for ANY type of open file.
              "If file filenum& was opened in random-access mode, SEEK returns the record number of the next record to be written or read as a Quad-integer (64-bit) value. If the file was opened in any other mode, SEEK returns the byte position of the next byte to be written or read"
              =========================
              https://camcopng.com
              =========================



              • #8
                The SEEK() function is totally supported. What is not explicitly supported is the use of the SEEK statement followed by the use of the sequential input commands INPUT# and LINE INPUT on a file which has been opened in either PB-supported random access mode (RANDOM, BINARY).

                Of course the doc still refers to sequential, random and binary files, an error going back to the PB for MS-DOS days. A file is a file is a file; but the access mode may be sequential, or random access.

                That said, I probably should not have mentioned it, and stuck to the thread raison d'etre of "shrinking your storage requirements."
                Michael Mattias