Problem with read in large freefile


  • Problem with read in large freefile

    Hi,

    I'm using the GetLines function from here to read in an 800MB CSV file:



    This works fine for smaller CSVs, but for the 800MB one:

    Code:
    UBOUND(pLine())+1

    shows "1" (and there are 450,000 rows)

    and

    Code:
    pz = pLine(0)
    a = @pz
    ? a

    shows blank (the same for pLine(1), etc.)

    Could it be that there is a limit on the size that can be read in? If so, how do I get around it?

    Thanks

  • #2
    You are probably running out of memory.
    800MB for the buffer plus 800MB for the array = 1600MB of string storage space required. (And the REPLACE will use more.)

    You could try
    #OPTION LARGEMEM32


    Do you realise that that 21-year-old code can be replaced with the PB functions FILESCAN, or PARSECOUNT and PARSE, or LINE INPUT#?
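
    A minimal sketch of the FILESCAN route, assuming PB/Win 9+ and an illustrative file name ("data.csv" is not from this thread):

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN() AS LONG
        LOCAL nRecs AS LONG
        LOCAL sLines() AS STRING
        ' Count the lines first, size the array once, then read them all
        OPEN "data.csv" FOR INPUT AS #1
        FILESCAN #1, RECORDS TO nRecs
        DIM sLines(1 TO nRecs)
        LINE INPUT #1, sLines()    ' the array form reads the remaining lines in one call
        CLOSE #1
        ? "Read" & STR$(nRecs) & " lines"
    END FUNCTION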



    • #3
      Thanks. I used:

      Code:
      #COMPILE EXE
      #INCLUDE "WIN32API.INC"
      OPTION EXPLICIT
      GLOBAL hDlg AS DWORD
      OPTION LARGEMEM32
      and I got "Error - Duplicate Definition"


      As for the 21-year-old code - how would I change this? (sorry, a newbie!)

      The reason I used that code is that I'm running under Linux and WINE, and on hex inspection the end-of-line chars are CHR$(10).



      • #4
        That's
        #OPTION LARGEMEM32

        OPTION EXPLICIT (without the #) is a completely different statement.
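
        A sketch of post #3's corrected header with the metastatement spelled properly (the ordering shown is just one workable arrangement):

        Code:
        #COMPILE EXE
        #OPTION LARGEMEM32      ' metastatement: note the leading #
        #INCLUDE "WIN32API.INC"
        OPTION EXPLICIT
        GLOBAL hDlg AS DWORD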

        Cheers,
        Dale



        • #5
          I think you will find that this is fast enough

          Code:
          #COMPILE EXE
          #DIM ALL
          
          FUNCTION PBMAIN() AS LONG
              'Uncomment the next two lines on the first run to create the test file
              'BuildFile
              'EXIT FUNCTION

              ParseBinInput
          END FUNCTION
          
          
          SUB BuildFile()
              ' Build a test file on the first run
              LOCAL strLine AS STRING
              LOCAL lngRecs AS LONG
              strLine = STRING$(1800,"A")
              OPEN "Demo.txt" FOR OUTPUT AS #1
              FOR lngRecs = 1 TO 450000
                  PRINT #1, strLine
              NEXT
              CLOSE #1
              ? "Demo.txt Built"
          END SUB
          
          SUB ParseBinInput()
              LOCAL strBuf AS STRING
              LOCAL sArr() AS STRING
              LOCAL lngRecs AS LONG
              'Open for binary and parse
              OPEN "Demo.txt" FOR BINARY AS #1
              GET$ #1, LOF(1), strBuf
              CLOSE #1
              lngRecs = PARSECOUNT(strBuf,CHR$(13,10))
              ? "Records: " & STR$(lngRecs)
              DIM sArr(1 TO lngRecs)
              PARSE strBuf, sArr(), CHR$(13,10)
              ? "Last record: " & sArr(lngRecs-1) 'demo.txt ends with CRLF so last record is empty
          END SUB



          • #6
            Originally posted by Alex Chambers View Post
            As for the 21-year-old code - how would I change this? (sorry, a newbie!)
            The reason I used that code is that I'm running under Linux and WINE, and on hex inspection the end-of-line chars are CHR$(10).
            Replace CHR$(13,10) in post #5 with CHR$(10).
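
            That is, the parsing lines from post #5 become:

            Code:
            lngRecs = PARSECOUNT(strBuf, CHR$(10))   ' count LF-delimited records
            DIM sArr(1 TO lngRecs)
            PARSE strBuf, sArr(), CHR$(10)           ' split on LF instead of CRLF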



            • #7
              Thank you - that is a great help

              It looks like it solves my problem.



              • #8
                Do you really need to get all the lines into memory at once?

                If you have a "GetLines()" call, it's the same as doing a LINE INPUT once per line and adding each line to a PB Array.

                SO... if you don't need all the lines at once, just get one at a time and work on them that way.

                Note that by using the LEN= clause of the OPEN statement when the file is opened in a sequential mode (INPUT or OUTPUT), you are specifying the number of bytes to read at one time. You can set this (at least) as high as 32K. While that does not make all the lines in that 32K of data programmatically accessible, it does cause each physical disk read to grab that much data regardless of content, so subsequent LINE INPUT calls can read from the buffered data without an additional physical disk access.

                That is, it will be faster. It also saves your process's memory, but most people won't care about that unless they already have a memory-hungry program.
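
                A minimal sketch of that pattern (the file name and the per-line work are placeholders):

                Code:
                #COMPILE EXE
                #DIM ALL

                FUNCTION PBMAIN() AS LONG
                    LOCAL sLine AS STRING
                    ' LEN = 32768 requests a 32K read-ahead buffer per physical disk read
                    OPEN "data.csv" FOR INPUT AS #1 LEN = 32768
                    WHILE ISFALSE EOF(1)
                        LINE INPUT #1, sLine
                        ' ... process one line at a time here ...
                    WEND
                    CLOSE #1
                END FUNCTION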

                However, if you cannot live with processing one record at a time, describe your application here and someone will suggest "another way" to solve the problem. After all, some of us, er, um, "veterans" here learned to program with only about 250 KB (that's KILOBYTES with a K) of user memory available for data... and we found ways to do things just fine.


                [LATER]
                Actually, if it's a CSV file you could read it with OLE or ODBC DB access. ADO for CSV files might be the easiest.

                Then again, that might be overkill; the application is not currently known.
                [/LATER]



                Michael Mattias
                Tal Systems (retired)
                Port Washington WI USA
                [email protected]
                http://www.talsystems.com



                • #9
                  Thanks for your post Michael.

                  I'd love a quicker way - the way I am currently using is far too slow. I would definitely appreciate some veteran experience

                  I have a CSV file, 800MB or so. Future ones could be 1.5GB. This one currently has 450,000 rows and 40 columns of data - a mix of some character-string data, but mostly SINGLE-type numbers.

                  I am doing array processing on all the rows in memory, so I'd like to read the CSV into a multidimensional array in memory and access it that way.

                  However - and probably where it slows down - is the read into memory from disk to an array.

                  I only want columns # 1, 5, 8, 10, 11, 28, 33 and 39 from the csv file to be read into memory. The rest can be discarded.


                  Pseudo code would be:


                  Code:
                  GLOBAL csvarray() as SINGLE
                  rows = Read in the number of rows in the csv file on disk
                  
                  REDIM csvarray(rows, 7)
                  
                  for row = 0 to rows
                  
                  csvarray(row, 0) = VAL(csvfileondisk(row, 1))
                  csvarray(row, 1) = VAL(csvfileondisk(row, 5))
                  csvarray(row, 2) = VAL(csvfileondisk(row, 8))
                  csvarray(row, 3) = VAL(csvfileondisk(row, 10))
                  csvarray(row, 4) = VAL(csvfileondisk(row, 11))
                  csvarray(row, 5) = VAL(csvfileondisk(row, 28))
                  csvarray(row, 6) = VAL(csvfileondisk(row, 33))
                  csvarray(row, 7) = VAL(csvfileondisk(row, 39))
                  next row
                  One bottleneck here is converting the csvfileondisk(row,xxx) string to a SINGLE in csvarray().

                  And another is reading each line & PARSEing the required columns.

                  The CSV file on disk is from Linux and has $LF as an end of line separator.

                  Fast algorithms very welcome



                  • #10
                    How slow is "too slow"?
                    If you post your current import code, and a sample of the first few lines of your CSV file, we would be in a much better position to help you.



                    • #11
                      Hi - Section A below has not completed in 7 hours (per debug inspection).

                      Section B is yet to be tested, however I am looking for faster solutions...

                      "rows" variable contains the exact number of rows that are not null


                      Code:
                      REDIM mainCSVstring(0)
                      
                      ' ~~~~~~~~~ SECTION A ~~~~~~~~~~~~
                      ' Read in the CSVfileondisk
                      
                      
                      LOCAL str_filter, q, qq AS STRING
                      LOCAL filelen AS LONG
                      LOCAL strBuf AS STRING
                      LOCAL i, cc AS LONG
                      
                      fn = "800MBCSV.csv"
                      
                      OPEN fn FOR BINARY AS #1
                        GET$ #1, LOF(1), strBuf
                      CLOSE #1
                      
                      str_filter = $LF
                      filelen = PARSECOUNT(strBuf, str_filter)
                      
                      cc = -1
                      
                      IF filelen > 0 THEN
                        FOR i = 1 TO filelen
                          q = PARSE$(strBuf, ANY str_filter, i)
                          qq = TRIM$(q)
                          IF qq <> "" THEN
                      
                             cc = cc + 1
                             REDIM PRESERVE mainCSVstring(cc)
                             mainCSVstring(cc) = qq
                      
                          END IF
                        NEXT i
                      END IF
                      
                      rows = cc
                      
                      ' ~~~~~~~~~ END OF SECTION A ~~~~~~~~~~~~
                      
                      ' ~~~~~~~~~ SECTION B ~~~~~~~~~~~~
                      
                      REDIM csvarray(rows, 7)
                      
                      {{I have code which maps the column names from the csv disk file to this one - not included here. This part does not pose a problem }}
                      
                      FOR i = 0 TO rows
                        FOR j = 0 TO 7
                          n = PARSE$(mainCSVstring(i), {{ROW REF HERE: 1, 5, 8,10,11,28,33 or 39}})
                          csvarray(i, j) = VAL(n)
                        NEXT j
                      NEXT i

                      PS - The PC I run it on has 8GB of RAM and runs Linux. I run other PB apps on it without issue, so it's not a memory or processor issue as far as I can see.



                      • #12
                        Originally posted by Alex Chambers View Post
                        Hi - Section A below has not completed in 7 hours (per debug inspection).
                        Something is wrong.
                        My code in post #5 reads a 450,000 line, 800MB file and parses it into a 450,000 element array in a few seconds!

                        ---------------------------
                        PowerBASIC
                        ---------------------------
                        Records: 450001 loaded into memory in 4.008 seconds
                        450001 array elements filled in 0.845 seconds
                        ---------------------------
                        OK
                        ---------------------------

                        Code:
                        #COMPILE EXE
                        #DIM ALL
                        
                        FUNCTION PBMAIN() AS LONG
                        LOCAL strBuf AS STRING
                        LOCAL sArr() AS STRING
                        LOCAL lngRecs AS LONG
                        LOCAL s AS STRING
                        LOCAL t AS DOUBLE
                        'Open for binary and parse
                        t = TIMER
                        OPEN "Demo.txt" FOR BINARY AS #1
                        GET$ #1, LOF(1), strBuf
                        CLOSE #1
                        lngRecs = PARSECOUNT(strBuf,CHR$(13,10))
                        t = TIMER - t
                        s = "Records: " & STR$(lngRecs) & " loaded into memoery in " & FORMAT$(t, "0.000") & "seconds"
                        t = TIMER
                        DIM sArr(1 TO lngRecs)
                        PARSE strBuf, sArr(), CHR$(13,10)
                        t = TIMER - t
                        s += $LF & STR$(UBOUND(sArr())) & " array elements filled in " & FORMAT$(t, "0.000") & " seconds"
                        ? s
                        END FUNCTION
                        Can you post the first few lines of your CSV file?



                        • #13
                          OK, the reason it is taking so long is obvious:
                          Your loop is unbelievably inefficient. You have:
                          • 450,000 IF statements (see Note 1 below)
                          • 450,000 assignments of q to qq
                          • 450,000 separate invocations of PARSE$ to locate each line in turn (see Note 2 below)
                          • 450,000 TRIM$ calls (see Note 3 below)
                          • 450,000 REDIM PRESERVEs (which are also very time consuming)

                          In my code above, those 2,250,000 instructions are replaced with TWO instructions which do essentially the same thing in less than a second.

                          Code:
                          DIM sArr(1 TO lngRecs)
                          PARSE strBuf, sArr(), CHR$(13,10)

                          * Note 1: If you suspect that there may be blank lines and you want to get rid of them, see the recent thread on removing blanks from an array: https://forum.powerbasic.com/forum/user-to-user-discussions/programming/809982-is-there-a-way-to-prevent-empty-array-string-elements-from-sorting-above-a-b-c-etc#post809982

                          * Note 2: Very time consuming - each PARSE$ call has to traverse strBuf from the beginning, counting each $LF until it gets up to "i".
                          * Note 3: PARSE automatically strips leading and trailing spaces from values.



                          • #14
                            Thanks Stuart, really appreciate it.

                            I shall make the changes to Section A later today - it is useful to see where the code inefficiencies are

                            Is there a quicker way to do Section B? - as I anticipate this being slow as well!



                            • #15
                              Originally posted by Alex Chambers View Post
                              Thanks Stuart, really appreciate it.

                              I shall make the changes to Section A later today - it is useful to see where the code inefficiencies are

                              Is there a quicker way to do Section B? - as I anticipate this being slow as well!
                              Give us some sample data to work with and we may be able to speed it up



                              • #16
                                The first 80 lines with header row:

                                RAW:


                                CSV download:


                                If you need any more, please let me know



                                • #17
                                  Originally posted by Alex Chambers View Post
                                  The first 80 lines with header row:

                                  RAW:


                                  CSV download:


                                  If you need any more, please let me know
                                  Can you confirm that the originals are LF delimited? (Those two came through as CRLF)

                                  (It's 11:30 pm here, so I'll take a look at it in the morning).



                                  • #18
                                    Hi - the originals are $LF. Maybe pasting into the browser and storing at pastebin converted them.

                                    I wrote a nice bit of code that works it out automatically:

                                    Code:
                                     ' Is this a UNIX ($LF = CHR$(10)) or a Windows ($CRLF = CHR$(13,10)) file?
                                     ' If any CRLF pair is present, treat it as Windows; otherwise as UNIX.
                                     strfilter = IIF$(INSTR(strBuf, $CRLF), $CRLF, $LF)
                                     lngRecs = PARSECOUNT(strBuf, strfilter)
                                    Great, look forward to seeing your code



                                    • #19

                                      Originally posted by Alex Chambers View Post
                                      I only want columns # 1, 5, 8, 10, 11, 28, 33 and 39 from the csv file to be read into memory. The rest can be discarded.
                                      If you treat your file as a database, you can do that: you just SELECT the columns desired. As luck would have it, there is code in the Source Code section to do that right here:

                                      Generic 'ADO' Connection and Query Tester (CC 5+/Win 9+) 11-02-08
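
                                      For a CSV, the "database" is the ODBC text driver. A hypothetical connection string and column-picking query (the folder, file name and F-numbered columns are illustrative, not from this thread):

                                      Code:
                                      LOCAL sConn, sSQL AS STRING
                                      ' Hypothetical Microsoft Text Driver connection string; Dbq names the CSV's folder
                                      sConn = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=C:\Data\;Extensions=asc,csv,tab,txt;"
                                      ' With no header row, the driver exposes the columns as F1, F2, ...
                                      sSQL  = "SELECT F1, F5, F8, F10, F11, F28, F33, F39 FROM alldata.csv"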

                                      Plus, it's not just disk reading when you load your data into a PB string array: it takes time to allocate all the individual strings.

                                      However, Stuart has found your real problem with performance, which is "less than optimal" [a KIND description] code design and technique.
                                      Michael Mattias
                                      Tal Systems (retired)
                                      Port Washington WI USA
                                      [email protected]
                                      http://www.talsystems.com



                                      • #20
                                        Decided to have a nightcap and a play with it.

                                        This one takes about 10 seconds to build the numeric array for a 450,000 line file, after the few seconds needed to create the initial string array.

                                        Code:
                                        #COMPILE EXE
                                        #DIM ALL
                                        
                                        FUNCTION PBMAIN() AS LONG
                                        LOCAL strBuf AS STRING
                                        LOCAL sArr() AS STRING
                                        LOCAL lngRecs,x  AS LONG
                                        LOCAL s AS STRING
                                        LOCAL t AS DOUBLE
                                        'Open for binary and parse
                                        t = TIMER
                                        OPEN "alldata.csv" FOR BINARY AS #1
                                        GET$ #1, LOF(1), strBuf
                                        CLOSE #1
                                        lngRecs = PARSECOUNT(strBuf,CHR$(10))
                                        t = TIMER - t
                                        s = "Records: " & STR$(lngRecs) & " loaded into memory in " & FORMAT$(t, "0.000") & "seconds"
                                        t = TIMER
                                        DIM sArr(1 TO lngRecs)
                                        PARSE strBuf, sArr(), CHR$(10)
                                        t = TIMER - t
                                        s += $LF & STR$(UBOUND(sArr())) & " array elements filled in " & FORMAT$(t, "0.000") & "seconds"
                                        ? s
                                        
                                        strBuf = ""  'free the memory!!!
                                        
                                        DIM dArr(1 TO lngRecs -1,7) AS DOUBLE
                                        LOCAL flds AS LONG
                                        t = TIMER
                                        flds = PARSECOUNT(sArr(2))
                                        DIM tmpArr(1 TO flds) AS STRING
                                        FOR x = 1 TO lngRecs -1
                                           PARSE sArr(x+1), tmpArr(),","  'Skip first header row
                                           dArr(x,0) = VAL(tmpArr(1))
                                           dArr(x,1) = VAL(tmpArr(5))
                                           dArr(x,2) = VAL(tmpArr(8))
                                           dArr(x,3) = VAL(tmpArr(10))
                                           dArr(x,4) = VAL(tmpArr(11))
                                           dArr(x,5) = VAL(tmpArr(28))
                                           dArr(x,6) = VAL(tmpArr(33))
                                           dArr(x,7) = VAL(tmpArr(39))
                                        NEXT
                                        ERASE sArr() ' free up that memory too!
                                        t = TIMER - t
                                        ? "Parse numerics (" & STR$(x) & ") rows in " & FORMAT$( t,"0.000") & " seconds
                                        ? STR$(dArr(1000,4))
                                        END FUNCTION

