Reading a whole text file... market data

  • Reading a whole text file... market data

    I'm opening a data file and reading it "old school" - and it's plenty fast.

    OPEN dpath$+dfile$ FOR INPUT AS #1
    i=0: irecs&=0:irecoff&=0
    WHILE ISFALSE EOF(1)
    Line Input #1, a$
    GoSub parser1
    WEND
    ...


    However... it seems like I read somewhere that the whole file could be read at once.
    (If so, what's a good technique for parsing the data?)

    Though they are rather large, they look like this...
    (and thanks for any ideas)
    File:$SPX.txt

    "Date","Time","Open","High","Low","Close","Volume","Open Interest"
    06/11/2009,1600,949.65,949.98,943.75,944.89,0,-999999
    06/12/2009,1030,943.44,943.44,935.66,939.03,0,-999999
    06/12/2009,1130,938.76,943.24,938.46,940.33,0,-999999
    06/12/2009,1230,940.3,941.06,938.98,939.99,0,-999999
    06/12/2009,1330,940.0,941.42,938.88,940.41,0,-999999
    06/12/2009,1430,940.42,942.15,940.01,940.53,0,-999999
    06/12/2009,1530,940.93,946.3,940.93,945.33,0,-999999
    06/12/2009,1600,945.77,946.21,941.58,946.21,0,-999999
    06/15/2009,1030,942.45,942.45,927.9,929.51,0,-999999
    06/15/2009,1130,929.25,929.25,921.31,922.03,0,-999999
    06/15/2009,1230,922.11,924.34,921.2,921.39,0,-999999
    06/15/2009,1330,921.38,923.7,920.84,921.02,0,-999999
    06/15/2009,1430,920.97,923.28,919.65,922.97,0,-999999
    06/15/2009,1530,922.72,923.46,919.77,922.67,0,-999999
    06/15/2009,1600,922.91,925.28,922.47,923.72,0,-999999
    06/16/2009,1030,925.6,927.98,923.64,925.32,0,-999999
    06/16/2009,1130,925.13,928.0,924.59,925.85,0,-999999
    06/16/2009,1230,925.84,925.99,921.33,922.16,0,-999999
    06/16/2009,1330,922.1,922.1,913.26,914.95,0,-999999
    06/16/2009,1430,915.02,916.3,911.8,912.65,0,-999999
    06/16/2009,1530,912.13,916.89,911.6,915.11,0,-999999
    06/16/2009,1600,915.31,917.88,911.74,911.97,0,-999999
    06/17/2009,1030,911.89,913.35,907.02,908.18,0,-999999
    06/17/2009,1130,908.06,911.11,903.78,910.2,0,-999999
    06/17/2009,1230,909.96,915.49,908.37,914.15,0,-999999
    06/17/2009,1330,914.15,914.53,911.54,914.53,0,-999999
    06/17/2009,1430,914.73,918.34,914.23,918.34,0,-999999
    06/17/2009,1530,918.44,918.44,911.41,911.41,0,-999999
    06/17/2009,1600,911.16,913.63,909.64,910.71,0,-999999

  • #2
    I use ADO and a schema.ini
    Superfast!

    http://msdn.microsoft.com/en-us/library/ms709353.aspx
    hellobasic


    • #3
      Here is the schema.ini (place in same folder)
      rename the datafile to thedoc.txt or adapt the schema.ini and SQL
      Code:
      [thedoc.txt]
      ColNameHeader=True
      CharacterSet=1252
      DecimalSymbol=.
      Format=Delimited(,)
      DateTimeFormat=mm/dd/yyyy hh:nn:ss
      Col1=Date DATETIME
      Col2=TIME LONG
      Col3=OPEN DOUBLE
      Col4=HIGH DOUBLE
      Col5=LOW DOUBLE
      Col6=CLOSE DOUBLE
      Col7=VOLUME LONG
      Col8="OPEN INTEREST" LONG
      SQL:
      Code:
      SELECT * 
      FROM [thedoc#txt]
      Connectionstring for ADO:
      Code:
      Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\folderwithdatafiles;Extended Properties=TEXT;Persist Security Info=False
      hellobasic


      • #4
        .. and if using ADO, this demo puts all the column data into a two-dimensional (col, row) string array... (row 0 contains the column names)...

        Generic 'ADO' Connection and Query Tester (CC 5+/Win 9+) 11-02-08
        Michael Mattias
        Tal Systems (retired)
        Port Washington WI USA
        http://www.talsystems.com


        • #5
          This reads it all in and parses it by $CRLF to an array, then you can further parse it using your parser1 GOSUB.
          Code:
          FUNCTION PBMAIN () AS LONG
             LOCAL allFile AS STRING, iRecs, ii AS LONG
             LOCAL high AS DOUBLE
          
          '   OPEN dpath$+dfile$ FOR BINARY SHARED AS #1
             OPEN "C:\stockData.txt" FOR BINARY SHARED AS #1
             GET$ #1, LOF(#1), allFile
             iRecs = PARSECOUNT(allFile, $CRLF)
             DIM arrOfRecs(iRecs) AS STRING
             PARSE allFile, arrOfRecs(), $CRLF
          
             FOR ii = 0 TO iRecs - 1
                GOSUB parser1
             NEXT
             WAITKEY$
          
             EXIT FUNCTION
          
             parser1:
               high = VAL(PARSE$(arrOfRecs(ii), ",", 4))
               ? STR$(high)
             RETURN
          
          END FUNCTION
          Last edited by John Gleason; 19 Jun 2009, 08:34 AM.


          • #6
            You know, after all is said and done, LINE INPUT might be plenty good enough. (It ain't broke, don't fix it?)

            But FWIW you can make that a lot more efficient simply by setting up a bigger buffer with the use of the LEN= clause on the OPEN.
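
            A minimal sketch of that suggestion, assuming the file name from post #5 and an arbitrary 32K buffer (the size is an illustrative assumption, not a measured optimum):
            Code:
            #COMPILE EXE
            #DIM ALL

            FUNCTION PBMAIN () AS LONG
               LOCAL a AS STRING, iRecs AS LONG

               ' LEN = sets the sequential-file buffer size in bytes (32K is an assumption)
               OPEN "C:\stockData.txt" FOR INPUT AS #1 LEN = 32768
               WHILE ISFALSE EOF(1)
                  LINE INPUT #1, a
                  INCR iRecs            ' per-line parsing of a would go here
               WEND
               CLOSE #1
               ? "records read:" & STR$(iRecs)
               WAITKEY$
            END FUNCTION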

            MCM
            Michael Mattias


            • #7
              I wondered about that too because I hadn't checked the speed of the two PB techniques: whole file vs. record read. So I checked it below and was surprised that the whole-file read was ~7x faster. I tried the LEN optimization but didn't see a speedup. I tried 1, 4, 16, and 32K. My tests used PBCC5 and a 30MB test file.
              Code:
              #COMPILE EXE
              #DIM ALL
              
              FUNCTION PBMAIN () AS LONG
                 LOCAL allFile AS STRING, iRecs, ii AS LONG
                 LOCAL high, t AS DOUBLE
              
              '   OPEN dpath$+dfile$ FOR BINARY SHARED AS #1
                 OPEN "C:\dataFil07.dat" FOR BINARY SHARED AS #1
                 OPEN "C:\dataFil07.dat" FOR INPUT SHARED AS #2'len = &h00400 'I didn't see speed increases with larger rec sizes.
                 t = TIMER
                 DO
                    LINE INPUT #2, allFile
                    GOSUB parser2
                 LOOP UNTIL EOF(#2)
                  ? "done individual rec read: " & STR$(TIMER - t)
                      
                 t = TIMER
                 GET$ #1, LOF(#1), allFile
                 iRecs = PARSECOUNT(allFile, $CRLF)
                 DIM arrOfRecs(iRecs) AS STRING
                 PARSE allFile, arrOfRecs(), $CRLF
                                 
                 FOR ii = 0 TO iRecs - 1
                    GOSUB parser1
                 NEXT
                  ? "done whole file read: " & STR$(TIMER - t)
                 WAITKEY$
                 EXIT FUNCTION
              
                 parser1:
                   high = VAL(PARSE$(arrOfRecs(ii), ",", 4))
                 RETURN
              
                 parser2:
                   high = VAL(PARSE$(allFile, ",", 4))
                 RETURN
              
              END FUNCTION


              • #8
                Code:
                 t = TIMER
                   DO
                      LINE INPUT #2, allFile
                      GOSUB parser2
                   LOOP UNTIL EOF(#2)
                    ? "done individual rec read: " & STR$(TIMER - t)
                This is really not a fair 'timing' of the disk access of any sort.... how much of the time was spent in 'Parser2?'

                Not to mention, holding two handles open on a SHARED file introduces some system overhead.

                See PROFILE and #TOOLS ON in the help file; then structure the program to isolate the disk access into its own procedures. Run again.

                Then, look at your results, subtract out the parsing time and decide if perhaps you haven't already spent enough time trying to optimize a section of code which is but a small percentage of the total job.
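
                One way that isolation might be sketched, with each read in its own procedure and no parsing inside the timed region. The function names TimeLineRead and TimeWholeRead are hypothetical, and the file name is the one from post #7. Operating-system file caching can favor whichever read runs second, so the order is worth swapping between runs.
                Code:
                #COMPILE EXE
                #DIM ALL

                ' Times only the record-by-record read (hypothetical helper)
                FUNCTION TimeLineRead (BYVAL fName AS STRING) AS DOUBLE
                   LOCAL a AS STRING, t AS DOUBLE, hFile AS LONG
                   hFile = FREEFILE
                   OPEN fName FOR INPUT AS #hFile
                   t = TIMER
                   WHILE ISFALSE EOF(hFile)
                      LINE INPUT #hFile, a
                   WEND
                   FUNCTION = TIMER - t
                   CLOSE #hFile
                END FUNCTION

                ' Times only the single whole-file read (hypothetical helper)
                FUNCTION TimeWholeRead (BYVAL fName AS STRING) AS DOUBLE
                   LOCAL allFile AS STRING, t AS DOUBLE, hFile AS LONG
                   hFile = FREEFILE
                   OPEN fName FOR BINARY AS #hFile
                   t = TIMER
                   GET$ #hFile, LOF(hFile), allFile
                   FUNCTION = TIMER - t
                   CLOSE #hFile
                END FUNCTION

                FUNCTION PBMAIN () AS LONG
                   ? "line read: " & STR$(TimeLineRead("C:\dataFil07.dat"))
                   ? "whole read:" & STR$(TimeWholeRead("C:\dataFil07.dat"))
                   WAITKEY$
                END FUNCTION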

                MCM
                Michael Mattias


                • #9
                  I commented out the GOSUB code and closed and reopened the file rather than having it SHARED, and the difference increased to over 10x. The below results were typical on my machine:
                  Code:
                  done individual rec read:  42.7899999999965
                  done whole file read:  3.57000000000291
                  
                  done individual rec read:  43.2799999999965
                  done whole file read:  3.94999999999767
                  
                  done individual rec read:  40.8600000000012 'this timing was SHARED
                  done whole file read:  3.52000000000058     'this timing was SHARED


                  • #10
                    Now - we're cooking with gas... Thanks all!

                    I am downloading history from this site..
                    You can get 50 to 80 years at once (daily data).

                    Though this method drops in as a csv file, another one just gives me ".txt"
                    (And it's free)

                    http://finance.yahoo.com/q/hp?s=%5EGSPC

                    http://finance.yahoo.com

                    Yabba Dabba Doooo!!!


                    My "parser1" routine simply takes individual lines and places Date, time, open, high, low, and close into an array.
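
                    A routine of that kind might look roughly like the sketch below. This is not the actual parser1; the array names and the single hard-coded sample line (taken from the data in the first post) are assumptions for illustration only.
                    Code:
                    #COMPILE EXE
                    #DIM ALL

                    FUNCTION PBMAIN () AS LONG
                       LOCAL a AS STRING, n AS LONG
                       ' Hypothetical arrays, sized for one record to keep the sketch short
                       DIM TheDate(0)  AS STRING
                       DIM TheTime(0)  AS LONG
                       DIM TheOpen(0)  AS DOUBLE
                       DIM TheHigh(0)  AS DOUBLE
                       DIM TheLow(0)   AS DOUBLE
                       DIM TheClose(0) AS DOUBLE

                       a = "06/11/2009,1600,949.65,949.98,943.75,944.89,0,-999999"
                       n = 0
                       ' Fields are numbered from 1: Date, Time, Open, High, Low, Close
                       TheDate(n)  = PARSE$(a, ",", 1)
                       TheTime(n)  = VAL(PARSE$(a, ",", 2))
                       TheOpen(n)  = VAL(PARSE$(a, ",", 3))
                       TheHigh(n)  = VAL(PARSE$(a, ",", 4))
                       TheLow(n)   = VAL(PARSE$(a, ",", 5))
                       TheClose(n) = VAL(PARSE$(a, ",", 6))

                       ? TheDate(n) & STR$(TheTime(n)) & STR$(TheClose(n))
                       WAITKEY$
                    END FUNCTION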
                    Last edited by Doug Ingram; 19 Jun 2009, 04:18 PM.


                    • #11
                      Nice technique

                      John,

                      Your research on whole-file reading speed was simply amazing!
                      I never thought that the speed difference would be that big!
                      I have been using binary reads more and more, and I noticed they were faster and also simpler to program, but I never would have thought that technique was that much faster than reading the "old way".
                      Congratulations on your work, as you must have invested a few hours to test it.
                      I will be putting that knowledge to good use.
                      Old QB45 Programmer


                      • #12
                        Originally posted by Michael Mattias
                        Not to mention, holding two handles open on a SHARED file introduces some system overhead.

                        That's a huge under-statement. When a sequential file is shared, there is absolutely no buffering possible. Every line must be re-read from disk, every time, because it may have been altered a few nanoseconds ago. Try it and see...

                        Bob Zale
                        PowerBASIC Inc.


                        • #13
                          Reading the file in a swoop

                          Doug, from the help file for the FILESCAN statement:

                          Example
                          OPEN "datafile.dat" FOR INPUT AS #1
                          FILESCAN #1, RECORDS TO count&
                          DIM TheData(1 TO count&) AS STRING
                          LINE INPUT #1, TheData() TO count&
                          CLOSE #1

                          I use it a lot and it works fine.
                          Francisco J Castanedo
                          Software Developer
                          Distribuidora 3HP, C.A.
                          http://www.distribuidora3hp.com


                          • #14
                            >Your research on whole file reading speed was simply amazing !
                            >I never thought that the speed difference would be that big !

                            You can also do memory-mapping; that offers the possibility of accessing the file data without the need to load it to your memory space (eg., into a PB array).

                            For delimited records (a standard 'sequential text' file uses CRLF as a record delimiter) there is no real good reason to do this, but for fixed-record length files, you can use something like:

                            Memory Mapped Files instead of RANDOM disk file access 5-8-04

                            Very, very fast; good thing to do with big files (whatever 'big' means).

                            Note that 'as written,' that demo will not handle files larger than the available system space for MMF objects, which has been in my experience somewhere between 400 and 600 Mb.

                            Or, for sequential files you will be processing in a truly sequential fashion, there's also this demo:

                            Memory-Mapped File Version of LINE INPUT

                            Just a few little somethings to keep in mind....


                            MCM
                            Michael Mattias


                            • #15
                              I'm sad to report I made a mistake in my timing code, but happy to report that it works fine using individual record reads! I didn't know a SHARED file disabled sequential file buffering. I said I was surprised at the difference in speed of the two, but I think stunned was my actual state. I didn't understand why changing the record length had no effect. I wondered if there was a problem somewhere, but couldn't find one. Well, problem solved!

                              Check out the new times. The record read is faster. I didn't time Francisco's technique, but it looks fast too.

                              Code:
                              done individual rec read:  2.84999999999767
                              done whole file read:  3.56999999999913
                              
                              done individual rec read:  3.20099502459925E-12
                              done whole file read:  3.20099502459925E-12
                              Code:
                              #COMPILE EXE
                              #DIM ALL
                              
                              FUNCTION PBMAIN () AS LONG
                                 LOCAL allFile AS STRING, iRecs, ii AS LONG
                                 LOCAL high, t AS DOUBLE
                              
                                 OPEN "C:\BinFil08.dat" FOR INPUT AS #2 LEN = &h00800 'I DID see speed increases with larger rec sizes.
                                 t = TIMER
                                 DO
                                    LINE INPUT #2, allFile
                                 LOOP UNTIL EOF(#2)
                                 ? "done individual rec read: " & STR$(TIMER - t)
                                 CLOSE
                              
                                 OPEN "C:\BinFil08.dat" FOR BINARY AS #1
                                 t = TIMER
                                 GET$ #1, LOF(#1), allFile
                                 iRecs = PARSECOUNT(allFile, $CRLF)
                                 DIM arrOfRecs(iRecs) AS STRING
                                 PARSE allFile, arrOfRecs(), $CRLF
                              
                                  ? "done whole file read: " & STR$(TIMER - t)
                                 WAITKEY$ 
                              END FUNCTION
                              Last edited by John Gleason; 20 Jun 2009, 11:16 AM.


                              • #16
                                >LEN = &h00800 'I DID see speed increases

                                &h800 is only 2K, relatively not a heck of a lot more than the 128 byte default.

                                Don't be such a wimp; you have 2 Gb user memory to play with!

                                At least go one full page size (usually 64K, use GetSystemInfo() function to get page size on target system).

                                Unless I'm mistaken, you will be allocating a full page anyway, so you may as well put all of it to work.
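
                                A hedged sketch of querying those values, assuming the WIN32API.INC include file that ships with PowerBASIC. SYSTEM_INFO reports two different numbers: dwPageSize (4K on typical x86 Windows) and dwAllocationGranularity (typically 64K), so the 64K figure above corresponds to the latter.
                                Code:
                                #COMPILE EXE
                                #DIM ALL
                                #INCLUDE "WIN32API.INC"

                                FUNCTION PBMAIN () AS LONG
                                   LOCAL si AS SYSTEM_INFO

                                   GetSystemInfo si      ' fills the structure for the current machine
                                   ? "page size:             " & STR$(si.dwPageSize)
                                   ? "allocation granularity:" & STR$(si.dwAllocationGranularity)
                                   WAITKEY$
                                END FUNCTION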
                                Last edited by Michael Mattias; 20 Jun 2009, 11:24 AM.
                                Michael Mattias


                                • #17
                                  Doug
                                  I don't understand why you are using LINE INPUT and parsing. Old-school BASIC has always separated input fields with commas or CRLF combinations, so why bring these 8 variables per line into a string and then parse them? It is far more efficient to let BASIC do the parsing automatically as it reads the file. Here is a small code example, assuming there are only 3 variables per line. Of course, this could also be done with a single UDT array. As for the disk: yes, increase the buffer size.
                                  Code:
                                  FUNCTION PBMAIN () AS LONG
                                      LOCAL x AS LONG
                                      LOCAL a AS STRING
                                      LOCAL TheDate() AS STRING * 10
                                      LOCAL TheTime() AS LONG
                                      LOCAL TheHigh() AS SINGLE
                                      REDIM TheDate(1000000)    ' a number larger than the largest number
                                      'of records you will ever get (no thousands separators: commas would declare extra dimensions)
                                      REDIM TheTime(1000000)
                                      REDIM TheHigh(1000000)
                                      OPEN dpath$+dfile$ FOR INPUT AS #1
                                      'after opening the file if there is a heading line then
                                      LINE INPUT #1, a
                                      WHILE ISFALSE EOF(1)
                                          INPUT #1, TheDate(x), TheTime(x), TheHigh(x)    'etc too lazy to name all the fields
                                          INCR x
                                      WEND
                                      CLOSE 1
                                      REDIM PRESERVE TheDate(x - 1)
                                      REDIM PRESERVE TheTime(x - 1)
                                      REDIM PRESERVE TheHigh(x - 1)
                                  
                                  END FUNCTION


                                  • #18
                                    Originally posted by Michael Mattias
                                    You can also do memory-mapping; that offers the possibility of accessing the file data without the need to load it to your memory space (eg., into a PB array).

                                    For delimited records (a standard 'sequential text' file uses CRLF as a record delimiter) there is no real good reason to do this, but for fixed-record length files, you can use something like:

                                    Memory Mapped Files instead of RANDOM disk file access 5-8-04

                                    Very, very fast; good thing to do with big files (whatever 'big' means).

                                    Note that 'as written,' that demo will not handle files larger than the available system space for MMF objects, which has been in my experience somewhere between 400 and 600 Mb.

                                    Or, for sequential files you will be processing in a truly sequential fashion, there's also this demo:

                                    Memory-Mapped File Version of LINE INPUT
                                    MCM

                                    Michael
                                    You must stop perpetuating this myth. This is probably the worst occasion I can think of to use an MMF. The O/S uses 4K of the program's (PB) memory space and then only reads the file in 4K increments, no different from setting a 4K buffer. It will of course save the entire file to the swap file, ready for reuse by other applications, but in this case there are no other applications, nor does he wish to read it more than once. Just how slow are you trying to make his program run?


                                    • #19
                                      Which "myth?"

                                      That using MMFs is not faster than using pure sequential or random access?

                                      Well, I guess it must be application-specific, because I have had nothing but success using MMFs.

                                      Minor technical correction required:
                                      >It will of course save the entire file to the swap file ready for reuse by other applications
                                      That's not quite accurate, in that the page files will be used for additional access by the current process as well... which is exactly what happens if you load the file to user memory.

                                      Hey, MMFs are another technique.. use it or not, your choice. Me, I've always enjoyed having options.

                                      MCM
                                      Michael Mattias


                                      • #20
                                        Originally posted by Michael Mattias
                                        Which "myth?"

                                        That using MMFs is not faster than using pure sequential MCM
                                        Correct
