Reading a whole text file... market data

  • Nick Luick
    replied
    06/11/2009,1600,949.65,949.98,943.75,944.89,0,-999999
    06/12/2009,1030,943.44,943.44,935.66,939.03,0,-999999
    06/12/2009,1130,938.76,943.24,938.46,940.33,0,-999999
    06/12/2009,1230,940.3,941.06,938.98,939.99,0,-999999
    06/12/2009,1330,940.0,941.42,938.88,940.41,0,-999999
    06/12/2009,1430,940.42,942.15,940.01,940.53,0,-999999
    One needs to examine the data more carefully. Notice the extra commas inserted, and ask whether it will always be in that format.
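
    As a hedged illustration of that check (the sample record and the six-field minimum are assumptions based on the data above), one could verify the field count before trusting a record:

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        LOCAL rec AS STRING
        LOCAL nFields AS LONG

        rec = "06/12/2009,1430,940.42,942.15,940.01,940.53,0,-999999"
        nFields = PARSECOUNT(rec, ",")            ' 8 for this sample record
        IF nFields >= 6 THEN
            ? "date:  " & PARSE$(rec, ",", 1)
            ? "time:  " & PARSE$(rec, ",", 2)
            ? "close: " & PARSE$(rec, ",", 6)     ' trailing placeholder fields ignored
        ELSE
            ? "unexpected record format: " & rec
        END IF
        WAITKEY$
    END FUNCTION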
    Last edited by Nick Luick; 23 Jun 2009, 03:44 PM.



  • Jeff Blakeney
    replied
    Originally posted by John Petty View Post
    John & Jeff
    Both most interesting and probably useful info, but neither addresses what Doug is trying to do. He actually wants the individual fields. The methods you have both presented require reading the data twice to break it up into strings, either by FILESCAN or by parsing in memory. To get what he wants he then needs to parse each individual string again to get the separate fields (a third read and a second parse), so why not do it all in the first pass as BASIC has always done (thus my code example)?
    Doug has pointed out that there may be different numbers of fields in different reports; the one I looked at only had 7 where his example has 8. Simply do a PARSECOUNT on the heading line first and adjust the number of INPUT fields.
    Actually, for this type of data there should be no difference past the heading line whether the files are CSV or TXT.
    John
    What I was addressing was the fact that John did a test that had LINE INPUT loading the data from disk faster than using GET$ to get the whole thing at once. His test was biased because he did more work with the GET$ in his test program so I gave him another test program to show that LINE INPUT is slower than GET$.

    Once the data is in memory, it will take the same amount of time to parse out the individual fields with either method. Doug was looking for a way to make sure it ran as fast as possible and he can save a bit of time by loading the entire file into memory using GET$ and then parsing the data into his arrays.

    However, in this case, the data he's getting is quite small. I went to the site that Doug posted and grabbed the daily data from 1950 to present and it is only 788 KB. My test showed that just loading the data for a 10.4 MB file took only .28 seconds in the worst case so a 788 KB file would take even less.
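
    A minimal sketch of that approach (the file name follows the other examples in this thread, and the field positions, date first and close sixth, are assumptions based on the sample records above): one GET$ for the whole file, one PARSE into records, then a per-field parse into the arrays.

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        LOCAL allFile AS STRING
        LOCAL iRecs, ii, hFile AS LONG

        hFile = FREEFILE
        OPEN "C:\BinFil08.dat" FOR BINARY AS #hFile
        GET$ #hFile, LOF(#hFile), allFile            ' one disk read for the whole file
        CLOSE #hFile

        iRecs = PARSECOUNT(allFile, $CRLF)
        DIM recs(1 TO iRecs) AS STRING
        PARSE allFile, recs(), $CRLF                 ' split into records in one pass

        DIM dt(1 TO iRecs) AS STRING
        DIM closePrice(1 TO iRecs) AS SINGLE
        FOR ii = 1 TO iRecs                          ' then split each record into its fields
            dt(ii) = PARSE$(recs(ii), ",", 1)
            closePrice(ii) = VAL(PARSE$(recs(ii), ",", 6))
        NEXT ii

        ? "records loaded: " & STR$(iRecs)
        WAITKEY$
    END FUNCTION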



  • John Petty
    replied
    John & Jeff
    Both most interesting and probably useful info, but neither addresses what Doug is trying to do. He actually wants the individual fields. The methods you have both presented require reading the data twice to break it up into strings, either by FILESCAN or by parsing in memory. To get what he wants he then needs to parse each individual string again to get the separate fields (a third read and a second parse), so why not do it all in the first pass as BASIC has always done (thus my code example)?
    Doug has pointed out that there may be different numbers of fields in different reports; the one I looked at only had 7 where his example has 8. Simply do a PARSECOUNT on the heading line first and adjust the number of INPUT fields.
    Actually, for this type of data there should be no difference past the heading line whether the files are CSV or TXT.
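
    A small sketch of that suggestion (the file name, field names and the 8-column layout are assumptions taken from the sample data earlier in the thread): count the heading fields once, then read a matching variable list per record.

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        LOCAL heading, dt, dummy AS STRING
        LOCAL tm, nFields AS LONG
        LOCAL op, hi, lo, cl, x1, x2 AS SINGLE

        OPEN "C:\BinFil08.dat" FOR INPUT AS #1
        LINE INPUT #1, heading                     ' heading line
        nFields = PARSECOUNT(heading, ",")         ' 7 in one report, 8 in another

        WHILE ISFALSE EOF(1)
            IF nFields = 8 THEN                    ' layout matching the sample data
                INPUT #1, dt, tm, op, hi, lo, cl, x1, x2
            ELSE
                LINE INPUT #1, dummy               ' unknown layout: skip (or PARSE$) the line
            END IF
        WEND
        CLOSE #1
    END FUNCTION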
    John



  • Jeff Blakeney
    replied
    Originally posted by John Gleason View Post
    I'm sad to report I made a mistake in my timing code, but happy to report that it works fine using individual record reads!
    I think you still have a problem with your new test.

    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
       LOCAL allFile AS STRING, iRecs, ii AS LONG
       LOCAL high, t AS DOUBLE
    
       OPEN "C:\BinFil08.dat" FOR INPUT AS #2 LEN = &h00800 'I DID see speed increases with larger rec sizes.
       t = TIMER
       DO
          LINE INPUT #2, allFile
       LOOP UNTIL EOF(#2)
       ? "done individual rec read: " & STR$(TIMER - t)
       CLOSE
    
       OPEN "C:\BinFil08.dat" FOR BINARY AS #1
       t = TIMER
       GET$ #1, LOF(#1), allFile
       iRecs = PARSECOUNT(allFile, $CRLF)
       DIM arrOfRecs(iRecs) AS STRING
       PARSE allFile, arrOfRecs(), $CRLF
    
        ? "done whole file read: " & STR$(TIMER - t)
       WAITKEY$ 
    END FUNCTION
    In your first test, all you are doing is reading in each record and not keeping any of those records but the last one in memory. In the second, you are reading in the entire contents of the file but then going on to calculate the number of records, dimension an array to hold the parsed data and then parse the data.

    Not a very fair test if you ask me. Here is another test program that will give the results for reading the entire contents of the file into an array using either LINE INPUT or PARSE, which is a fairer test.

    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    
        LOCAL allFile   AS STRING
        LOCAL iRecs     AS LONG
        LOCAL ii        AS LONG
        LOCAL hFile     AS DWORD
        LOCAL oneRec    AS STRING
        LOCAL t         AS DOUBLE
        LOCAL Pathname  AS STRING
    
        DISPLAY OPENFILE %HWND_DESKTOP, , , "", "", "All files (*.*)|*.*", "", "", %OFN_PATHMUSTEXIST TO Pathname
        IF Pathname = "" THEN
            EXIT FUNCTION
        END IF
    
        hFile = FREEFILE
        OPEN Pathname FOR INPUT AS #hFile LEN = &h00800
            t = TIMER
            FILESCAN #hFile, RECORDS TO iRecs
            DIM arrOfRecs(iRecs) AS STRING
            FOR ii = 1 TO iRecs
                LINE INPUT #hFile, arrOfRecs(ii)
            NEXT ii
            ? "done individual rec read to array: " & STR$(TIMER - t)
        CLOSE #hFile
    
        hFile = FREEFILE
        OPEN Pathname FOR BINARY AS #hFile
            t = TIMER
            GET$ #hFile, LOF(#hFile), allFile
            iRecs = PARSECOUNT(allFile, $CRLF)
            REDIM arrOfRecs(iRecs) AS STRING    ' REDIM: the array was already dimensioned above
            PARSE allFile, arrOfRecs(), $CRLF
            ? "    done whole file read to array: " & STR$(TIMER - t)
        CLOSE #hFile
    
    END FUNCTION
    Here are the results I got when using a 10.4 MB text file of lines of up to 80 characters followed by CRLF.

    Code:
    done individual rec read to array:  .281000000001921
        done whole file read to array:  .156999999999243



  • Edwin Knoppert
    replied
    There are always two things to consider for programming.

    1)
    Make as few memory allocations as possible.
    This isn't about the size of the allocations but about their number.
    REDIM PRESERVE means: make a copy of the existing array in a new memory allocation.

    2)
    Disk reads: it is better to read one large block than many little parts each time.
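
    A minimal sketch of the first point (the array name, chunk size and single sample record are just illustrative): grow the array in large steps so that REDIM PRESERVE, and the copy it implies, happens rarely, then trim once at the end.

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        LOCAL recs() AS STRING
        LOCAL used, capacity AS LONG

        capacity = 10000
        REDIM recs(capacity - 1)                 ' one allocation up front

        ' ... repeat the next block for each record read ...
        IF used > capacity - 1 THEN
            capacity = capacity * 2              ' double instead of growing by one
            REDIM PRESERVE recs(capacity - 1)    ' one copy per doubling, not per record
        END IF
        recs(used) = "06/12/2009,1430,940.42,942.15,940.01,940.53,0,-999999"
        INCR used

        REDIM PRESERVE recs(used - 1)            ' trim once at the end
        ? "records kept: " & STR$(used)
        WAITKEY$
    END FUNCTION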



  • John Petty
    replied
    Originally posted by Michael Mattias View Post
    The space required for any increase or decrease in the number of elements requested subtracts from or adds to what remains available to your process.
    MCM
    Note bolding is mine.
    So you are agreeing with me!! What is your point? Who said it had anything to do with the O/S or compiler? How simple can I make it for you? If I REDIM PRESERVE to the actual array size needed then the unused portion of memory is returned to the program for other uses. That's what your statement says, that's what Doug said and what I have said, so what is your argument?



  • Michael Mattias
    replied
    My understanding is that Redim Preserve will return the wastefully extravagant portion of the original Redim to the usable 2 GB memory space of a program
    Then you suffer a bad case of misunderstanding.

    When using PB arrays, there is no overallocation; the compiler only allocates space for as many elements as requested by the programmer in the DIM or REDIM statement.

    And regardless, all array data allocations always come out of your 2GB user limit. The space required for any increase or decrease in the number of elements requested subtracts from or adds to what remains available to your process.

    If you, the programmer, allocate more elements than you need, that is your own promiscuous behavior, not the compiler's or the operating system's.

    MCM
    Last edited by Michael Mattias; 21 Jun 2009, 11:52 AM.



  • John Petty
    replied
    Michael
    Is this yet another profligate post by you?
    My understanding is that Redim Preserve will return the wastefully extravagant portion of the original Redim to the usable 2 GB memory space of a program.
    You really must learn all the meanings of these fancy words you have started using.
    John
    PS I am still trying to understand what my intellectual creativity has to do with having children.



  • Michael Mattias
    replied
    Other than to clean house and not be wasteful, is there any performance reason to use redim preserve?
    ???

    REDIM PRESERVE is not used to clean house or eliminate resource profligacy, it's used to resize an array without losing the current contents.
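
    For example (the array name and values are just illustrative):

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        DIM prices(9) AS SINGLE
        prices(0) = 944.89
        REDIM PRESERVE prices(19)          ' grow to 20 elements...
        ? "kept:    " & STR$(prices(0))    ' ...element 0 still holds 944.89
        REDIM prices(19)                   ' plain REDIM, by contrast, clears the contents
        ? "cleared: " & STR$(prices(0))    ' now 0
        WAITKEY$
    END FUNCTION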



  • John Petty
    replied
    No, it just cleans up a bit of memory space, that's all. The main reason I would do it is that when the array is used later you can use its UBOUND and not have to worry whether the counter (x in this case) gets changed.
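
    A small self-contained sketch of that point (the array name follows the earlier code example and the values come from the sample data above):

    Code:
    #COMPILE EXE
    #DIM ALL

    FUNCTION PBMAIN () AS LONG
        LOCAL x, ii AS LONG
        DIM TheHigh(999) AS SINGLE

        TheHigh(0) = 949.98 : TheHigh(1) = 943.44 : TheHigh(2) = 943.24
        x = 3                                 ' records actually loaded
        REDIM PRESERVE TheHigh(x - 1)         ' trim to what was used

        x = 0                                 ' even if the counter is reused later...
        FOR ii = 0 TO UBOUND(TheHigh)         ' ...the bound travels with the array
            ? STR$(TheHigh(ii))
        NEXT ii
        WAITKEY$
    END FUNCTION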



  • Doug Ingram
    replied
    Other than to clean house and not be wasteful, is there any performance reason to use redim preserve?



  • Doug Ingram
    replied
    Originally posted by John Petty View Post
    Doug
    I don't understand why you are using LINE INPUT and parsing. Old school BASIC has always separated input fields with commas or CRLF combinations, so why are you bringing these 8 variables per line into a string and then parsing them? It is far more efficient to let BASIC do the parsing automatically as it reads the file. Here is a small code example assuming there are only 3 variables per line. Of course this could also be done with a single UDT array. As for the disk, yes, increase the buffer size.
    John,
    It's just to keep it clean visually.
    Some of the records have a time field and others have additional volume fields - so that's not so much of a speed thing.
    However, I am using the Parse$ function.
    Thanks!
    Doug



  • John Petty
    replied
    Michael
    Some quick quotes from Microsoft's own descriptions of MMFs:
    One advantage to using MMF I/O is that the system performs all data transfers for it in 4K pages of data.
    As I think you pointed out, that's a small input buffer.
    While no gain in performance is observed when using MMFs for simply reading a file into RAM,
    Like sequential files that are only read in their entirety once.
    Since Windows NT is a page-based virtual-memory system, memory-mapped files represent little more than an extension of an existing, internal memory management component.......when a process starts, pages of memory are used to store static and dynamic data for that application. Once committed, these pages are backed by the system pagefile,
    Why would you bother backing memory pages with the pagefile when they are only being read once?
    There is no question MMFs are an important part of the O/S, but not in the way you keep recommending they be used.
    John



  • John Petty
    replied
    Originally posted by Michael Mattias View Post
    Which "myth?"

    That using MMFs is not faster than using pure sequential or random access?
    MCM
    Correct



  • Michael Mattias
    replied
    Which "myth?"

    That using MMFs is not faster than using pure sequential or random access?

    Well, I guess it must be application-specific, because I have had nothing but success using MMFs.

    Minor technical correction required:
    It will of course save the entire file to the swap file ready for reuse by other applications
    That's not quite accurate, in that the page files will be used for additional access by the current process as well... which is exactly what happens if you load the file to user memory.

    Hey, MMFs are another technique.. use it or not, your choice. Me, I've always enjoyed having options.

    MCM



  • John Petty
    replied
    Originally posted by Michael Mattias View Post
    You can also do memory-mapping; that offers the possibility of accessing the file data without the need to load it to your memory space (eg., into a PB array).

    For delimited records (a standard 'sequential text' file uses CRLF as a record delimiter) there is no real good reason to do this, but for fixed-record length files, you can use something like:

    Memory Mapped Files instead of RANDOM disk file access 5-8-04

    Very, very fast; good thing to do with big files (whatever 'big' means).

    Note that 'as written,' that demo will not handle files larger than the available system space for MMF objects, which has been in my experience somewhere between 400 and 600 Mb.

    Or, for sequential files you will be processing in a truly sequential fashion, there's also this demo:

    Memory-Mapped File Version of LINE INPUT
    MCM
    Michael
    You must stop perpetuating this myth. This is probably the worst occasion I can think of to use an MMF. The O/S uses 4K of the program's (PB) memory space and then only reads the file in 4K increments, no different from setting a 4K buffer. It will of course save the entire file to the swap file, ready for reuse by other applications, but in this case there are no other applications, nor does he wish to read it more than once. Just how slow are you trying to make his program run?



  • John Petty
    replied
    Doug
    I don't understand why you are using LINE INPUT and parsing. Old school BASIC has always separated input fields with commas or CRLF combinations, so why are you bringing these 8 variables per line into a string and then parsing them? It is far more efficient to let BASIC do the parsing automatically as it reads the file. Here is a small code example assuming there are only 3 variables per line. Of course this could also be done with a single UDT array. As for the disk, yes, increase the buffer size.
    Code:
    FUNCTION PBMAIN () AS LONG
        LOCAL x AS LONG
        LOCAL a AS STRING
        LOCAL TheDate() AS STRING * 10
        LOCAL TheTime() AS LONG
        LOCAL TheHigh() AS SINGLE
        REDIM TheDate(1000000)      ' a number larger than the largest number
                                    ' of records you will ever get
        REDIM TheTime(1000000)
        REDIM TheHigh(1000000)
        OPEN dpath$+dfile$ FOR INPUT AS #1
        'after opening the file if there is a heading line then
        LINE INPUT #1, a
        WHILE ISFALSE EOF(1)
            INPUT #1, TheDate(x), TheTime(x), TheHigh(x)    'etc too lazy to name all the fields
            INCR x
        WEND
        CLOSE 1
        REDIM PRESERVE TheDate(x - 1)
        REDIM PRESERVE TheTime(x - 1)
        REDIM PRESERVE TheHigh(x - 1)
    
    END FUNCTION



  • Michael Mattias
    replied
    >LEN = &h00800 'I DID see speed increases

    &h800 is only 2K, relatively not a heck of a lot more than the 128-byte default.

    Don't be such a wimp; you have 2 Gb user memory to play with!

    At least go one full page size (usually 64K, use GetSystemInfo() function to get page size on target system).

    Unless I'm mistaken, you will be allocating a full page anyway, so you may as well put all of it to work.
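
    A small sketch of that lookup (the file name follows the other examples in this thread): GetSystemInfo() from WIN32API.INC reports both the page size (typically 4K) and the allocation granularity (typically 64K), and either value can be fed straight into the OPEN statement's LEN clause.

    Code:
    #COMPILE EXE
    #DIM ALL
    #INCLUDE "WIN32API.INC"

    FUNCTION PBMAIN () AS LONG
        LOCAL si    AS SYSTEM_INFO
        LOCAL hFile AS LONG
        LOCAL rec   AS STRING

        GetSystemInfo si
        ? "page size:              " & STR$(si.dwPageSize)                ' typically 4096
        ? "allocation granularity: " & STR$(si.dwAllocationGranularity)   ' typically 65536

        hFile = FREEFILE
        OPEN "C:\BinFil08.dat" FOR INPUT AS #hFile LEN = si.dwAllocationGranularity
        LINE INPUT #hFile, rec                       ' buffered sequential read
        ? "first record: " & rec
        CLOSE #hFile
        WAITKEY$
    END FUNCTION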
    Last edited by Michael Mattias; 20 Jun 2009, 11:24 AM.



  • John Gleason
    replied
    I'm sad to report I made a mistake in my timing code, but happy to report that it works fine using individual record reads! I didn't know a SHARED file disabled sequential file buffering. I said I was surprised at the difference in speed of the two, but I think stunned was my actual state. I didn't understand why changing the record length had no effect. I wondered if there was a problem somewhere, but couldn't find one. Well, problem solved!

    Check out the new times. The record read is faster. I didn't time Francisco's technique, but it looks fast too.

    Code:
    done individual rec read:  2.84999999999767
    done whole file read:  3.56999999999913
    
    done individual rec read:  3.20099502459925E-12
    done whole file read:  3.20099502459925E-12
    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
       LOCAL allFile AS STRING, iRecs, ii AS LONG
       LOCAL high, t AS DOUBLE
    
       OPEN "C:\BinFil08.dat" FOR INPUT AS #2 LEN = &h00800 'I DID see speed increases with larger rec sizes.
       t = TIMER
       DO
          LINE INPUT #2, allFile
       LOOP UNTIL EOF(#2)
       ? "done individual rec read: " & STR$(TIMER - t)
       CLOSE
    
       OPEN "C:\BinFil08.dat" FOR BINARY AS #1
       t = TIMER
       GET$ #1, LOF(#1), allFile
       iRecs = PARSECOUNT(allFile, $CRLF)
       DIM arrOfRecs(iRecs) AS STRING
       PARSE allFile, arrOfRecs(), $CRLF
    
        ? "done whole file read: " & STR$(TIMER - t)
       WAITKEY$ 
    END FUNCTION
    Last edited by John Gleason; 20 Jun 2009, 11:16 AM.



  • Michael Mattias
    replied
    Your research on whole file reading speed was simply amazing!
    I never thought that the speed difference would be that big!
    You can also do memory-mapping; that offers the possibility of accessing the file data without the need to load it to your memory space (eg., into a PB array).

    For delimited records (a standard 'sequential text' file uses CRLF as a record delimiter) there is no real good reason to do this, but for fixed-record length files, you can use something like:

    Memory Mapped Files instead of RANDOM disk file access 5-8-04

    Very, very fast; good thing to do with big files (whatever 'big' means).

    Note that 'as written,' that demo will not handle files larger than the available system space for MMF objects, which has been in my experience somewhere between 400 and 600 Mb.

    Or, for sequential files you will be processing in a truly sequential fashion, there's also this demo:

    Memory-Mapped File Version of LINE INPUT
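
    For anyone who wants to try the memory-mapped route, here is a minimal read-only sketch using the standard Win32 calls from WIN32API.INC (the file name and the 60-byte peek are just illustrative; the linked demos above wrap this far more completely):

    Code:
    #COMPILE EXE
    #DIM ALL
    #INCLUDE "WIN32API.INC"

    FUNCTION PBMAIN () AS LONG
        LOCAL hFile, hMap, pView AS DWORD
        LOCAL firstRec AS STRING

        ' Open the data file read-only
        hFile = CreateFile("C:\BinFil08.dat", %GENERIC_READ, %FILE_SHARE_READ, _
                           BYVAL %NULL, %OPEN_EXISTING, %FILE_ATTRIBUTE_NORMAL, %NULL)
        IF hFile = &HFFFFFFFF??? THEN EXIT FUNCTION        ' INVALID_HANDLE_VALUE

        ' Map the whole file; no explicit read into a PB string is required
        hMap = CreateFileMapping(hFile, BYVAL %NULL, %PAGE_READONLY, 0, 0, BYVAL %NULL)
        IF hMap THEN
            pView = MapViewOfFile(hMap, %FILE_MAP_READ, 0, 0, 0)
            IF pView THEN
                ' The file contents are now addressable at pView
                firstRec = PEEK$(pView, 60)                ' copy only the first few bytes
                ? "starts with: " & EXTRACT$(firstRec, $CRLF)
                UnmapViewOfFile BYVAL pView
            END IF
            CloseHandle hMap
        END IF
        CloseHandle hFile
        WAITKEY$
    END FUNCTION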

    Just a few little somethings to keep in mind....


    MCM

