Announcement

Collapse
No announcement yet.

File Processing Speed - shootout results ...

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • File Processing Speed - shootout results ...

    This my way of giving back to all of you that posted.
    besides, its allways fun to see what shakes out in a head to head

    Input File = 5MB of 81 char lines derived from data available at:
    at: ftp.cme.com/pub/time (July 01 files)

    5 runs were made for each technique, the results averaged.
    The exact same number of Processing statements were used for each technique.
    If the technique put the line into an array, then a TempStr was substituted
    and then used by the MID$ statements instead of the array to avoid a second loop.

    Testing done on a 400mhz win98SE box - your mileage may vary


    There were two basic methods suggested by you guys:

    Read the whole file into a string in memory,
    three different techniques:
    1. work it with regular strings then Use MID$s
    2. work it with pointers then Use MID$s
    3. work it with a UDT (DIM AT)

    Read the file line by line,
    two techniques:
    4. read one line at a time into a String then Use MID$s
    5. read one line at a time into a UDT.

    My first attempt was Method 4.
    LineLength = 81
    GET$ #100, LineLength, Buf
    SymbStr = MID$(Buf, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    This took an average of 5.53 seconds


    Next I learned that you can input to a UDT! Method 5
    TYPE CMETimeAndSales
    SymbStr AS STRING*3 ' 1 - 3
    etc
    END TYPE
    GLOBAL TnSLine AS CMETimeAndSales

    GET #100,,TnSLine
    SymbStr = TnSLine.SymbStr
    ' etc (some function calls like DateToJulian)
    This took an average of 5.42 secs


    The I decided to try loading the whole file into memory and PARSE$ out a line at a time.
    Method 1.
    OPEN Filename FOR BINARY AS #1 LEN = 32768
    GET$ #1, LOF(1), Buf
    CLOSE #1

    After Studying Scots code I used
    REPLACE $LF WITH "|" IN a$ ' not sure why ?
    N = TALLY(A$, "|") ' Count line feeds.
    FOR L = 1 TO N ' Get N lines in total
    TempStr = PARSE$(A$, "|", L) ' Extract line from M to K-1. Length = K - M.
    SymbStr = MID$(TempStr, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    It could not get throught the 5Mb file even in 5mins. It took 45secs for a 500k file.
    (I expect the time taken to rise exponentially with file size cos PARSE$ starts
    at the beginning each time)


    Then there is the INSTR() method 1 technique:
    Erik:
    K = 0
    N = TALLY(A$, CHR$(10)) ' Count line feeds.

    FOR L = 1 TO N ' Get N lines in total
    M = K + 1 ' Starting point of search for each successive line
    K = INSTR(M, A$, CHR$(10)) ' Find next line feed
    TempStr = MID$(A$, M, K - M) ' Extract line from M to K-1. Length = K - M.
    SymbStr = MID$(TempStr, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    NEXT
    This took an average of 4.52 secs

    Then you guys posted some clever variations of pointer techniques Method 2.
    Borje:
    p1 = 1
    Letter = STRPTR(Buf) 'point Letter to beginning of string
    FOR I = 1 TO LOF(1)
    IF @Letter = 10 THEN 'Letter's value = 10, $LF
    TempStr = MID$(Buf, p1, I - p1) 'pick out line to array element
    SymbStr = MID$(TempStr, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    p1 = I + 1 'store position + 1 for the LF we skip
    END IF
    INCR Letter 'set byte pointer to next char
    This took an average of 4.45 secs


    jc: (very similar to above)
    Length = -1 ' for first line
    Start = 1 ' for first line
    bPtr = STRPTR(S1) ' S1 = Buf
    DO WHILE NOT @bPtr = 0
    IF @bPtr = 10 THEN
    TempStr = MID$(S1, Start, Length)
    Start = Start + Length + 2
    Length = 0
    INCR bPtr
    SymbStr = MID$(TempStr, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    ELSE
    INCR Length
    END IF
    INCR bPtr
    LOOP
    I had to modify this a little. Length must= -1 on entry into the loop other wise
    it is 1 char out on lines 2-n. I also removed the second IF statement inside the loop.
    This took an average of 4.45 secs (there was a wider variance - 4.28 to 4.56secs)

    Semen:
    p = STRPTR(Buf): pp = p + LEN(Buf): @pp = 10
    n = 0 ' number of lines
    WHILE p < pp AND Done = 0
    b = p: WHILE @b <> 10: INCR b: WEND: @b = 0
    SymbStr = MID$(@p, 1, 3)
    ' etc (7 more MID$ and function calls like DateToJulian)
    p = b + 1
    INCR n
    WEND
    This took an average of 5.51 secs (there was a wider variance - 5.33 to 5.66secs)

    Finally there is the Ultra clever and downright sneaky cool Method 3:
    This language amazes me. Whoever thought this up is a genius.
    Overlay a UDT format on a string in memory and just read out the elements - brilliant!
    John:
    TYPE CMETimeAndSales
    SymbStr AS STRING*3 ' 1 - 3
    ' etc
    END TYPE
    GLOBAL TnSLine AS CMETimeAndSales, TnsFile() AS CMETimeAndSales
    '
    FileLength = LOF(100)
    LineLength = LEN(TnSLine) ' Length of a line
    LinesInFile = FileLength / LineLength
    GET$ 100, FileLength, Buf ' Read whole file into a string
    CLOSE #100
    Start = STRPTR(Buf)
    REDIM TnSFile(LinesInFile-1) at Start ' Got to use REDIM cos its called many times
    FOR n = 0 TO LinesInFile-1
    SymbStr = TnSFile(n).SymbStr ' SP
    ' etc (Some more function calls like DateToJulian)
    This took an average of 4.1 secs. A clear winner with the added advantage that all the data
    is avaiable in the array.


    Conclusion:
    The only problem with Method 3 is that all files must contain the same line length.
    Unfortunatly it turns out that there at least two different line lengths I need to work with
    so I am forced to use a pointer Method 2 style
    This will at least ensure that if they change the line length in the future
    (but leave the position of the elements the same) , my prog will still work.


    Final Note:
    If the purpose is to load the file into an array, and subsequently process that array,
    then the poiner methods would need an additional loop to go through the array. The UDT
    method would not.

    If any of you would like all the code I can e-mail it no prob.

    Thank you all soooooooo much for posting and not just sitting back and watching.


    ------------------
    Kind Regards
    Mike



    [This message has been edited by Mike Trader (edited July 17, 2001).]

  • #2
    Just a small note: since I guess you don't really need the whole line
    at once, you can save some time by skipping that part and extract the
    parts directly. In my example, instead of doing:
    Code:
      TempStr = MID$(Buf, p1, I - p1) 'pick out line to array element
      SymbStr = MID$(TempStr, 1, 3)
    .. skip TempStr and go directly to:
    Code:
      SymbStr  = MID$(Buf, p1, 3)
      SymbStr2 = MID$(Buf, p1 + 3, 3) 'whatever length is..

    ------------------

    Comment


    • #3
      Mike,

      I dug up the algo I posted a long time ago, tweaked it slightly to do what
      you are after and tested it. It tests on my PIII at about half a second. It
      reads the disk file, optionally replaces a ascii 10 with 13,10 and then
      reads the file into an array.

      The algo uses a numer of intrinsic basic functions LEFT$/REPLACE/TALLY which
      are plenty fast enough where they are used. Be careful about modifying the code
      in it as it is a bit spring loaded but it should work as an object, just plug
      it in and run it.

      Hope its useful to you.

      Regards,
      [email protected]

      In your header file,
      Code:
          GLOBAL txt$()
          DECLARE FUNCTION LoadArray(fname$) as LONG
      The test code to write a test file and to read it into an array.
      Code:
                Case  50
                  tc& = GetTickCount()
                  rv& = LoadArray("testfile.txt")
                  MessageBox hWnd,ByCopy str$(GetTickCount - tc&),ByCopy str$(rv&)+" lines",%MB_OK
                  MessageBox hWnd,ByCopy txt$(19288),"Test result",%MB_OK
                  erase txt$()
        
                Case  51
                  cntr& = 0
                  Open "testfile.txt" for Output as #1
                    Do
                      Print #1, "1234567890123456789012345678901234567890"+_
                                "1234567890123456789012345678901234567890"
                      ! inc cntr&
                    Loop while cntr& < 100000
                  Close #1
      The algorithm,
      Code:
        '##########################################################################
        
        FUNCTION LoadArray(fname$) as LONG
        
            #REGISTER NONE
        
            LOCAL ln      as LONG
            LOCAL lfc     as LONG   ' line feed count
            LOCAL src     as LONG
            LOCAL dst     as LONG
            LOCAL cntr    as LONG
            LOCAL max_str as LONG
            LOCAL rf      as LONG
            LOCAL reg     as LONG
            LOCAL a$
            LOCAL lf$
        
            lf$ = chr$(13,10)
            max_str = 1024              ' make as big as you like
        
            Open fname$ for binary as #1
              ln = lof(1)
              Get$ #1, ln, a$
            Close #1
        
          ' ----------------------------------------------
          ' test if the string is ascii 10 delimited only
          ' ----------------------------------------------
            If instr(left$(a$,1024),chr$(13)) = 0 Then
              Replace chr$(10) with chr$(13,10) in a$
            End If
        
          ' ------------------------------
          ' append a LF at end if missing
          ' ------------------------------
            If right$(a$,2) <> lf$ Then
              a$ = a$ + lf$               ' append a CRLF pair
            End If
        
            src = StrPtr(a$)            ' string address
        
            b$ = space$(max_str)
            dst = StrPtr(b$)
        
            lfc = tally(a$,lf$)
        
            redim txt$(lfc)   ' make string array with correct number of members
            
            ! mov esi, src
            ! mov edi, dst
            ! mov ebx, esi
            ! add ebx, ln
            ! mov cntr, 0
            ! mov rf, 0
        
          liSt:
            ! inc rf                    ; use to get length each string
            ! mov al, [esi]
            ! inc esi
            ! cmp al, 13                ; if its a CR
            ! je liLoad                 ; load b$ into string array
            ! mov [edi], al
            ! inc edi
            ! cmp esi, ebx
            ! jne liSt
        
            ! jmp liOut
        
          liLoad:
            ! inc esi                   ; don't bother to read LF
            ! push ebx
            txt$(cntr) = left$(b$,rf)   ' load b$ into array
            ! inc cntr                  ; increment array index
            ! mov rf, 0                 ; zero string length counter
            ! pop ebx
            ! mov edi, dst              ; reset address of b$ to beginning
            ! cmp esi, ebx              ; make sure its not the last byte
            ! jne liSt                  ; go back to start
        
          liOut:
        
            FUNCTION = lfc
        
        END FUNCTION
        
        ' #########################################################################
      ------------------
      hutch at movsd dot com
      The MASM Forum

      www.masm32.com

      Comment


      • #4
        Mike --
        You wrote about my variant
        > This took an average of 5.51 secs (there was a wider variance - 5.33 to 5.66secs)

        %kt in my variant means no. of cycles.
        Because it was to fast I made a test 100 (!) times and on my PC
        (PIII-800, Win2000) it required 12 ms for 0.5 Mb.
        So it's possible to expect 120-130 ms for 5 Mb file.
        Ok, your PC is two times slowly - 250 ms.
        But where other 5 sec ?
        Show your code.




        ------------------
        E-MAIL: [email protected]

        Comment


        • #5
          Borje,

          Very true that would speed it up.
          I left the transfer to TempStr in there to be fair across the board
          most of these functions produce an array with the lines in it
          the step of putting the data into the array is simulated with TempStr.

          I did leave out a second loop for all these functions where you would
          go through the array element by element to pull out the data.
          THis would slow up these methods further, giving the DIM AT method
          a bigger advantage.

          Semen,
          I sent you code. PLs let me know what you think

          Hutch
          Thx for the assembler
          It came in at a very respectable average of 4.22 secs
          If i could figure out how to replace
          ! cmp al, 13 ' IF its a CR
          with
          ! cmp al, 10 ' IF its a LF
          ad still get it to work I could then leave out the REPLACE that you
          do at the start and I think it qould be even faster


          What I am most surprised by is that Eriks TALLY/INSTR function runs so fast
          I suspect that is because you guys at PB wrote a killer TALLY and INSTR function
          in assembler and what it lacks in other areas it makes up for with that.



          ------------------
          Kind Regards
          Mike

          Comment


          • #6
            Mike,

            Of the methods you have available, if you can determine the
            sentence length, then allocating the array at the data's starting
            address will always be the fastest as it does not scan the data
            at all.

            It is easy enough to determine the length IF the sentences are all
            the same length, just use INSTR for chr$(10).

            It seems that the code that you are processing the loaded array with
            is something like 3 times slower that loading the array, I suggest
            that this is the area that you need to address to get the loading
            and processing up to speed.

            The ASM algo will work on variable length sentences and its speed
            is reasonable but if you can set up the loading of the array with a
            known sentence length, you can avoid scanning the string at all which
            will be faster.

            What I don't know is what you are doing with the sentence after it
            is loaded. If you can give us some idea, there may be a way to speed
            it up some as it seems to be the bottleneck at the moment.

            Regards,

            [email protected]

            ------------------
            hutch at movsd dot com
            The MASM Forum

            www.masm32.com

            Comment


            • #7
              For clarity I parse each line based on "|" intead of $CRLF, so when I did a replace of $CRLF it was to ensure th ere were none in there.
              That line of code should have been removed...

              Good point on the timing etc, although I ran about 500 reboot simulations, (still under a 2k file however) and it showed no adverse time requirement on my system to read and parse 5 items per line out...
              So, that would lead me to believe it would all be dependent on what you are doing....
              It sounds as if pointers might still be the fastest method.

              Here, you can't top this for speed to my knowledge, had help wiht this of course, it will decode a 10m file in about 6 seconds if I recall:

              Code:
              St is read from the file as binary, get #f, LOF(f),St
              '
              
              fromEncPos = 0
              p = StrPtr(St)
              pp = StrPtr(St)
              For x = 0 To Len(St)-1
                  'c = ASC(toEnc, x)
                  c = @p[x]
                  i = (i + 1) Mod 256
                  j = (j + s(i)) Mod 256
                  Swap s(i), s(j)
                  t = (s(i) + s(j)) Mod 256
                  k = s(t)
              
                  @pp[fromEncPos] = c Xor k
                  Incr fromEncPos
              Next
              ------------------
              Scott
              Scott Turchin
              MCSE, MCP+I
              http://www.tngbbs.com
              ----------------------
              True Karate-do is this: that in daily life, one's mind and body be trained and developed in a spirit of humility; and that in critical times, one be devoted utterly to the cause of justice. -Gichin Funakoshi

              Comment


              • #8
                Hey Scott, wouldn't this be a tad quicker:
                Code:
                 
                Dim p  As Byte Ptr
                Dim pp As Byte Ptr
                 
                fromEncPos = 0
                p = StrPtr(St)
                pp = StrPtr(St)
                For x = 0 To Len(St)-1
                    'c = ASC(toEnc, x)
                    'c = @p[x]
                    c = @p
                    !inc p
                    i = (i + 1) Mod 256
                    j = (j + s(i)) Mod 256
                    Swap s(i), s(j)
                    t = (s(i) + s(j)) Mod 256
                    k = s(t)
                    '@pp[fromEncPos] = c Xor k
                    @pp = (c Xor k)
                    !inc pp
                       'Incr fromEncPos
                Next
                [This message has been edited by Ron Pierce (edited July 19, 2001).]

                Comment


                • #9
                  Mike,

                  Thanks for your testing of the various methods. I am as surprised
                  as you are by the speed of INSTR. The code behind this is indeed
                  extremely effective. Is it using a pointer like method internally
                  or what? The efficiency and effectiveness of the PowerBasic compiler
                  build-in functions are most fascinating and very useful.
                  Best wishes with your project.

                  Erik



                  ------------------

                  Comment

                  Working...
                  X