Announcement

Collapse
No announcement yet.

Text file to string array tokeniser.

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Text file to string array tokeniser.

    I have had this algo for a while but needed it in basic string form rather than as zero terminated strings so I re-tweaked the MASM original by converting it to PB and modified it to take basic string as input and a basic string array as output.

    It is reasonably fast allowing that it does a copy pass after the tokeniser to put the results into a string array. On this 3 gig quad in a single thread I keep getting 94ms for a 17 meg test file with about 130000 lines in it.

    The function is written with a duplicate guard so it avoids duplicate errors.

    Code:
    ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    '                                       Build with PBCC50
    ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    
        %usemacros = 1                      ' exclude added code
      ' ----------------------------
      ' change this to your own path
      ' ----------------------------
        #include "\pbwin90\include\win32api.inc"
    
    ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    
    FUNCTION PBmain as LONG
    
        LOCAL lcnt as DWORD
        LOCAL cntr as DWORD
        LOCAL tc1  AS DWORD
        LOCAL tc2  AS DWORD
    
      ' ------------------------------------------------------------
      ' set your test file here, bigger = better for timing purposes
      ' ------------------------------------------------------------
        Open "lt.txt" for Binary as #1
          Get$ #1, lof(1), inc$
        Close #1
    
        tc1 = GetTickCount
    
        dim txt(0) as STRING                ' allocate an empty string array
        lcnt = tokenise(inc$,txt())         ' call the tokeniser
    
        tc2 = GetTickCount
    
        StdOut str$(tc2 - tc1)+" milliseconds"
    
        StdOut "Do you wish to see the results ? y/n"
      lp1:
        k$ = inkey$
        SleepEx 1,0
        If lcase$(k$) = "n" Then goto quit
        If lcase$(k$) = "y" Then goto showit
        goto lp1
    
      showit:
        cntr = 0
        Do
          StdOut txt(cntr)
          ! add cntr, 1
          ! sub lcnt, 1
        Loop while lcnt <> 0
    
      quit:
        erase txt()
    
    End FUNCTION
    
    ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    
        #IF NOT %DEF(%fn_tokenise)           ' the duplicate guard
        %fn_tokenise = 1
    
    FUNCTION tokenise(src$,arr() as STRING) as DWORD
    
      ' ********************************************************************
      '
      ' DESCRIPTION
      '
      ' Tokenise a text file into a dynamic string array
      '
      ' input arguments are,
      '
      ' 1. a multiline dynamic string delimited with CRLF pair (13 10)
      ' 2. an empty dynamic string array which must be dimensioned first
      '
      ' EXAMPLE : dim MyArray(0) as STRING
      '           lcnt = tokenise(MySrcString$,MyArray())
      '
      ' NOTE : the function left trims lines of text and removes blank lines
      '
      ' ********************************************************************
    
        #REGISTER NONE
    
        LOCAL bcnt as DWORD
        LOCAL lcnt as DWORD
        LOCAL pTxt as DWORD
        LOCAL pArr as DWORD
        LOCAL gArr as DWORD
    
        cpy$ = src$                         ' work on a copy of the source string
        pTxt = StrPtr(cpy$)
    
        ! mov edi, 1                        ' set counter to 1 in case of no trailing CRLF
        ! mov esi, pTxt
        ! sub esi, 1
      ' ------------------------------------------------
      ' count line feeds to calculate pointer array size
      ' ------------------------------------------------
      lbl1:
        ! add esi, 1
        ! movzx edx, BYTE PTR [esi]
        ! test edx, edx                     ' test for terminator
        ! jz lbl2
        ! cmp edx, 10                       ' test for line feed
        ! jne lbl1
        ! add edi, 1                        ' lf count in EDI
        ! jmp lbl1
      lbl2:
      ' --------------------
      ' multiply result by 4
      ' --------------------
      '''   ! lea edi, [0+edi*4]            ' for non PIV hardware
    
        ! add edi, edi                      ' PIV specific
        ! add edi, edi
        ! mov bcnt, edi
    
      ' --------------------------
      ' allocate the pointer array
      ' --------------------------
        gArr = GlobalAlloc(%GMEM_FIXED or %GMEM_ZEROINIT,bcnt)
    
        ! mov edi, gArr                     ' copy allocated memory address into EDI
        ! mov esi, pTxt
        ! xor eax, eax                      ' zero arg counter
        ! sub esi, 1
        ! jmp ftrim
    
      ' ---------------------------------
    
      #align 4
      terminate:
        ! mov BYTE PTR [esi], 0             ' terminate end of current line
    
      ftrim:                                ' scan to find next acceptable character
        ! add esi, 1
        ! movzx edx, BYTE PTR [esi]         ' zero extend byte
        ! test edx, edx                     ' test for zero terminator
        ! jz lout
        ! cmp edx, 32
        ! jbe ftrim                         ' scan again for 32 or less
    
      ' =================
        ! mov [edi], esi                    ' write current location to pointer
        ! add edi, 4                        ' set next pointer location
        ! add eax, 1                        ' increment arg count return value
      ' =================
    
      ttrim:                                ' scan to find the next CR or LF
        ! add esi, 1
        ! movzx edx, BYTE PTR [esi]         ' zero extend byte
        ! cmp edx, 13
        ! jg ttrim                          ' short loop on normal case
    
        ! je terminate
        ! cmp edx, 10                       ' extra test for ascii 10
        ! je terminate
        ! test edx, edx
        ! jnz ttrim                         ' loop back if not zero, IE TAB.
    
      ' ---------------------------------
    
      lout:
        ! mov lcnt, eax
    
        redim arr(lcnt) as STRING               ' redimension the array to the required line count
        dim zarr(lcnt) as ASCIIZ PTR at gArr    ' overlay an ASCIIZ array over gArr
    
        ! mov esi, lcnt
        bcnt = 0                            ' reuse bcnt as the array index
      cpyit:
        arr(bcnt) = @zarr(bcnt)             ' copy zero terminated data to dynamic string array
        ! add bcnt, 1
        ! sub esi, 1
        ! jnz cpyit
        
        erase zarr()                        ' erase overlayed array
        GlobalFree gArr                     ' free allocated global memory
    
        FUNCTION = lcnt                     ' return array member count
    
    END FUNCTION
    
        #ENDIF
    
    ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    hutch at movsd dot com
    The MASM Forum

    www.masm32.com

  • #2
    Thanks, Hutch. I can see real use for that code.

    So, is "tokenizing' basically 'hashing'?

    Comment


    • #3
      Clay,

      When you tokenise a string, you are basically chopping it up into sections where a hash table among other things is used to test duplicates and/or store return values for an input.

      Just for example if you have an include file full of many equates, you can load them into a hash table by splitting up the keyword from the second value.

      Code:
      item = 12345678
      When you look up "item" in the hash table you get as the return "12345678".

      Now once you have loaded the list of words you can do a single pass on a source and replace every source word in the list of equates with its replacement.

      There are other methods of doing this but a hash table is generally faster than a tree for example. Depends a lot on how both are written.

      The crypto guys also use a hashing algo which is usually a lot more compicated to get a very high degree of difference in the result.
      hutch at movsd dot com
      The MASM Forum

      www.masm32.com

      Comment


      • #4
        Thanks Steve!
        Must be faster than Parse function..

        Comment


        • #5
          Aslan,

          For what its worth, I think the PowerBASIC PARSE function is slightly faster from testing it in the same test piece. It tests at 78ms as against 94 on the same 17 meg file.

          It looks like the end copy loop slows it down a bit which makes sense, the original ASCIIZ version was faster because it was simpler. Note that it does a bit more work in its design as well, it left trims the beginning of each string and removes blank lines.
          Last edited by Steve Hutchesson; 11 Nov 2009, 06:19 AM.
          hutch at movsd dot com
          The MASM Forum

          www.masm32.com

          Comment


          • #6
            may be replace these commands:
            add esi,1 --> inc esi
            sub esi,1 --> dec esi

            Comment


            • #7
              Aslan,

              The main time in the algo is the basic array allocation and the copy loop, the mnemonic code is a lot faster. This code block below takes up about 65% of the algo time. Run as a zero terminated algo it hits about 30ms but with any algo that delivers a basic string array it must create the array count of basic strings from the OLE string pool.

              Basic string output makes the algo about 3 times slower but it is still reasonably fast for the work it does.

              Code:
                  redim arr(lcnt) as STRING               ' redimension the array to the required line count
                  dim zarr(lcnt) as ASCIIZ PTR at gArr    ' overlay an ASCIIZ array over gArr
              
                  ! mov esi, lcnt
                  bcnt = 0                            ' reuse bcnt as the array index
                cpyit:
                  arr(bcnt) = @zarr(bcnt)             ' copy zero terminated data to dynamic string array
                  ! add bcnt, 1
                  ! sub esi, 1
                  ! jnz cpyit
              
                  erase zarr()                        ' erase overlayed array
              hutch at movsd dot com
              The MASM Forum

              www.masm32.com

              Comment

              Working...
              X