I have had this algo for a while but needed it in basic string form rather than as zero terminated strings so I re-tweaked the MASM original by converting it to PB and modified it to take basic string as input and a basic string array as output.
It is reasonably fast allowing that it does a copy pass after the tokeniser to put the results into a string array. On this 3 gig quad in a single thread I keep getting 94ms for a 17 meg test file with about 130000 lines in it.
The function is written with a duplicate guard so it avoids duplicate errors.
It is reasonably fast allowing that it does a copy pass after the tokeniser to put the results into a string array. On this 3 gig quad in a single thread I keep getting 94ms for a 17 meg test file with about 130000 lines in it.
The function is written with a duplicate guard so it avoids duplicate errors.
Code:
' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤ ' Build with PBCC50 ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤ %usemacros = 1 ' exclude added code ' ---------------------------- ' change this to your own path ' ---------------------------- #include "\pbwin90\include\win32api.inc" ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤ FUNCTION PBmain as LONG LOCAL lcnt as DWORD LOCAL cntr as DWORD LOCAL tc1 AS DWORD LOCAL tc2 AS DWORD ' ------------------------------------------------------------ ' set your test file here, bigger = better for timing purposes ' ------------------------------------------------------------ Open "lt.txt" for Binary as #1 Get$ #1, lof(1), inc$ Close #1 tc1 = GetTickCount dim txt(0) as STRING ' allocate an empty string array lcnt = tokenise(inc$,txt()) ' call the tokeniser tc2 = GetTickCount StdOut str$(tc2 - tc1)+" milliseconds" StdOut "Do you wish to see the results ? y/n" lp1: k$ = inkey$ SleepEx 1,0 If lcase$(k$) = "n" Then goto quit If lcase$(k$) = "y" Then goto showit goto lp1 showit: cntr = 0 Do StdOut txt(cntr) ! add cntr, 1 ! sub lcnt, 1 Loop while lcnt <> 0 quit: erase txt() End FUNCTION ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤ #IF NOT %DEF(%fn_tokenise) ' the duplicate guard %fn_tokenise = 1 FUNCTION tokenise(src$,arr() as STRING) as DWORD ' ******************************************************************** ' ' DESCRIPTION ' ' Tokenise a text file into a dynamic string array ' ' input arguments are, ' ' 1. a multiline dynamic string delimited with CRLF pair (13 10) ' 2. an empty dynamic string array which must be dimensioned first ' ' EXAMPLE : dim MyArray(0) as STRING ' lcnt = tokenise(MySrcString$,MyArray()) ' ' NOTE : the function left trims lines of text and removes blank lines ' ' ******************************************************************** #REGISTER NONE LOCAL bcnt as DWORD LOCAL lcnt as DWORD LOCAL pTxt as DWORD LOCAL pArr as DWORD LOCAL gArr as DWORD cpy$ = src$ ' work on a copy of the source string pTxt = StrPtr(cpy$) ! mov edi, 1 ' set counter to 1 in case of no trailing CRLF ! mov esi, pTxt ! sub esi, 1 ' ------------------------------------------------ ' count line feeds to calculate pointer array size ' ------------------------------------------------ lbl1: ! add esi, 1 ! movzx edx, BYTE PTR [esi] ! test edx, edx ' test for terminator ! jz lbl2 ! cmp edx, 10 ' test for line feed ! jne lbl1 ! add edi, 1 ' lf count in EDI ! jmp lbl1 lbl2: ' -------------------- ' multiply result by 4 ' -------------------- ''' ! lea edi, [0+edi*4] ' for non PIV hardware ! add edi, edi ' PIV specific ! add edi, edi ! mov bcnt, edi ' -------------------------- ' allocate the pointer array ' -------------------------- gArr = GlobalAlloc(%GMEM_FIXED or %GMEM_ZEROINIT,bcnt) ! mov edi, gArr ' copy allocated memory address into EDI ! mov esi, pTxt ! xor eax, eax ' zero arg counter ! sub esi, 1 ! jmp ftrim ' --------------------------------- #align 4 terminate: ! mov BYTE PTR [esi], 0 ' terminate end of current line ftrim: ' scan to find next acceptable character ! add esi, 1 ! movzx edx, BYTE PTR [esi] ' zero extend byte ! test edx, edx ' test for zero terminator ! jz lout ! cmp edx, 32 ! jbe ftrim ' scan again for 32 or less ' ================= ! mov [edi], esi ' write current location to pointer ! add edi, 4 ' set next pointer location ! add eax, 1 ' increment arg count return value ' ================= ttrim: ' scan to find the next CR or LF ! add esi, 1 ! movzx edx, BYTE PTR [esi] ' zero extend byte ! cmp edx, 13 ! jg ttrim ' short loop on normal case ! je terminate ! cmp edx, 10 ' extra test for ascii 10 ! je terminate ! test edx, edx ! jnz ttrim ' loop back if not zero, IE TAB. ' --------------------------------- lout: ! mov lcnt, eax redim arr(lcnt) as STRING ' redimension the array to the required line count dim zarr(lcnt) as ASCIIZ PTR at gArr ' overlay an ASCIIZ array over gArr ! mov esi, lcnt bcnt = 0 ' reuse bcnt as the array index cpyit: arr(bcnt) = @zarr(bcnt) ' copy zero terminated data to dynamic string array ! add bcnt, 1 ! sub esi, 1 ! jnz cpyit erase zarr() ' erase overlayed array GlobalFree gArr ' free allocated global memory FUNCTION = lcnt ' return array member count END FUNCTION #ENDIF ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Comment