Announcement

Collapse
No announcement yet.

Fastest possible string search

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Lance Edmonds
    replied
    Does my (abeit untidy) test code above run for you? (Obviously it will need the paths edited accordingly), but if it runs then you are half-way to getting it working in your real program.

    Yes, the "&" type-specifier suffix means a LONG integer.

    This thread is now getting to long and is closed... if you require more help with this item, please start a new thread and include a link to this thread.

    Thanks!



    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Nathan Evans
    replied
    Lance,

    Hmm.. Once i've got my BM working, i'll be able to use both.

    I know InStr works better with smaller files..
    And BM with larger files.

    So i could easily put a If statement that chooses which method to
    use depending on the Filesize.

    I will have to do some testing across multiple platforms to find
    a good value as to when BM should be run.

    Thanks.

    Leave a comment:


  • Nathan Evans
    replied
    Lance, i'm just trying to get this working in my main project.

    I dont like using variables like: rv&
    So i'm defining them as LOCAL.

    Just to be clear, a & signifies a LONG right ?

    Leave a comment:


  • Lance Edmonds
    replied
    I played with this a little further and the following code is the fruits of my efforts.

    When run on the WIN32API.INC file, INSTR() beats the BM algorithm code every time on my Win2K PC, at a fairly consistent 1-30% (worst:best).

    Get rid of the B$ allocation within the loops to greatly speed this up too... on my PC, the MB loop runs in 300-400ms, and the INSTR loop around 30% faster.

    Here is the code I used:
    Code:
    #COMPILE EXE
    #INCLUDE "WIN32API.INC"
    '##########################################################################
    
    FUNCTION ScanString(BYVAL startpos AS LONG, _
                          BYVAL source   AS LONG, BYVAL lnsrc AS LONG, _
                          BYVAL pattern  AS LONG, BYVAL lnpatn AS LONG) AS LONG
    
          #REGISTER NONE
    
          ! mov edi, pattern
          ! mov al, [edi]       ; get 1st byte
    
          ! mov esi, source
          ! mov ecx, lnsrc
          ! add esi, ecx
          ! neg ecx
          ! add ecx, startpos
    
          ! dec lnpatn
    
          ! jmp main_loop
    
        ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
        pre_loop:
          ! pop ecx
          ! inc ecx
    
        main_loop:
        ' ------------
        ' unroll by 2
        ' ------------
          ! cmp al, [esi+ecx]
          ! je pre_sub
          ! inc ecx
          ! jz nMatch
    
          ! cmp al, [esi+ecx]
          ! je pre_sub
          ! inc ecx
          ! jnz main_loop
        ' ------------
          ! jmp nMatch
    
        pre_sub:
          ! push ecx
          ! mov edi, pattern
          ! mov ebx, lnpatn
    
        sub_loop:
          ! mov ah, [esi+ecx+1]
          ! cmp ah, [edi+1]
          ! jne pre_loop
          ! inc ecx
          ! inc edi
          ! dec ebx
          ! jnz sub_loop
    
        ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
          ! pop eax
          ! add eax, lnsrc
          ! jmp Outa_Here
    
        nMatch:
          ! mov eax, -1
    
        Outa_Here:
          ! mov FUNCTION, eax
    
    END FUNCTION
    
    ' ########################################################################
    
    FUNCTION PBMAIN
        LOCAL fl AS LONG
        REDIM patterns$(25)
    
        OPEN "F:\PBDLL60\WINAPI\win32api.inc" FOR BINARY AS #1
    '    OPEN "C:\WINNT5\SYSTEM32\SHELL32.DLL" FOR BINARY AS #1
        fl = LOF(1)
        GET$ #1,fl,a$
        CLOSE #1
    
        patterns$(0)  = "VerQueryValue"
        patterns$(1)  = "GetFileVersionInfo"
        patterns$(2)  = "GetFileVersionInfoSize"
        patterns$(3)  = "VerInstallFile"
        patterns$(4)  = "VerFindFile"
        patterns$(5)  = "SystemTimeToVariantTime"
        patterns$(6)  = "VariantTimeToSystemTime"
        patterns$(7)  = "VarDateFromStr"
        patterns$(8)  = "WinExecError"
        patterns$(9)  = "SHFreeNameMappings"
        patterns$(10) = "DragFinish"
        patterns$(11) = "DragAcceptFiles"
        patterns$(12) = "Shell_NotifyIcon"
        patterns$(13) = "ShellExecute"
        patterns$(14) = "ShellAbout"
        patterns$(15) = "SHFormatDrive"
        patterns$(16) = "SHGetNewLinkInfo"
        patterns$(17) = "SHGetFileInfo"
        patterns$(18) = "SHFileOperation"
        patterns$(19) = "SHAppBarMessage"
        patterns$(20) = "FindExecutable"
        patterns$(21) = "FindEnvironmentString"
        patterns$(22) = "ExtractIconEx"
        patterns$(23) = "ExtractIcon"
        patterns$(24) = "ExtractAssociatedIcon"
    
        aln& = LEN(a$)
    
        LOCAL cnt AS LONG
        cnt = 0
        b$ = ""
    
        tc2& = GetTickCount
        DO
    
            rv& = ScanString(0,STRPTR(a$),aln&, _
                STRPTR(patterns$(cnt)),LEN(patterns$(cnt)))
            b$ = b$ + STR$(rv&)
            ! inc cnt
        LOOP WHILE cnt < 25
        tc2& = GettickCount - tc2&
    
        cnt = 0
        b$ = b$ + $CRLF
        tc& = GetTickCount
        DO
            rv& = INSTR(a$, patterns$(cnt))
            b$ = b$ + STR$(rv&)
            ! inc cnt
        LOOP WHILE cnt < 25
        
        tc& = GetTickCount - tc&
        
        b$ = B$ + $CRLF + "INSTR/MB = " + STR$(tc&/tc2&)
    
        MessageBox hWin&,BYVAL STRPTR(B$),"MB=" + STR$(tc2&) + "  INSTR=" + STR$(tc&), _
            %MB_OK OR %MB_ICONINFORMATION
    
    END FUNCTION

    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Lance Edmonds
    replied
    I changed the code to load SHELL32.DLL (OPEN "C:\WINNT5\SYSTEM32\SHELL32.DLL" FOR BINARY AS #1) and ran Steves code above... it works fine for me...

    My K6/266 (with an MP3 player running and a bunch of other apps going too), it did all 25 searches in SHELL32.DLL in 801mS-841mS, and found all of the API names that are actually contained in SHELL32.DLL.

    I can only suggest that you relook at your code or post it here...

    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Nathan Evans
    replied
    Steve, that code seems to be working. It returns the correct location
    for each of them strings.

    Lance, i started an entirely new project, pasted in the ScanString function,
    created a PBMAIN with the previous example Steve posted - but that would always return
    622 no matter what i searched for.


    Steve, have you tested this code with Binary files? ie: EXEs.
    No matter what i search for in a binary file, it will return -1.

    You are right though, these Boyer Moore algorithms are damn fast!!

    Many thanks to you both.

    - Nathan.

    Leave a comment:


  • Steve Hutchesson
    replied
    Nathan,

    Here is a quick test piece that does the type of searching you
    require. This runs 25 iterations of win32api.inc in about 150ms
    on my PIII with 25 different patterns searched for in the same
    source file. About 125 meg/sec which is probably in the ball park
    in terms of the speed you require.

    Code:
                  LOCAL fl as LONG
                  redim patterns$(25)
      
                  Open "win32api.inc" for Binary as #1
                    fl = lof(1)
                    Get$ #1,fl,a$
                  Close #1
      
                  patterns$(0)  = "VerQueryValue"
                  patterns$(1)  = "GetFileVersionInfo"
                  patterns$(2)  = "GetFileVersionInfoSize"
                  patterns$(3)  = "VerInstallFile"
                  patterns$(4)  = "VerFindFile"
                  patterns$(5)  = "SystemTimeToVariantTime"
                  patterns$(6)  = "VariantTimeToSystemTime"
                  patterns$(7)  = "VarDateFromStr"
                  patterns$(8)  = "WinExecError"
                  patterns$(9)  = "SHFreeNameMappings"
                  patterns$(10) = "DragFinish"
                  patterns$(11) = "DragAcceptFiles"
                  patterns$(12) = "Shell_NotifyIcon"
                  patterns$(13) = "ShellExecute"
                  patterns$(14) = "ShellAbout"
                  patterns$(15) = "SHFormatDrive"
                  patterns$(16) = "SHGetNewLinkInfo"
                  patterns$(17) = "SHGetFileInfo"
                  patterns$(18) = "SHFileOperation"
                  patterns$(19) = "SHAppBarMessage"
                  patterns$(20) = "FindExecutable"
                  patterns$(21) = "FindEnvironmentString"
                  patterns$(22) = "ExtractIconEx"
                  patterns$(23) = "ExtractIcon"
                  patterns$(24) = "ExtractAssociatedIcon"
      
                  aln& = len(a$)
      
                  LOCAL cnt as LONG
                  cnt = 0
      
                  tc& = GetTickCount
      
                  Do
      
                  rv& = ScanString(0,StrPtr(a$),aln&, _
                                   StrPtr(patterns$(cnt)),len(patterns$(cnt)))
                ' --------------------------------------
                  SetWindowText hWnd,ByCopy str$(rv&)
                ' --------------------------------------
                  ! inc cnt
                  Loop while cnt < 25
      
                  tc2& = GettickCount - tc&
      
                  MessageBox hWin,ByCopy str$(rv&),ByCopy str$(tc2&), _
                             %MB_OK or %MB_ICONINFORMATION
    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Lance Edmonds
    replied
    Something is obviously wrong... post the entire source code you are using, and maybe we (or Steve!) can spot something...


    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Nathan Evans
    replied
    Steve,

    I tried the example you posted.

    The tc2& variable would return 0 at the end of the test..
    The rv& variable would hold 622 at the end of the test..

    I changed the search string to: ' Last Update: August 22, 1999
    which is right near the top of the file, and this returned 622 also.

    Am i doing something drastically wrong??

    - Nathan

    [This message has been edited by Nathan Evans (edited July 01, 2001).]

    Leave a comment:


  • Steve Hutchesson
    replied
    Nathan,

    This test code below is running 100 iterations through win32api.inc in about
    700 ms on my PIII so it is searching about 75 meg in 700 ms.

    Code:
        Open "win32api.inc" for Binary as #1
          Get$ #1,lof(1),a$
        Close #1
      
        b$ = "' ** VERSION.DLL Declares" ' << almost at the end
      
        LOCAL cnt as LONG
        cnt = 0
      
        tc& = GetTickCount
      
        ap&  = StrPtr(a$)
        aln& = len(a$)
        bpt& = StrPtr(b$)
        bln& = len(b$)
      
        Do
          rv& = ScanString(0,ap&,aln&,bpt&,bln&)
          ! inc cnt
        loop while cnt < 100
      
        tc2& = GetTickCount - tc&
      
        MessageBox hWin,ByCopy str$(rv&),ByCopy str$(tc2&), _
                   %MB_OK or %MB_ICONINFORMATION
    I would make sure that what you are passing to the "ScanString" function
    is correct. Perhaps you need a 3 dimensional array with STRING, ADDRESS and
    LENGTH so that you can load the source file and just loop through it.
    Converting each time you run a string is a lot slower, far better to have them
    all preloaded so that when you do the scan, they are ready.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Nathan Evans
    replied
    Morning

    Steve,
    I have a LoadDB sub which is called in PBMAIN, this loads a currently
    small definitions file of 30 into 2 dimensions string array. One of the
    subscripts in this contains a hexidecimal search string. Still in the
    LoadDB sub, i convert these Hexidecimal strings using Chr and Val into
    normal ASCII. The search array of strings is then loaded into memory.

    Next i have a 'AnalyseFile' function. This opens the file specified in
    the 'szFile as String' parameter of the function. Then starts a For Next loop
    to analyse that file for each search string in the 2D search string array
    that was created from LoadDB.

    If i use the normal InStr function that comes with PB, this will
    analyse a 3.2mb BMP file for all 30 search strings in 651 ms.
    I used GetTickCount and a bit a math to work out a Scan Speed in Mb/sec.
    And Calculated 6Mb/sec - which is awfully slow


    Right now, i'm working with your ScanString function. But it doesn't find anything,
    even if they are valid strings as found in the file..

    Code:
     LOCAL tC AS LONG, enumI AS LONG
     LOCAL retVal AS LONG, MainStr AS STRING, SearchStr AS STRING  
    
        SearchStr = ayFileDef(enumI, 2)
        retVal = ScanString( 0, STRPTR(MainStr), LEN(MainStr), STRPTR(SearchStr), LEN(SearchStr) )
        'retVal = INSTR( 1, MainStr, SearchStr )
    I'm using the exact same ScanString as you posted.
    It will always return -1, what am i doing wrong?

    Thanks very much and bare with me

    Best,

    -Nathan.




    [This message has been edited by Nathan Evans (edited July 01, 2001).]

    Leave a comment:


  • Steve Hutchesson
    replied
    Borje,

    Thanks for testing the algo, this PIII I use has so many "unusual"
    effects when timing algorithms that it can nearly drive you nuts.

    It looks like the ScanString algo is still a bit too slow in its
    branching code so i will see if I can find a way to get its branch
    speed up a bit.

    Nathan,

    You should not have to do a conversion to a basic dynamic string
    to use the two algorithms that I have pasted in, they work directly
    in addresses and the way you load the file using the API function
    should give you the start address and the file length.

    To get your speed up, I would suggest that you make an array of the
    search patterns you wish to use and store their lengths as well so
    that this information is available when you start the scanning for
    each file. The idea is to load the search patterns at startup so that
    you have both their addresses and their lengths.

    When you load the file to scan from disk, you will have the start address
    and length and you can loop through the array of patterns with very little
    overhead once you set it up.

    The length of the pattern being searched is not that important in most
    instances, the scanners will stop testing after the first mismatch and
    continue with the main scan. If the Boyer Moore algo runs OK, it
    will actually get faster as the pattern gets longer.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Nathan Evans
    replied
    Borje/Steve, i had a go with the code you kindly posted, and have had some success.

    I am inputting a 500kb file.. and searched it for 15 strings.

    The total process, per GetTickCount, took 40ms. This is MUCH faster than
    a previous implementation i did in VisualBasic using it's InStr, if i recall
    it was around 300ms..

    However much i would like to use this code. I do not believe 40ms is fast enough,
    when searching for 15 strings.

    May i ask a question?..

    Is the speed of the ScanString function affected by the length of
    the SearchStr ? ie: will using longer search strings increase or decrease
    the speed of the search ?

    I will keep fiddling with this code, and see if i can make it any more efficient.

    Thank you very much in guiding me through this.

    regards
    - Nathan.

    Leave a comment:


  • Borje Hagsten
    replied
    Okay, tested on relatively slow machine. ScanString is a bit faster
    and seems to work fine. Must observe results though, since they are
    zero based and returns -1 on failure, unlike INSTR's 1-based and zero
    on failure. Very interesting code, Hutch - I like it a lot.
    Code:
    ' Processor = 266 MHz
    ' MainStr   = 80KB text file.
    ' wrd()     = array of 3000 different phrases, ranging from 2 - 37 bytes long
    ' NumRecs   = UBOUND(wrd)
    '
    ' Average results from multiple scans on file
    ' result INSTR, 603 hits/scan in 6.39 sec.
    ' result myINSTR, 603 hits/scan in 6.25 sec.
    ' result ScanString, 603 hits/scan in 5.58 sec.
    ' test code as follows: (here for ScanString, but same kind was used for all)
    ' ------------------------------------------------------------------------------------
      LOCAL c AS LONG, mPtr AS LONG, sPtr AS LONG, mLen AS LONG, sLen AS LONG, _
            hit AS LONG, lRes AS LONG, tc1 AS LONG, tc2 AS LONG
      tc1  = GetTickCount()
      mPtr = STRPTR(MainStr)
      mLen = LEN(MainStr)
     
      FOR c = 0 TO Numrecs
         SearchStr = wrd(c)
         sPtr = STRPTR(SearchStr)
         sLen = LEN(SearchStr)
         lRes = ScanString(0, mPtr, mLen, sPtr, sLen)
         IF lRes > -1 THEN
            INCR hit
            DO
               lRes = ScanString(lRes + sLen, mPtr, mLen, sPtr, sLen)
               IF lRes > -1 THEN INCR hit
            LOOP WHILE lRes > -1
         END IF
      NEXT
     
      tc2 = GetTickCount() - tc1 
      MSGBOX STR$(hit) + "  " + FORMAT$(tc2/1000, "0.0000") + " Pos: " + FORMAT$(lRes), %MB_OK OR %MB_ICONINFORMATION, "ScanString"
      ' ------------------------------------------------------------------------------------

    ------------------

    Leave a comment:


  • Lance Edmonds
    replied
    a$ = PEEK$(VARPTR(ByteArray(0)), lengthofarray&)

    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Nathan Evans
    replied
    I think before i can do anything, i need to convert this Byte Array
    into a String. As it seems none of these search algorithms will process
    a Byte Array

    I wrote a simple For Next loop using Chr$ to convert the Byte Array into a string
    but it's far too slow..

    I scoured the helpfiles for a PB equivilent of VB's StrConv, but couldn't
    find anything. You guys have any fast subs to do it?

    Thanks.

    -Nathan

    Leave a comment:


  • Steve Hutchesson
    replied
    Borje,

    If you have the time and still have the test handy, would you run this
    byte scanner through it to see how it shapes up, I only have it
    benchmarked in MASM when I was using this type of algo to test
    against the Boyer Moore ones I was working on.

    What I was trying to do with this one was bridge the gap between
    scanners and the BM type algos. It has a short main loop instruction
    path where my testing showed that the main time was taken with byte
    scanners, its branch path is reasonable but I could not se an easy way to
    shorten it. I unrolled the main loop by 2 so that there were no
    unusual effects from it being so short.

    Regards,

    [email protected]

    Code:
      '##########################################################################
      
      FUNCTION ScanString(ByVal startpos as LONG, _
                          ByVal source   as LONG, ByVal lnsrc as LONG, _
                          ByVal pattern  as LONG, ByVal lnpatn as LONG) as LONG
      
          #REGISTER NONE
      
          ! mov edi, pattern
          ! mov al, [edi]       ; get 1st byte
      
          ! mov esi, source
          ! mov ecx, lnsrc
          ! add esi, ecx
          ! neg ecx
          ! add ecx, startpos
      
          ! dec lnpatn
      
          ! jmp main_loop
      
        ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
      
        pre_loop:
          ! pop ecx
          ! inc ecx
      
        main_loop:
        ' ------------
        ' unroll by 2
        ' ------------
          ! cmp al, [esi+ecx]
          ! je pre_sub
          ! inc ecx
          ! jz nMatch
      
          ! cmp al, [esi+ecx]
          ! je pre_sub
          ! inc ecx
          ! jnz main_loop
        ' ------------
          ! jmp nMatch
      
        pre_sub:
          ! push ecx
          ! mov edi, pattern
          ! mov ebx, lnpatn
      
        sub_loop:
          ! mov ah, [esi+ecx+1]
          ! cmp ah, [edi+1]
          ! jne pre_loop
          ! inc ecx
          ! inc edi
          ! dec ebx
          ! jnz sub_loop
      
        ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
      
          ! pop eax
          ! add eax, lnsrc
          ! jmp Outa_Here
      
        nMatch:
          ! mov eax, -1
      
        Outa_Here:
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' ########################################################################
    ------------------

    Leave a comment:


  • Nathan Evans
    replied
    I will try what you have suggested, Steve, Lance.. and get back to you
    this afternoon.



    -Nath

    Leave a comment:


  • Lance Edmonds
    replied
    Nathan, According to your snippet "szdata" is a BYTE array, so you cannot use STRPTR() on it!

    Maybe you need to post more of the code so we can see exactly what you are doing here, but you probably just need to pass it as BYVAL VARPTR(szData(0)) which will pass the address of the 1st subscript (0) of the szData() array (assuming szData's LBOUND is 0).



    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>

    Leave a comment:


  • Steve Hutchesson
    replied
    Nathan,

    Use the version I pasted in, it does not use the external shift table that the
    first one did.

    Code:
    FUNCTION BMBinSearch(ByVal startpos as LONG, _
                         ByVal lpSource as LONG,ByVal srcLngth as LONG, _
                         ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG
    use either VarPtr() or StrPtr() to get the address of the source, depending on
    how you load the source, the source length, the address of the pattern to search
    for and the pattern length.

    Code:
    retval = BMBinSearch(0,lpSource,slen,lpPattern,plen)
    See if you get a result from this.

    Regards,

    [email protected]


    ------------------

    Leave a comment:

Working...
X