Announcement

Collapse
No announcement yet.

Fastest possible string search

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #21
    a$ = PEEK$(VARPTR(ByteArray(0)), lengthofarray&)

    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected][email protected]</A>
    Lance
    mailto:[email protected]

    Comment


    • #22
      Okay, tested on relatively slow machine. ScanString is a bit faster
      and seems to work fine. Must observe results though, since they are
      zero based and returns -1 on failure, unlike INSTR's 1-based and zero
      on failure. Very interesting code, Hutch - I like it a lot.
      Code:
      ' Processor = 266 MHz
      ' MainStr   = 80KB text file.
      ' wrd()     = array of 3000 different phrases, ranging from 2 - 37 bytes long
      ' NumRecs   = UBOUND(wrd)
      '
      ' Average results from multiple scans on file
      ' result INSTR, 603 hits/scan in 6.39 sec.
      ' result myINSTR, 603 hits/scan in 6.25 sec.
      ' result ScanString, 603 hits/scan in 5.58 sec.
      ' test code as follows: (here for ScanString, but same kind was used for all)
      ' ------------------------------------------------------------------------------------
        LOCAL c AS LONG, mPtr AS LONG, sPtr AS LONG, mLen AS LONG, sLen AS LONG, _
              hit AS LONG, lRes AS LONG, tc1 AS LONG, tc2 AS LONG
        tc1  = GetTickCount()
        mPtr = STRPTR(MainStr)
        mLen = LEN(MainStr)
       
        FOR c = 0 TO Numrecs
           SearchStr = wrd(c)
           sPtr = STRPTR(SearchStr)
           sLen = LEN(SearchStr)
           lRes = ScanString(0, mPtr, mLen, sPtr, sLen)
           IF lRes > -1 THEN
              INCR hit
              DO
                 lRes = ScanString(lRes + sLen, mPtr, mLen, sPtr, sLen)
                 IF lRes > -1 THEN INCR hit
              LOOP WHILE lRes > -1
           END IF
        NEXT
       
        tc2 = GetTickCount() - tc1 
        MSGBOX STR$(hit) + "  " + FORMAT$(tc2/1000, "0.0000") + " Pos: " + FORMAT$(lRes), %MB_OK OR %MB_ICONINFORMATION, "ScanString"
        ' ------------------------------------------------------------------------------------

      ------------------

      Comment


      • #23
        Borje/Steve, i had a go with the code you kindly posted, and have had some success.

        I am inputting a 500kb file.. and searched it for 15 strings.

        The total process, per GetTickCount, took 40ms. This is MUCH faster than
        a previous implementation i did in VisualBasic using it's InStr, if i recall
        it was around 300ms..

        However much i would like to use this code. I do not believe 40ms is fast enough,
        when searching for 15 strings.

        May i ask a question?..

        Is the speed of the ScanString function affected by the length of
        the SearchStr ? ie: will using longer search strings increase or decrease
        the speed of the search ?

        I will keep fiddling with this code, and see if i can make it any more efficient.

        Thank you very much in guiding me through this.

        regards
        - Nathan.

        Comment


        • #24
          Borje,

          Thanks for testing the algo, this PIII I use has so many "unusual"
          effects when timing algorithms that it can nearly drive you nuts.

          It looks like the ScanString algo is still a bit too slow in its
          branching code so i will see if I can find a way to get its branch
          speed up a bit.

          Nathan,

          You should not have to do a conversion to a basic dynamic string
          to use the two algorithms that I have pasted in, they work directly
          in addresses and the way you load the file using the API function
          should give you the start address and the file length.

          To get your speed up, I would suggest that you make an array of the
          search patterns you wish to use and store their lengths as well so
          that this information is available when you start the scanning for
          each file. The idea is to load the search patterns at startup so that
          you have both their addresses and their lengths.

          When you load the file to scan from disk, you will have the start address
          and length and you can loop through the array of patterns with very little
          overhead once you set it up.

          The length of the pattern being searched is not that important in most
          instances, the scanners will stop testing after the first mismatch and
          continue with the main scan. If the Boyer Moore algo runs OK, it
          will actually get faster as the pattern gets longer.

          Regards,

          [email protected]

          ------------------
          hutch at movsd dot com
          The MASM Forum - SLL Modules and PB Libraries

          http://www.masm32.com/board/index.php?board=69.0

          Comment


          • #25
            Morning

            Steve,
            I have a LoadDB sub which is called in PBMAIN, this loads a currently
            small definitions file of 30 into 2 dimensions string array. One of the
            subscripts in this contains a hexidecimal search string. Still in the
            LoadDB sub, i convert these Hexidecimal strings using Chr and Val into
            normal ASCII. The search array of strings is then loaded into memory.

            Next i have a 'AnalyseFile' function. This opens the file specified in
            the 'szFile as String' parameter of the function. Then starts a For Next loop
            to analyse that file for each search string in the 2D search string array
            that was created from LoadDB.

            If i use the normal InStr function that comes with PB, this will
            analyse a 3.2mb BMP file for all 30 search strings in 651 ms.
            I used GetTickCount and a bit a math to work out a Scan Speed in Mb/sec.
            And Calculated 6Mb/sec - which is awfully slow


            Right now, i'm working with your ScanString function. But it doesn't find anything,
            even if they are valid strings as found in the file..

            Code:
             LOCAL tC AS LONG, enumI AS LONG
             LOCAL retVal AS LONG, MainStr AS STRING, SearchStr AS STRING  
            
                SearchStr = ayFileDef(enumI, 2)
                retVal = ScanString( 0, STRPTR(MainStr), LEN(MainStr), STRPTR(SearchStr), LEN(SearchStr) )
                'retVal = INSTR( 1, MainStr, SearchStr )
            I'm using the exact same ScanString as you posted.
            It will always return -1, what am i doing wrong?

            Thanks very much and bare with me

            Best,

            -Nathan.




            [This message has been edited by Nathan Evans (edited July 01, 2001).]

            Comment


            • #26
              Nathan,

              This test code below is running 100 iterations through win32api.inc in about
              700 ms on my PIII so it is searching about 75 meg in 700 ms.

              Code:
                  Open "win32api.inc" for Binary as #1
                    Get$ #1,lof(1),a$
                  Close #1
                
                  b$ = "' ** VERSION.DLL Declares" ' << almost at the end
                
                  LOCAL cnt as LONG
                  cnt = 0
                
                  tc& = GetTickCount
                
                  ap&  = StrPtr(a$)
                  aln& = len(a$)
                  bpt& = StrPtr(b$)
                  bln& = len(b$)
                
                  Do
                    rv& = ScanString(0,ap&,aln&,bpt&,bln&)
                    ! inc cnt
                  loop while cnt < 100
                
                  tc2& = GetTickCount - tc&
                
                  MessageBox hWin,ByCopy str$(rv&),ByCopy str$(tc2&), _
                             %MB_OK or %MB_ICONINFORMATION
              I would make sure that what you are passing to the "ScanString" function
              is correct. Perhaps you need a 3 dimensional array with STRING, ADDRESS and
              LENGTH so that you can load the source file and just loop through it.
              Converting each time you run a string is a lot slower, far better to have them
              all preloaded so that when you do the scan, they are ready.

              Regards,

              [email protected]

              ------------------
              hutch at movsd dot com
              The MASM Forum - SLL Modules and PB Libraries

              http://www.masm32.com/board/index.php?board=69.0

              Comment


              • #27
                Steve,

                I tried the example you posted.

                The tc2& variable would return 0 at the end of the test..
                The rv& variable would hold 622 at the end of the test..

                I changed the search string to: ' Last Update: August 22, 1999
                which is right near the top of the file, and this returned 622 also.

                Am i doing something drastically wrong??

                - Nathan

                [This message has been edited by Nathan Evans (edited July 01, 2001).]

                Comment


                • #28
                  Something is obviously wrong... post the entire source code you are using, and maybe we (or Steve!) can spot something...


                  ------------------
                  Lance
                  PowerBASIC Support
                  mailto:[email protected][email protected]</A>
                  Lance
                  mailto:[email protected]

                  Comment


                  • #29
                    Nathan,

                    Here is a quick test piece that does the type of searching you
                    require. This runs 25 iterations of win32api.inc in about 150ms
                    on my PIII with 25 different patterns searched for in the same
                    source file. About 125 meg/sec which is probably in the ball park
                    in terms of the speed you require.

                    Code:
                                  LOCAL fl as LONG
                                  redim patterns$(25)
                      
                                  Open "win32api.inc" for Binary as #1
                                    fl = lof(1)
                                    Get$ #1,fl,a$
                                  Close #1
                      
                                  patterns$(0)  = "VerQueryValue"
                                  patterns$(1)  = "GetFileVersionInfo"
                                  patterns$(2)  = "GetFileVersionInfoSize"
                                  patterns$(3)  = "VerInstallFile"
                                  patterns$(4)  = "VerFindFile"
                                  patterns$(5)  = "SystemTimeToVariantTime"
                                  patterns$(6)  = "VariantTimeToSystemTime"
                                  patterns$(7)  = "VarDateFromStr"
                                  patterns$(8)  = "WinExecError"
                                  patterns$(9)  = "SHFreeNameMappings"
                                  patterns$(10) = "DragFinish"
                                  patterns$(11) = "DragAcceptFiles"
                                  patterns$(12) = "Shell_NotifyIcon"
                                  patterns$(13) = "ShellExecute"
                                  patterns$(14) = "ShellAbout"
                                  patterns$(15) = "SHFormatDrive"
                                  patterns$(16) = "SHGetNewLinkInfo"
                                  patterns$(17) = "SHGetFileInfo"
                                  patterns$(18) = "SHFileOperation"
                                  patterns$(19) = "SHAppBarMessage"
                                  patterns$(20) = "FindExecutable"
                                  patterns$(21) = "FindEnvironmentString"
                                  patterns$(22) = "ExtractIconEx"
                                  patterns$(23) = "ExtractIcon"
                                  patterns$(24) = "ExtractAssociatedIcon"
                      
                                  aln& = len(a$)
                      
                                  LOCAL cnt as LONG
                                  cnt = 0
                      
                                  tc& = GetTickCount
                      
                                  Do
                      
                                  rv& = ScanString(0,StrPtr(a$),aln&, _
                                                   StrPtr(patterns$(cnt)),len(patterns$(cnt)))
                                ' --------------------------------------
                                  SetWindowText hWnd,ByCopy str$(rv&)
                                ' --------------------------------------
                                  ! inc cnt
                                  Loop while cnt < 25
                      
                                  tc2& = GettickCount - tc&
                      
                                  MessageBox hWin,ByCopy str$(rv&),ByCopy str$(tc2&), _
                                             %MB_OK or %MB_ICONINFORMATION
                    Regards,

                    [email protected]

                    ------------------
                    hutch at movsd dot com
                    The MASM Forum - SLL Modules and PB Libraries

                    http://www.masm32.com/board/index.php?board=69.0

                    Comment


                    • #30
                      Steve, that code seems to be working. It returns the correct location
                      for each of them strings.

                      Lance, i started an entirely new project, pasted in the ScanString function,
                      created a PBMAIN with the previous example Steve posted - but that would always return
                      622 no matter what i searched for.


                      Steve, have you tested this code with Binary files? ie: EXEs.
                      No matter what i search for in a binary file, it will return -1.

                      You are right though, these Boyer Moore algorithms are damn fast!!

                      Many thanks to you both.

                      - Nathan.

                      Comment


                      • #31
                        I changed the code to load SHELL32.DLL (OPEN "C:\WINNT5\SYSTEM32\SHELL32.DLL" FOR BINARY AS #1) and ran Steves code above... it works fine for me...

                        My K6/266 (with an MP3 player running and a bunch of other apps going too), it did all 25 searches in SHELL32.DLL in 801mS-841mS, and found all of the API names that are actually contained in SHELL32.DLL.

                        I can only suggest that you relook at your code or post it here...

                        ------------------
                        Lance
                        PowerBASIC Support
                        mailto:[email protected][email protected]</A>
                        Lance
                        mailto:[email protected]

                        Comment


                        • #32
                          I played with this a little further and the following code is the fruits of my efforts.

                          When run on the WIN32API.INC file, INSTR() beats the BM algorithm code every time on my Win2K PC, at a fairly consistent 1-30% (worst:best).

                          Get rid of the B$ allocation within the loops to greatly speed this up too... on my PC, the MB loop runs in 300-400ms, and the INSTR loop around 30% faster.

                          Here is the code I used:
                          Code:
                          #COMPILE EXE
                          #INCLUDE "WIN32API.INC"
                          '##########################################################################
                          
                          FUNCTION ScanString(BYVAL startpos AS LONG, _
                                                BYVAL source   AS LONG, BYVAL lnsrc AS LONG, _
                                                BYVAL pattern  AS LONG, BYVAL lnpatn AS LONG) AS LONG
                          
                                #REGISTER NONE
                          
                                ! mov edi, pattern
                                ! mov al, [edi]       ; get 1st byte
                          
                                ! mov esi, source
                                ! mov ecx, lnsrc
                                ! add esi, ecx
                                ! neg ecx
                                ! add ecx, startpos
                          
                                ! dec lnpatn
                          
                                ! jmp main_loop
                          
                              ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          
                              pre_loop:
                                ! pop ecx
                                ! inc ecx
                          
                              main_loop:
                              ' ------------
                              ' unroll by 2
                              ' ------------
                                ! cmp al, [esi+ecx]
                                ! je pre_sub
                                ! inc ecx
                                ! jz nMatch
                          
                                ! cmp al, [esi+ecx]
                                ! je pre_sub
                                ! inc ecx
                                ! jnz main_loop
                              ' ------------
                                ! jmp nMatch
                          
                              pre_sub:
                                ! push ecx
                                ! mov edi, pattern
                                ! mov ebx, lnpatn
                          
                              sub_loop:
                                ! mov ah, [esi+ecx+1]
                                ! cmp ah, [edi+1]
                                ! jne pre_loop
                                ! inc ecx
                                ! inc edi
                                ! dec ebx
                                ! jnz sub_loop
                          
                              ' %%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          
                                ! pop eax
                                ! add eax, lnsrc
                                ! jmp Outa_Here
                          
                              nMatch:
                                ! mov eax, -1
                          
                              Outa_Here:
                                ! mov FUNCTION, eax
                          
                          END FUNCTION
                          
                          ' ########################################################################
                          
                          FUNCTION PBMAIN
                              LOCAL fl AS LONG
                              REDIM patterns$(25)
                          
                              OPEN "F:\PBDLL60\WINAPI\win32api.inc" FOR BINARY AS #1
                          '    OPEN "C:\WINNT5\SYSTEM32\SHELL32.DLL" FOR BINARY AS #1
                              fl = LOF(1)
                              GET$ #1,fl,a$
                              CLOSE #1
                          
                              patterns$(0)  = "VerQueryValue"
                              patterns$(1)  = "GetFileVersionInfo"
                              patterns$(2)  = "GetFileVersionInfoSize"
                              patterns$(3)  = "VerInstallFile"
                              patterns$(4)  = "VerFindFile"
                              patterns$(5)  = "SystemTimeToVariantTime"
                              patterns$(6)  = "VariantTimeToSystemTime"
                              patterns$(7)  = "VarDateFromStr"
                              patterns$(8)  = "WinExecError"
                              patterns$(9)  = "SHFreeNameMappings"
                              patterns$(10) = "DragFinish"
                              patterns$(11) = "DragAcceptFiles"
                              patterns$(12) = "Shell_NotifyIcon"
                              patterns$(13) = "ShellExecute"
                              patterns$(14) = "ShellAbout"
                              patterns$(15) = "SHFormatDrive"
                              patterns$(16) = "SHGetNewLinkInfo"
                              patterns$(17) = "SHGetFileInfo"
                              patterns$(18) = "SHFileOperation"
                              patterns$(19) = "SHAppBarMessage"
                              patterns$(20) = "FindExecutable"
                              patterns$(21) = "FindEnvironmentString"
                              patterns$(22) = "ExtractIconEx"
                              patterns$(23) = "ExtractIcon"
                              patterns$(24) = "ExtractAssociatedIcon"
                          
                              aln& = LEN(a$)
                          
                              LOCAL cnt AS LONG
                              cnt = 0
                              b$ = ""
                          
                              tc2& = GetTickCount
                              DO
                          
                                  rv& = ScanString(0,STRPTR(a$),aln&, _
                                      STRPTR(patterns$(cnt)),LEN(patterns$(cnt)))
                                  b$ = b$ + STR$(rv&)
                                  ! inc cnt
                              LOOP WHILE cnt < 25
                              tc2& = GettickCount - tc2&
                          
                              cnt = 0
                              b$ = b$ + $CRLF
                              tc& = GetTickCount
                              DO
                                  rv& = INSTR(a$, patterns$(cnt))
                                  b$ = b$ + STR$(rv&)
                                  ! inc cnt
                              LOOP WHILE cnt < 25
                              
                              tc& = GetTickCount - tc&
                              
                              b$ = B$ + $CRLF + "INSTR/MB = " + STR$(tc&/tc2&)
                          
                              MessageBox hWin&,BYVAL STRPTR(B$),"MB=" + STR$(tc2&) + "  INSTR=" + STR$(tc&), _
                                  %MB_OK OR %MB_ICONINFORMATION
                          
                          END FUNCTION

                          ------------------
                          Lance
                          PowerBASIC Support
                          mailto:[email protected][email protected]</A>
                          Lance
                          mailto:[email protected]

                          Comment


                          • #33
                            Lance, i'm just trying to get this working in my main project.

                            I dont like using variables like: rv&
                            So i'm defining them as LOCAL.

                            Just to be clear, a & signifies a LONG right ?

                            Comment


                            • #34
                              Lance,

                              Hmm.. Once i've got my BM working, i'll be able to use both.

                              I know InStr works better with smaller files..
                              And BM with larger files.

                              So i could easily put a If statement that chooses which method to
                              use depending on the Filesize.

                              I will have to do some testing across multiple platforms to find
                              a good value as to when BM should be run.

                              Thanks.

                              Comment


                              • #35
                                Does my (abeit untidy) test code above run for you? (Obviously it will need the paths edited accordingly), but if it runs then you are half-way to getting it working in your real program.

                                Yes, the "&" type-specifier suffix means a LONG integer.

                                This thread is now getting to long and is closed... if you require more help with this item, please start a new thread and include a link to this thread.

                                Thanks!



                                ------------------
                                Lance
                                PowerBASIC Support
                                mailto:[email protected][email protected]</A>
                                Lance
                                mailto:[email protected]

                                Comment

                                Working...
                                X