Announcement

Collapse
No announcement yet.

Test search algorithm

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Test search algorithm

    I have pasted in below a working version of a Boyer Moore string
    matching algorithm that is testing OK here. A Boyer Moore string
    matching algorithm differs from most search functions in that it
    requires a table to be loaded dynamically before the algorithm is
    run which reduces its suitability for short recursive searches.

    Its advantage is that once the table is loaded, it has a substantial
    speed advantage over more conventional search function if the
    data being searched for is over about 6 characters in length. This
    makes it suitable for searching very large bodies of information
    such as complete disk partitions, very large files etc ...

    What I need if anyone has the time is some independent testing as
    it appears to be working OK here on the range of test material I
    have run it on.

    The technique requires 2 functions to be run, pass the string being searched
    for to the "create_shift_table" function which creates the table
    in OLE string memory as its return value.

    The second function is the Boyer Moore search algorithm which takes the
    following parameters,

    Code:
    ByVal lp_shift as LONG, = StrPtr(shift_table$)
    ByVal startpos as LONG, = 0 based offset to start search from
    ByVal lpsource as LONG, = StrPtr(source$)
    ByVal lnsource as LONG, = len(source$)
    ByVal lptarget as LONG, = StrPtr(search$)
    ByVal lntarget as LONG  = len(search$)
    Apart from the address of the shift table which must be a string
    address, the function will search any address range, not just
    string data. The design of two seperate function is so that different
    types of searches can be performed, this table is set up for binary
    search so it will only do case sensitive search in string data.

    I am led to believe that J Moore coded Bob Boyer's original idea
    in PDP-10 assembler back in 1977 so coding this algo in a high
    performance language in x86 assembler is probably a reasonable
    way of doing justice to its design.

    Any feedback would be appreciated.

    Regards,

    [email protected]

    Code:
      ' #########################################################################
      
      FUNCTION create_shift_table(srch$) as STRING
      
          #REGISTER NONE
      
          LOCAL chTable as STRING     ' string to build table in
          LOCAL lpTable as LONG       ' address of table
          LOCAL lpSrch  as LONG       ' address of search string
          LOCAL ssl     as LONG       ' search string length
      
          chTable = space$(1024)      ' allocate 256 * 4 bytes
          ssl     = len(srch$)        ' get search string length
          lpSrch  = StrPtr(srch$)     ' get search string address
          lptable = StrPtr(chTable)   ' get table address
      
        ' ------------------------------------------------------
        ' fill DWORD table with the length of the search string
        ' ------------------------------------------------------
      
          ! cld                   ; clear direction flag to read forward
      
          ! mov ecx, 256
          ! mov edi, lpTable
          ! mov eax, ssl          ; write search string length to table
          ! rep stosd
      
          ! mov ecx, ssl          ; put search string length in ECX
          ! dec ecx
          ! mov esi, lpSrch       ; address search string
          ! mov edi, lpTable      ; table address
      
          ! xor eax, eax          ; zero EAX
      
        ' -----------------------------------------------
        ' write the shift vaue for each search character
        ' into the correct position in the shift table
        ' -----------------------------------------------
        Write_Shift_Characters:
          ! mov al, [esi]         ; get the character
          ! inc esi
          ! mov [edi+eax*4], ecx  ; write shift for each character
          ! dec ecx               ; to ascii location in table
          ! jnz Write_Shift_Characters
      
          FUNCTION = chTable
      
      END FUNCTION
      
      ' #########################################################################
      
      FUNCTION BMsearch(ByVal lp_shift as LONG, _
                        ByVal startpos as LONG, _
                        ByVal lpsource as LONG, _
                        ByVal lnsource as LONG, _
                        ByVal lptarget as LONG, _
                        ByVal lntarget as LONG) as LONG
      
          #REGISTER NONE
      
          LOCAL resetpos as LONG
          LOCAL buffrend as LONG
      
          ! mov esi, lpsource     ; source address
          ! mov edi, lpTarget     ; target address
      
          ! mov ecx, lntarget
          ! dec ecx               ; correct back one to read last character pair
      
          ! add esi, ecx          ; set esi to first end compare
          ! add edi, ecx          ; set edi to first end compare
      
          ! mov resetpos, edi     ; set reset position for shift loop
      
          ! mov ebx, esi          ; use EBX as shift location counter
      
        ' ------------------------------
        ' set end of source in buffrend
        ' ------------------------------
          ! mov edx, lpsource     ; source address
          ! add edx, lnsource     ; add source length
          ! inc edx               ; end of buffer + 1
          ! mov buffrend, edx
      
          ! mov edx, lp_shift     ; use EDX as address of shift table
      
          ! dec lntarget          ; dec target length for shift loop
          ! xor eax, eax          ; clear EAX
      
          ! add esi, startpos     ; move ESI forward by starting position
      
        bmMainLoop:
          ! mov al, [esi]
          ! dec esi
          ! dec edi
          ! cmp al, [edi+1]
          ! jne bmSetShift
          ! dec ecx
          ! jnz bmMainLoop        ; if ECX = 0, match has occurred
      
          ! jmp bmCalcLength      ; exit on match
      
        bmSetShift:
          ! add ebx, [edx+(eax*4)]; add shift to location counter
          ! mov esi, ebx          ; copy location counter + shift to ESI      
          ! mov ecx, lntarget     ; reset ECX to target length
          ! mov edi, resetpos     ; reset EDI to end char of target string
          ! cmp esi, buffrend     ; test end of buffer condition
          ! jbe bmMainLoop        ; if not end of buffer, do next string compare
      
          ! jmp bmNoMatch         ; else set no match and exit
      
        bmCalcLength:
          ! sub esi, lpsource
          ! mov eax, esi
          ! jmp bmexitpos
      
        bmNoMatch:
          ! mov eax, -1
      
        bmexitpos:
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' #########################################################################
    ------------------


    [This message has been edited by Steve Hutchesson (edited April 02, 2001).]
    hutch at movsd dot com
    The MASM Forum

    www.masm32.com

  • #2
    Since I have no idea of what kind of data you are searching for,
    I will give you an example of my particular kind of "search
    engine" and you can take it from there if you want.

    SearchString$ = {something$}
    Open "somefile.txt" for binary as #1
    get$ #1,16000,t1$

    do
    get$ #1,16000 t2$

    test$ = t1$ + t2$
    p = instr(SearchString$,test$)
    if p > 0 then
    REM Do something with it
    end if

    if lof(1) = loc(1) then exit loop

    t1$ = t2$
    loop
    close #1

    This search technique, to me at least, is very fast. It's also
    short, sweet and to the point. This may give you some kind of
    comparison. Hope it helps.

    ------------------


    [This message has been edited by Mel Bishop (edited April 02, 2001).]
    There are no atheists in a fox hole or the morning of a math test.
    If my flag offends you, I'll help you pack.

    Comment


    • #3
      Mel,

      I have a test setup for 3 search methods, the Boyer Moore, a
      binary search algo I posted recently and the standard INSTR.

      The test times for each on a single string of 34 meg with the
      text being searched for at the end is as follows,

      Boyer Moore = 315 milliseconds
      BinSearch = 450 milliseconds
      INSTR = 745 milliseconds

      The Boyer Moore starts with the overhead of loading the shift
      table and is still considerably faster. It is an algorithm
      originally designed for binary search so the comparison is fair
      as all 3 methods are capable of searching the full ascii character
      range.

      It is not well suited for recursive search with different strings
      of 6 characters or under because of the overhead of creating the
      table but if you have to search every file on a HDD for a virus
      pattern, perform pattern matching on very large files or perform
      text search on a very large number of files, it has generally
      outperformed anything else.

      What I need if anyone has the time and does this type of work where
      speed truly matters is some independent testing. If you don't do
      this type of patern matching, use INSTR, it works well in most
      normal instances.

      Regards,

      [email protected]

      ------------------
      hutch at movsd dot com
      The MASM Forum

      www.masm32.com

      Comment


      • #4
        Steve --
        In BinSearch you compare bytes from left to right.
        Did you try simply to change an order (from right to left) ?


        ------------------
        E-MAIL: [email protected]

        Comment


        • #5
          Binary searches don't search like that, they split in two and jump thus binary, they are not exaustive searches. you check the middle of the range and if higher then search the middle of the top half etc to you find your target

          Binary Searches only work on sorted data, as do the slightly superior Interpolation searches (which guess at the distance of the jump).

          My ASM is pretty poor but I am interested in this alogorithm and would like to hear more about how it works, do you have any links of spec you are coding from?

          Cheers


          ------------------
          Paul Dwyer
          Network Engineer
          Aussie in Tokyo

          [This message has been edited by Paul Dwyer (edited April 03, 2001).]

          Comment


          • #6
            Paul,

            The Boyer-Moore search algorithm was part of my college curriculum
            "ages" ago, you should be able to find a description of it fairly easily.

            Steve, next time I run into a stumbling block and need some time out,
            I'll give your algorithm a run on my Win98 SE...



            ------------------
            Bernard Ertl
            Bernard Ertl
            InterPlan Systems

            Comment


            • #7
              Semen,

              A reverse compare in many instances in the BinSearch algo would
              make the subloop exit faster as it would have less comparison to
              perform to find a mismatch and it is a variation I will try when
              I have a bit more time but it does not address the area where
              the real speed difference is occurring.

              The Boyer Moore algo is faster in its forward read speed that a
              normal byte scan and as the Boyer Moore search string gets longer
              the forward read speed gets faster as it can step forward without
              having to compare all bytes.

              Paul,

              We may just have a terminology difference here, I use binary search
              to differentiate from text search which is a different type of
              string matching algorithm in the historical data. The usage I have
              is for a search algorithm to handle the full 256 byte range.

              I could not get a copy of Bob Boyer's original paper from 1977 even
              though it is referenced in many places on the net but a google search
              on "boyer moore" will turn you up a lot of stuff on search algorithms
              including some java animations of variations and different types.

              From the info I could find, Bob Boyer's original idea had the best
              logic and it seemed that the variations (horsepool etc ...) were
              based on character count maths and the technology of ANSI C string
              handling rather than how fast it could be run through a computer.

              Bern,

              I am glad someone knew what a Boyer Moore algo was, any testing
              you have time to do will be most appreciated as I have given it
              a hammering over a wide range of things but there is always the
              risk that I have missed a circumstance where it may fail.

              Regards,

              [email protected]



              ------------------
              hutch at movsd dot com
              The MASM Forum

              www.masm32.com

              Comment


              • #8
                OK Steve, I reached a good stopping point for the day with my code
                and I tested your code.

                I used the following test program in conjuction with the Forum4.Dat
                file which Borje made for POFFS V1. The DB is roughly 12MB. I only
                tested the cases listed in the code.

                First my code, results of tests below:

                Code:
                #COMPILE EXE
                #INCLUDE "WIN32API.INC"  'For GetTickCount
                
                
                'Steve's ASM Boyer-Moore functions saved to an .INC file
                #INCLUDE "C:\PBDLL60\DOWNLO~1\BOYERM~1\SEARCH.INC"
                
                
                FUNCTION PBMAIN() AS LONG
                
                
                   'I hard coded the path to the .DAT file, you would likely need to change to use this program..
                   OPEN "C:\PBDLL60\DOWNLOADS\POFFS\FORUM4.DAT" FOR BINARY AS 1
                   GET$ 1, LOF( 1), A$
                   CLOSE 1
                
                
                   AtStartOfDB$ = "Has anyone experienced this 'bug' too?"
                   InMiddleOfDB$ = "PB/DLL and all other PowerBASIC compilers are compatible with Windows 2000."
                   NearEndOfBD$ = "You are in the wrong place for this part of this discussion  "
                   NotInDB$ = "*&^@$*&%[email protected]!VH Y^WV$*#RT$GV FO*&H!MC()&EYM(@&TGDNI&^ASRao*)^*#$"
                
                
                '   B$ = AtStartOfDB$
                '   B$ = InMiddleOfDB$
                '   B$ = NearEndOfBD$
                   B$ = NotInDB$
                
                
                   Start??? = GetTickCount
                   Table$ = create_shift_table( B$)
                   Address& = BMsearch( STRPTR( Table$), 0, STRPTR( a$), LEN( A$), STRPTR( B$), LEN( B$))
                   Fin??? = GetTickCount
                
                
                   Start2??? = GetTickCount
                   Address2& = INSTR( 1, A$, B$)
                   Fin2??? = GetTickCount
                
                
                   MSGBOX "Boyer Moore found B$ in the database at position" + STR$( Address&+1) + " in" + STR$( Fin??? - Start???) + " ticks." + $CRLF + _
                          "INSTR found B$ in the database at position" + STR$( Address2&) + " in" + STR$( Fin2??? - Start2???) + " ticks."
                
                
                END FUNCTION
                In all cases, your BMSearch returned the same result as INSTR.
                It appears to be working correctly. I would need to test searches
                starting at offset values and using search strings with CHR$(0), etc.,
                before saying it works 100% though.

                On my Win98SE I got the following results:

                Code:
                                  Boyer-Moore      INSTR
                AtStartOfDB$ ---> 0 ticks          0 ticks
                
                
                InMiddleOfDB$ --> 26-30 ticks      89-91 ticks
                
                
                NearEndOfBD$ ---> 64-66 ticks      189-195 ticks
                
                
                NotInDB$ -------> 45-50 ticks      190-201 ticks
                Thanks for the code Steve!




                ------------------
                Bernard Ertl
                Bernard Ertl
                InterPlan Systems

                Comment


                • #9
                  Hi Steve

                  the search fails on similar len searches close to one another
                  as in the following test case:

                  Code:
                  FUNCTION PBMAIN() AS LONG
                     LOCAL sSearch AS STRING
                     LOCAL sPattern AS STRING
                     LOCAL sTable AS STRING
                     LOCAL lPos AS LONG
                   
                     sSearch = "hello there1223 then what's that where1223 are you"
                      
                     sPattern = "where1223"
                      
                     sTable = create_shift_table( sPattern )
                     lPos = BMsearch( STRPTR( sTable), 0, STRPTR( sSearch), LEN( sSearch ), STRPTR( sPattern), LEN( sPattern ))
                     PRINT "BMSearch found '" + sPattern + "' at Pos: " + FORMAT$( lPos )
                      
                     PRINT "INSTR found '" + sPattern + "' at Pos: " + FORMAT$( INSTR(sSearch, sPattern) )
                   
                     WAITKEY$
                  END FUNCTION
                  ------------------

                  Comment


                  • #10
                    Florent,

                    Thanks for finding this one, failure to do last character compare
                    which was easy to fix but it caused another repeat error so I have some
                    more work to do in the testing and debugging.

                    Bern,

                    Thanks for doing the tests, I will try and get the next one a bit
                    more reliable.

                    Regards,

                    [email protected]

                    ------------------
                    hutch at movsd dot com
                    The MASM Forum

                    www.masm32.com

                    Comment

                    Working...
                    X