Announcement

Collapse
No announcement yet.

Tested Boyer Moore algo

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Tested Boyer Moore algo

    I would like to thank everone who has been patient enough through the
    development of this algorithm, I was unable to get any decent reference
    material on Bob Boyer's original design, the original is paper only
    archive so I have had to work out the logic from scratch based on the
    scant amount of detailed technical data around and keep testing it until
    it worked on all cases.

    The basics of the algorithm are that it scans the pattern being searched
    for backwards until it finds a mismatch, if the character is within the
    characters that are in the pattern, it calculates what is called a "Good
    suffix shift" but if the character that causes the mismatch is not within
    the pattern, it performs what is called a "bad character shift".

    Code:
                        |
        source  => xxxxxYxxxxxYxxxxxZ
        pattern => xxxxxZ
                        |
    
    The character Y is not in the search pattern and you cannot get a match by
    scanning back any further so the algo shifts the start position to the next
    position past that character and starts the reverse compare again.
                        |
        source  => xxxxxYxxxxxYxxxxxZ
        pattern =>       xxxxxZ
                        |
    The secret with a Boyer Moore search ago is the capacity make shifts of more
    than one character, in normal mixed characters as in binary or text there is
    a reasonable chance of getting a shift of the full length of the pattern
    which is a lot faster than scanning every character in the pattern.

    The Good Suffix Shift is derived from a table that is constructed before the
    search begins that fills every position with the length of the pattern, then
    overwrites the specific character position with the shift for its position in
    the pattern.

    Code:
    pattern => algorithm
      
               87654321     good suffix shift
      
               123456789    bad char shift
    By comparing the mismatch with the character in the table, it is easy to
    determine if the character is within the pattern or not, if its not, a "bad
    character shift" is performed, if it is, a "Good Suffix Shift" is calculated.

    Where I got into trouble was in calculating the Good Suffix Shift, there are
    some situations where you have repeat characters prefixed by a different
    character, EG Xyyyyyy. Because the table is overwritten at the same location
    with repeat characters, the "x" has the shift value of 1 where the "Y" has a
    value of 6 and this made the shift I was using go past the correct position
    with search patterns of this type in some instances.

    The solution was to calculate the number of matches that had been made in the
    compare loop and subtract that from the shift, check that it was not zero or
    less, set the shift at 1 if it was and then add the shift to the register
    that was the current position counter.

    The code design is based on testing of a number of alternative layouts, after
    the different calculation was added and with some instruction reordering, the
    additional 5 instructions only slowed it down by about 1 or 2 percent on the
    PIII 600 I write on and other variations did not effect its speed on the PIII
    much at all but the AMD that I also test with showed a reasonable range of
    speed difference so I optimised it to the AMD as there was no real difference
    on the PIII.

    The test code below is what I have used to test the algorithm, it shows after
    hundreds of thousands of searches that there are no misses so I think the
    algorithm is now safe for general purpose searching. The code,
    Code:
        Set_Suffix_Shift:
          ! add eax, ecx                ; add CMP count
          ! sub eax, cval               ; sub loop count
          ! cmp eax, 0                  ; test eax for zero
          ! jg  Add_Suffix_Shift
          ! mov eax, 1                  ; minimum shift is 1
      
        Add_Suffix_Shift:
          ! add esi, eax                ; add suffix shift
    can be replaced with a single instruction,
    Code:
        ! inc esi
    which produces one of the simplified variations of the Boyer Moore algo but
    it is measurably slower that the complete version.

    A little pure PowerBASIC for your pleasure.

    Regards,

    [email protected]

    test code
    ~~~~~~~~~
    Code:
          Open "win32api.inc" for Binary as #1
            Get$ #1,lof(1),a$
          Close #1
      
          rf&     = 0
          spos&   = 0
          ln&     = 8
          misses& = 0
      
          Do
            b$ = mid$(a$,spos&,ln&)
            ! inc spos&
            ! inc ln&
            If ln& > 240 Then ln& = 8
      
            rv& = BMBinSearch(0,StrPtr(a$),len(a$),StrPtr(b$),len(b$))
      
            If rv& = -1 Then
              ! inc misses&
            End If
      
            SetWindowText hWnd,ByCopy str$(rv&)+" "+str$(rf&)+" "+str$(misses&)
            ! inc rf&
          Loop while rf& < 700000
    Working Function
    ~~~~~~~~~~~~~~~~

    Code:
      ' #########################################################################
      
      FUNCTION BMBinSearch(ByVal startpos as LONG, _
                           ByVal lpSource as LONG,ByVal srcLngth as LONG, _
                           ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG
      
          #REGISTER NONE
      
          LOCAL eLen    as DWORD
          LOCAL cval    as DWORD
          LOCAL shift_table as STRING * 1024
      
          ! cmp subLngth, 1
          ! jg OKsize
          ! mov eax, -2                 ; string too short, must be > 1
          ! jmp Cleanup
        OKsize:
      
          ! mov esi, lpSource
          ! add esi, srcLngth
          ! sub esi, subLngth
          ! mov eLen, esi            ; set Exit Length
      
        ' ----------------------------------------
        ' load shift table with value in subLngth
        ' ----------------------------------------
          ! mov ecx, 256
          ! mov eax, subLngth
          ! lea edi, shift_table
          ! rep stosd
      
        ' ----------------------------------------------
        ' load decending count values into shift table
        ' ----------------------------------------------
          ! mov ecx, subLngth           ; SubString length in ECX
          ! dec ecx                     ; correct for zero based index
          ! mov esi, lpSubStr           ; address of SubString in ESI
          ! lea edi, shift_table
      
          ! xor eax, eax
      
        Write_Shift_Chars:
          ! mov al, [esi]               ; get the character
          ! inc esi
          ! mov [edi+eax*4], ecx        ; write shift for each character
          ! dec ecx                     ; to ascii location in table
          ! jnz Write_Shift_Chars
      
        ' -----------------------------
        ' set up for main compare loop
        ' -----------------------------
          ! mov ecx, subLngth
          ! dec ecx
          ! mov cval, ecx
      
          ! mov esi, lpSource
          ! mov edi, lpSubStr
          ! add esi, startpos           ; add starting position
      
          ! mov edx, subLngth           ; pattern length in edx
      
          ! jmp Cmp_Loop
      
      ' *********************** Loop Code ***************************
      
        Set_Suffix_Shift:
          ! add eax, ecx                ; add CMP count
          ! sub eax, cval               ; sub loop count
          ! cmp eax, 0                  ; test eax for zero
          ! jg  Add_Suffix_Shift
          ! mov eax, 1                  ; minimum shift is 1
      
        Add_Suffix_Shift:
          ! add esi, eax                ; add suffix shift
      
        Pre_Cmp:
          ! cmp esi, eLen               ; test exit length
          ! ja  No_Match
          ! xor eax, eax                ; reset EAX
          ! mov ecx, cval               ; reset counter in compare loop
      
        Cmp_Loop:
          ! mov al, [esi+ecx]
          ! cmp al, [edi+ecx]           ; cmp characters in ESI / EDI
          ! jne Get_Shift               ; if not equal, get next shift
          ! dec ecx
          ! jns Cmp_Loop
      
          ! jmp Match
      
        Get_Shift:
          ! mov eax, shift_table[eax*4] ; get char shift value
          ! cmp eax, edx                ; is eax pattern length ?
          ! jne Set_Suffix_Shift        ; if not, jump to Calc_Suffix_Shift
          ! lea esi, [esi+ecx+1]        ; add bad char shift
          ! jmp Pre_Cmp
      
      ' ***************************************************************
      
        Match:
          ! sub esi, lpSource           ; sub source from ESI
          ! mov eax, esi                ; put length in eax
          ! jmp Cleanup
      
        No_Match:
          ! mov eax, -1                 ; set value for no match
      
        Cleanup:
      
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' ########################################################################

    ------------------
    hutch at movsd dot com
    The MASM Forum - SLL Modules and PB Libraries

    http://www.masm32.com/board/index.php?board=69.0

    #2
    Steve,

    Thank for your writeup on the algo, and your new testcode.

    I will try out the new one in a moment!



    - Nathan.

    Comment


      #3
      Only home for a short while. Saw this one and did a quick test, using
      same routines as before, 80KB text, 3000 phrases. BMBinSearch clocked
      in just about twice as fast as INSTR, same number of hits. Seems to work
      fine. Must say, impressive work, Steve. You definitely are a genious
      at this.

      Have to go, will look more closely at it when I get back.


      ------------------

      Comment


        #4
        Steve,

        I tested your new BMBinSearch function today. Works great!


        Before i ran the new function, i ran several benchmarks on a 500kb file,
        these were retruning highly fluctuating speeds of 500-1000mb/sec.

        Then i copied in the new function over the previous one, and it now seem to
        run more smoothly at ~800mb/sec. Are you aware of this?

        It was correctly finding strings from 40bytes to 256bytes..

        I will be keeping this function in the code, to see how it tests up.
        So far so good!

        - Nathan

        Comment


          #5
          Steve's code should qualify as "code of the year" IMHO. I took the liberty
          to add and alter some and create a bmINSTR out of it. Test with array of
          3000 phrases, 2-47 bytes long, on 80 KB text shows it to be more than twice
          as fast as plain INSTR on doing repeated search for all occurrences, and it
          seems to find all. The BM algo works best with long strings, but even from
          3-byte long search it seems to be faster than INSTR. Therefore, I added
          check for length and use INSTR < 3 bytes, else bmINSTR does the work.

          When doing single search, just to see if a string is there, it wasn't any
          faster though. Tested in POFFS engine, but INSTR was just as fast. Following
          is a fraction slower than original code, but can work as easy replacement for
          INSTR, since you can use same syntax.
          Code:
          '¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
          ' INSTR replacement, very fast when doing repeated search for long strings
          ' in long strings. Original code by Steve Hutchesson.
          ' Adjusted to work as direct replacement for INSTR by Borje Hagsten.
          ' No negative startpos for backwards search though, sorry.
          '
          ' startpos can range from 1 to LEN(MainStr) - LEN(SearchStr). 
          ' Returned position is 1-based, just like INSTR. Return is zero if not found.
          '¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
          FUNCTION bmINSTR(ByVal startpos AS LONG, m AS STRING, s AS STRING) AS LONG
            #REGISTER NONE  
            LOCAL eLen     as LONG, cval     as LONG
            LOCAL lpSource as LONG, lnSource as LONG, lpSearch as LONG, lnSearch as LONG
            LOCAL shift_table as STRING * 1024
           
            ! mov esi, s              ; store Search string ptr and len
            ! mov esi, [esi]
            ! cmp esi, 0              ; if edi is zero - no search string
            ! je Cleanup2             ; then get out
            ! mov lpSearch, esi
            ! mov esi, [esi-4]
            ! mov lnSearch, esi
           
            ! cmp lnSearch, 3         ; check search len
            ! jb ShortWordScan        ; if shorter than 3, use INSTR instead
           
            ! mov esi, m              ; store Main string ptr and len
            ! mov esi, [esi]
            ! cmp esi, 0              ; if esi is zero - no search string
            ! je Cleanup2             ; then get out
            ! mov lpSource, esi
            ! mov esi, [esi-4]
            ! mov lnSource, esi
           
            ! cmp startpos, 0         ; check startpos
            ! je OKsize2              ; if zero, ok
            ! dec startpos            ; else assume > 0 - decrease, since bmINSTR is 0-based
           
            OKsize2:
              ! mov esi, lpSource
              ! add esi, lnSource
              ! sub esi, lnSearch
              ! mov eLen, esi            ; set Exit Length
           
              ' ----------------------------------------
              ' load shift table with value in lnSearch
              ' ----------------------------------------
              ! mov ecx, 256
              ! mov eax, lnSearch
              ! lea edi, shift_table
              ! rep stosd
           
              ' ----------------------------------------------
              ' load decending count values into shift table
              ' ----------------------------------------------
              ! mov ecx, lnSearch           ; SubString length in ECX
              ! dec ecx                     ; correct for zero based index
              ! mov esi, lpSearch           ; address of SubString in ESI
              ! lea edi, shift_table
              ! xor eax, eax
           
            Write_Shift_Chars2:
              ! mov al, [esi]               ; get the character
              ! inc esi                     ; next one
              ! mov [edi+eax*4], ecx        ; write shift for each character
              ! dec ecx                     ; to ascii location in table
              ! jnz Write_Shift_Chars2
           
              ' -----------------------------
              ' set up for main compare loop
              ' -----------------------------
              ! mov ecx, lnSearch
              ! dec ecx
              ! mov cval, ecx
           
              ! mov esi, lpSource
              ! mov edi, lpSearch
              ! add esi, startpos           ; add starting position
           
              ! mov edx, lnSearch           ; pattern length in edx
              ! jmp Cmp_Loop2
           
            '*********************** Loop Code ***************************
            Set_Suffix_Shift2:
              ! add eax, ecx                ; add CMP count
              ! sub eax, cval               ; sub loop count
              ! cmp eax, 0                  ; test eax for zero
              ! jg  Add_Suffix_Shift2
              ! mov eax, 1                  ; minimum shift is 1
           
            Add_Suffix_Shift2:
              ! add esi, eax                ; add suffix shift
           
            Pre_Cmp2:
              ! cmp esi, eLen               ; test exit length
              ! ja  No_Match2
              ! xor eax, eax                ; reset EAX
              ! mov ecx, cval               ; reset counter in compare loop
           
            Cmp_Loop2:
              ! mov al, [esi+ecx]
              ! cmp al, [edi+ecx]           ; cmp characters in ESI / EDI
              ! jne Get_Shift2              ; if not equal, get next shift
              ! dec ecx
              ! jns Cmp_Loop2
           
              ! jmp Match2
           
            Get_Shift2:
              ! mov eax, shift_table[eax*4] ; get char shift value
              ! cmp eax, edx                ; is eax pattern length ?
              ! jne Set_Suffix_Shift2       ; if not, jump to Calc_Suffix_Shift
              ! lea esi, [esi+ecx+1]        ; add bad char shift
              ! jmp Pre_Cmp2
              ' ***************************************************************
           
            Match2:
              ! sub esi, lpSource           ; sub source from ESI
              ! mov eax, esi                ; put length in eax
              ! inc eax                     ; adjust for 1-based return
              ! jmp Cleanup2                ; exit
           
            No_Match2:
               ! mov eax, 0                  ; set value for no match
           
            Cleanup2:
              ! mov FUNCTION, eax           ; return 1-based position
              EXIT FUNCTION
           
          ShortWordScan:                     ' if search len is < 3
            FUNCTION = INSTR(startpos, m, s)
           
          END FUNCTION

          ------------------

          Comment


            #6
            Borje,

            May come in handy sometime in the future

            BTW: What's the POFFS engine? I'm new to PB, and aren't familar with that
            term yet..

            - Nath.

            Comment


              #7
              Nathan,

              Thanks for testing this algo, it makes sense that its speed will
              be a bit more even as it has a slightly longer instruction path
              so it is less susceptible to the normal task switching than the
              earlier version. I have found that very short loops are more
              effected by other loads on the processor than slightly longer
              code.

              Borje,

              Your variation looks fine, it does make it easier to use than the
              direct addressing in the original version. Your testing show some
              hardware difference with the minimum string length, on both of my
              boxes the threshold for a byte scanner was about 6 characters in
              pattern length with the BM starting to be faster over that length.

              Generally a byte scanner is faster on shorter strings because it
              does not have the startup overhead that a BM has, a BM algo must
              create and load a 256 entry table and rewrite parts of the table for
              the characters in the pattern so it throws away about 300 character
              compares in setup time to a normal byte scanner.

              This means there is a minimum string length for it to be efficient
              in terms of speed as well as a minimum pattern length but if both
              of these considerations are met, a BM algo has and advantage in terms
              of the shift length it can make which makes it faster in most instances.

              What I have tried to do with this version is address the weakness of
              a BM algo which occurs in terms of last character match count. If there
              is a high frequency of last character of the pattern in the source
              data, the BM does more work, in a classic byte scanner, the first
              character frequency effects the speed and that type of algo has to
              be tuned so that its branching path is short enough to recover quickly.

              Regards,

              [email protected]

              ------------------
              hutch at movsd dot com
              The MASM Forum - SLL Modules and PB Libraries

              http://www.masm32.com/board/index.php?board=69.0

              Comment


                #8
                String Searching over Small Alphabets

                We propose a new string searching procedure inspired
                by the Boyer–Moore algorithm.

                Comment

                Working...
                X
                😀
                🥰
                🤢
                😎
                😡
                👍
                👎