Announcement

Collapse
No announcement yet.

New Boyer Moore search algorithm

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • New Boyer Moore search algorithm

    The last Boyer Moore algo I had a go at failed on repeat sequences of
    characters and it stemmed from trying to write the algorithm with only
    the "good suffix" shift table. I hunted up enough information to write
    a dual heuristic version that does both the "bad character" shift and
    the "good suffix" shift and this one has been testing for weeks with no
    problems and it is faster as well.

    It is an algorithm with a reasonably narrow usage, it was originally
    designed back in 1977 for very high speed pattern matching where the
    overhead of constructing the shift table did not matter. With normal
    string search requirements, INSTR is easier to use and more flexible.

    If you have the need to search very large blocks of information, this
    algorithm is well suited to the task. It has the characteristic of
    getting faster as the search pattern gets longer. It starts to get
    faster than a normal search algorithm once the pattern to be searched
    for is over 6 characters long. With long search patterns, (70 - 80 characters)
    I am getting over 300 meg/sec on my PIII 600 so it is reasonably fast.

    A little piece of pure PowerBASIC for your pleasure.

    [email protected]

    Code:
      ' #########################################################################
      
      FUNCTION BMBinSearch(ByVal startpos as LONG, _
                           ByVal lpSource as LONG,ByVal srcLngth as LONG, _
                           ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG
      
          #REGISTER NONE
      
          LOCAL cval    as DWORD
          LOCAL ExitLen as DWORD
          LOCAL shift_table as STRING * 1024  ' 256 DWORDs
      
          ! cmp subLngth, 1
          ! jg tooShort
          ! mov eax, -2                 ; string too short, must be > 1
          ! jmp Cleanup
        tooShort:
      
          ! mov esi, lpSource
          ! add esi, srcLngth
          ! sub esi, subLngth
          ! mov ExitLen, esi            ; set Exit Length
      
        ' ----------------------------------------
        ' load shift table with value in subLngth
        ' ----------------------------------------
          ! mov ecx, 256
          ! mov eax, subLngth
          ! lea edi, shift_table
          ! rep stosd
      
        ' ----------------------------------------------
        ' load decending count values into shift table
        ' ----------------------------------------------
          ! mov ecx, subLngth           ; SubString length in ECX
          ! dec ecx                     ; correct for zero based index
          ! mov esi, lpSubStr           ; address of SubString in ESI
          ! lea edi, shift_table
      
          ! xor eax, eax
      
        Write_Shift_Chars:
          ! mov al, [esi]               ; get the character
          ! inc esi
          ! mov [edi+eax*4], ecx        ; write shift for each character
          ! dec ecx                     ; to ascii location in table
          ! jnz Write_Shift_Chars
      
        ' -----------------------------
        ' set up for main compare loop
        ' -----------------------------
          ! mov ecx, subLngth
          ! dec ecx
          ! mov cval, ecx
      
          ! mov esi, lpSource
          ! mov edi, lpSubStr
          ! mov ebx, esi                ; EBX as location pointer
          ! add esi, startpos           ; add starting position
      
        Cmp_Loop:
          ! xor eax, eax                ; zero EAX
          ! mov al, [esi+ecx]
          ! cmp al, [edi+ecx]           ; cmp characters in ESI / EDI
          ! jne Set_Shift               ; if not equal, get next shift
          ! dec ecx
          ! jns Cmp_Loop
      
        Match:
          ! sub esi, lpSource           ; sub source from ESI
          ! mov eax, esi                ; put length in eax
          ! jmp Cleanup
      
        Set_Shift:
          ! mov edi, lpSubStr           ; restore sub string address
          ! mov edx, shift_table[eax*4] ; get char shift value
          ! cmp edx, subLngth           ; is it pattern length ?
          ! jne Suffix_Shift            ; if not, jump to Suffix_Shift
      
        Bad_Char_Shift:
          ! lea ebx, [ebx+ecx+1]        ; add bad char shift
          ! mov ecx, cval               ; reset counter in compare loop
          ! mov esi, ebx
          ! cmp esi, ExitLen
          ! jle Cmp_Loop
      
          ! mov eax, -1
          ! jmp Cleanup
      
        Suffix_Shift:
          ! add ebx, edx                ; add suffix shift
          ! mov ecx, cval               ; reset counter in compare loop
          ! mov esi, ebx
          ! cmp esi, ExitLen
          ! jle Cmp_Loop
      
          ! mov eax, -1                 ; set value for no match
      
        Cleanup:
      
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' ########################################################################
    ------------------
    hutch at movsd dot com
    The MASM Forum

    www.masm32.com

  • #2
    This is a gem Steve, thanks very much -
    Zippety Software, Home of the Lynx Project Explorer
    http://www.zippety.net
    My e-mail

    Comment


    • #3
      Steve could you (or somebody) post demo source on how to call your BM function? And when would you use it instead of INSTR? You said it can search for patterns... what is a pattern - anything like this - "*.exe" ?
      Thanks! I did a bit of Boyer-Moore stuff in my old pascal days at school but I've forgotten it all


      ------------------
      -

      Comment


      • #4
        There is very interesting Shift-And algorithm (Udi Manber & Sun Wu; published in 1992 - see www.arizona.edu).

        I translated first part of it (exact comparasion) to PB.
        Should be very fast (of course, it's necessary to convert to inline Asm).
        Limitation: length of pattern is 1 .. 32 (enough for search in dictionaries).

        Code:
           #Compile Exe
           #Dim All
           #Register None
           #Include "Win32Api.Inc"
        
           Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
              
             Dim i As Dword, j As Dword, r As Dword, m As Dword, CVTbl(255) As Dword
        
             m = Len(Pattern)
             j = 1
             For i = 1 To m
                CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
                If i <> m Then j = j + j
             Next
             
             For i = StartPos To Len(Txt)
                r = (r + r + 1) And CVTbl(Asc(Txt, i))
                If (r And j) Then Function = i - m + 1: Exit For
             Next
        
           End Function
        
           $Text = "Something Semen Matusovski Something"
           $Pattern = "Matusov"
           %StartPos = 7
           
           Function PbMain
              If Instr(%Startpos, $Text, $Pattern) = ExactShiftAND(%StartPos, $Text, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal"
           End Function
        ------------------
        E-MAIL: [email protected]

        Comment


        • #5
          Wayne,

          The algorithm is primarily for searching for BYTE sequences within larger bodies
          of information, it can perform text search with no problems but because of its
          primary binary usage, it requires the address of both the source memory and the
          memory of the pattern. It also requires the length of both the source and the
          pattern.

          To use it you need the BYTE or STRING data in memory and have its address,

          Code:
              LOCAL slen as LONG
              LOCAL src  as LONG
              LOCAL ln   as LONG
              LOCAL pat  as LONG
              LOCAL pln  as LONG
              LOCAL spos as LONG
          
              src  = StrPtr(txt$)  ; source BYTE or text data address
              ln   = len(txt$)     ; this would usually be supplied when you open the file
              pat  = StrPtr(pattern$)  ; pattern address
              pln  = len(pattern$) ; pattern length
              spos = 0             ; where to start in the file (0 based offset)
          
              slen = BMBinSearch(spos,src,ln,pat,pln)
          Typically it is used for things like searching a full drive for a virus pattern,
          searching a very large data base file for text or index keys, I have heard of
          some exotic uses of searching for prime numbers in big files but its strength
          is in searching very large bodies of data for a specific BYTE or STRING pattern.

          In performance terms, it starts to be faster than normal byte scanner type
          search algorithms at about 6 characters in length and if you are searching for
          long patterns, it gets much faster as it skips larger blocks of BYTE data on
          a mismatch. The other factor is that it constructs a 256 entry table at startup
          so it pays a penalty on short strings.

          For most of my normal search requirements I use the INSTR function because it
          is faster on short patterns and on shorter strings. A Boyer Moore search algo
          has its advantage on long linear searches.

          Regards,

          [email protected]

          ------------------
          hutch at movsd dot com
          The MASM Forum

          www.masm32.com

          Comment


          • #6
            Semen,

            It is an interesting looking algorithm, I have not digested how
            it works but there are many specialised applications for search
            algorithms, the dictionary search appears to be one of them.

            I would be interested to see what it performed like when coded
            in assembler, there should be some very good speed gains if it
            was written carefully.

            Regards,

            [email protected]

            ------------------
            hutch at movsd dot com
            The MASM Forum

            www.masm32.com

            Comment


            • #7
              I rewrote to inline assembler and found that Instr works very strange.
              If to change String$(1000, "M 1234567890") to something else (not M as first letter), Instr will be faster in some times.
              Actually, it's necessary to take real data, but it looks that in general Instr works better.

              Probably, my assembler code is in-effective. But I think that main reason is algorithm as it is.
              Like I wrpte, I "translated" first variant of this algorithm.
              Second (main) variant is designed to find non-exact string (one, two ... letters difference).
              Code (and time) is very similar to first variant.
              Hope that in this case ShiftAnd will demonstrate own advantages.

              Code:
                 #Compile Exe
                 #Dim All
                 #Register None
                 #Include "Win32Api.Inc"
              
                 Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
                    #Register None
                    Dim sCVTbl As Static String * 1024, Tmp1 As Static Dword, Tmp2 As Static Dword
              
                    ! MOV EDX, Pattern ' VarPtr(Pattern)
                    ! MOV EDI, [EDX]   ' StrPtr(Pattern)
                    ! MOV ECX, EDI
                    ! SUB ECX, 4
                    ! MOV ECX, [ECX]   ' Len(Pattern)
                    ! SUB ECX, 1
                    ! MOV Tmp2, ECX
                    ! ADD ECX, EDI
              
                    ! LEA EBX, sCVTbl
                    ! MOV Tmp1, EBX
                    ! MOV EAX, 1
                    
               ExactShiftAND_Lb1:
                    ! MOV EDX, 0
                    ! MOV DL, [EDI]
                    ! ADD EDX, EDX
                    ! ADD EDX, EDX
                    ! ADD EDX, Tmp1
                    ! MOV EBX, [EDX]
                    ! OR  EBX, EAX
                    ! MOV [EDX], EBX
                    ! CMP EDI, ECX
                    ! JGE ExactShiftAND_Lb2
                    ! INC EDI
                    ! ADD EAX, EAX
                    ! JMP ExactShiftAND_Lb1
                   
               ExactShiftAND_Lb2:
                    ! MOV EDX, Txt     ' VarPtr(Txt)
                    ! MOV EDI, [EDX]   ' StrPtr(Txt)
                    ! MOV ECX, EDI
                    ! SUB ECX, 4
                    ! MOV ECX, [ECX]   ' Len(Txt)
                    ! DEC EDI
                    ! ADD Tmp2, EDI
                    ! ADD ECX, EDI
              
                    ! MOV EBX, StartPos
                    ! ADD EDI, [EBX]
               
                    ! MOV ESI, 0
               ExactShiftAND_Lb3:
                    ! ADD ESI, ESI
                    ! INC ESI
                   
                    ! MOV EDX, 0
                    ! MOV DL, [EDI]
                    ! ADD EDX, EDX
                    ! ADD EDX, EDX
                    ! ADD EDX, Tmp1
                    ! AND ESI, [EDX]
                    ! MOV EBX, ESI
                    ! AND EBX, EAX
                    ! JE  ExactShiftAND_Lb4
                    ! SUB EDI, Tmp2
                    ! JMP ExactShiftAND_Lb6
              
               ExactShiftAND_Lb4:
                    ! CMP EDI, ECX
                    ! JGE ExactShiftAND_Lb5
                    ! INC EDI
                    ! JMP ExactShiftAND_Lb3
                   
              ExactShiftAND_Lb5:
                    ! MOV EDI, 0
                   
              ExactShiftAND_Lb6:
                    ! MOV Function, EDI
                 End Function
              
                 Function PbMain
                    Dim Txt As String, Pattern As String, StartPos As Long
                    Txt = String$(1000, "M 1234567890") + "Semen Matusovski"
                    Pattern = "Matusov"
                    StartPos = 4
                    
                    If Instr(Startpos, Txt, Pattern) <> ExactShiftAND(StartPos, Txt, Pattern) Then MsgBox "Not equal": Exit Function
                    
                    Dim i As Long, j As Long, t1 As Single, t2 As Single
              
                    t1 = Timer
                    For i = 1 To 10000
                       j = Instr (StartPos, Txt, Pattern)
                    Next
                    t2 = Timer
                    MsgBox Str$(t2 - t1)
              
                    t1 = Timer
                    For i = 1 To 10000
                       j = ExactShiftAND (StartPos, Txt, Pattern)
                    Next
                    t2 = Timer
                    MsgBox Str$(t2 - t1)
                    
                 End Function
              ------------------
              E-MAIL: [email protected]

              Comment


              • #8
                Second variant of algorithm (difference one letter maximum).
                Code:
                   #Compile Exe
                   #Dim All
                   #Register None
                   #Include "Win32Api.Inc"
                   
                   Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
                      Dim i As Dword, j As Dword, k As Dword, r As Dword, m As Dword, CVTbl(255) As Dword
                      m = Len(Pattern)
                      j = 1
                      For i = 1 To m
                         CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
                         If i <> m Then j = j + j
                      Next
                      For i = StartPos To Len(Txt)
                         k = CVTbl(Asc(Txt, i))
                         r = (r + r + 1) And k
                         If (r And j) Then Function = i - m + 1: Exit For
                      Next
                   End Function
                
                   Function Fuzzy1ShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
                      Dim i As Dword, j As Dword, k As Dword, r1 As Dword, r2 As Dword, m As Dword, CVTbl(255) As Dword
                      m = Len(Pattern)
                      j = 1
                      For i = 1 To m
                         CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
                         If i <> m Then j = j + j
                      Next
                      For i = StartPos To Len(Txt)
                         k = CVTbl(Asc(Txt, i))
                         r1 = r1 + r1 + 1
                         r2 = ((r2 + r2) And k) Or r1
                         r1 = r1 And k
                         If (r2 And j) Then Function = i - m + 1: Exit For
                      Next
                   End Function
                   
                   $Text1 = "Something Semen Matusovski Something"
                   $Text2 = "Something Semen Masusovski Something"
                   $Pattern = "Matusov"
                   %StartPos = 7
                   
                   Function PbMain
                      Dim i As Long
                      If Instr(%Startpos, $Text1, $Pattern) = ExactShiftAND(%StartPos, $Text1, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal"
                      i = Fuzzy1ShiftAND(%StartPos, $Text2, $Pattern)
                      If i > 0 Then MsgBox $Pattern,, Mid$($Text2, i, Len($Pattern))
                   End Function
                Fuzzy1ShiftAND is easy to convert to FuzzyNShiftAnd (additional variables R3, R4 ... Rn+1)

                ------------------
                E-MAIL: [email protected]

                Comment


                • #9
                  Semen,

                  I tried the URL you posted but could not find any data on the search
                  algo at all. I put the converted asm version into a test piece I have
                  here but it much slower than the ones I am working with, The Boyer
                  Moore I posted with a 17 character pattern on a 52 meg sample is
                  running at about 260 milliseconds wer this one is runing at over 2
                  seconds, do you have some available technical data to try and code it
                  from scratch ?

                  I have been playing with it and reduced the instruction count some
                  but it needs a full recode to try and get the bottlenecks out of it.

                  ==============================================================
                  LATER: I have it down under 1 second but it has a long way to go to
                  be competitive.
                  ==============================================================

                  Regards,

                  [email protected]

                  [This message has been edited by Steve Hutchesson (edited May 15, 2001).]
                  hutch at movsd dot com
                  The MASM Forum

                  www.masm32.com

                  Comment


                  • #10
                    Steve --
                    Unf. I can give exact link to russian article only http://dox.sbnet.ru:8082/publications/findstr.ru.txt
                    (at the bottom you can see C code)

                    BTW, there is not bad English description of different algorithms http://www-igm.univ-mlv.fr/~lecroq/string/node1.html

                    Yesterdsy I tried Knuth-Morris-Pratt algorithm.
                    Code:
                        #Compile Exe
                        #Register None
                        #Dim All
                        #Include "Win32Api.Inc"
                    
                        Function BM (StartPos As Long, MainString As String, MatchString As String) As Long
                    
                           Register i As Long, j As Long
                           Static m As Long, n As Long, x As Byte Ptr, y As Byte Ptr
                    
                           If StartPos < 1 Then ' Preprocessing
                    
                              m = Len(MatchString): x = StrPtr(MatchString)
                              ReDim kmp_next(0 To m - 1) As Static Long
                              i = 0: j = -1: kmp_next(0)= -1
                              While i < m
                                 While (j > -1) And (@x[i] <> @x[j]): j = kmp_next(j): Wend
                                 Incr i: Incr j
                                 If @x[i] = @x[j] Then kmp_next(i)= kmp_next(j) Else kmp_next(i)= j
                              Wend
                              n = Len(MainString): y = StrPtr(MainString)
                    
                           Else ' Searching
                    
                              i = StartPos - 1: j = 0
                              While i < n
                                 While (j > -1) And (@x[j] <> @y[i]): j = kmp_next(j): Wend
                                 Incr i: Incr j
                                 If (j >= m) Then Function = i-j + 1: Exit Function ' j=kmp_next(j)
                              Wend
                    
                           End If
                        End Function
                    
                        Function PbMain
                           Dim i As Long, Txt As String, Pattern As String, StartPos As Long
                           Dim t1 As Single, t2 As Single, k As Long
                           Txt = String$(1000, " S Matus Semen Matus 12lsalkdklsdlslsllds3456")
                           Pattern = "Matus Semen"
                    
                           t1 = Timer
                           For k = 1 To 10000
                              BM 0, Txt, Pattern ' Initialization
                              StartPos = 1
                              Do
                                 i = BM (StartPos, "", ""): If i = 0 Then Exit Do
                                 ' MsgBox Mid$(Txt, i, Len(Pattern)),, Str$(i)
                                 StartPos = i + Len(Pattern) + 1
                              Loop
                           Next
                           t2 = Timer
                           MsgBox Format$(t2 - t1, "#.####"),, "My"
                    
                           t1 = Timer
                           For k = 1 To 10000
                              StartPos = 1
                              Do
                                 i = Instr (StartPos, Txt, Pattern): If i = 0 Then Exit Do
                                 StartPos = i + Len(Pattern) + 1
                              Loop
                           Next
                           t2 = Timer
                           MsgBox Format$(t2 - t1, "#.####"),, "Instr"
                    
                        End Function
                    Instr won 5-6 : 1. Even if to rewrite to asm, very doubt that it will work faster than Instr.

                    ------------------
                    E-MAIL: [email protected]

                    Comment


                    • #11
                      Semen,

                      Interesting translation but I think the results show that the logic
                      of a KPM algo is not up to the task. It apparently has some advantages
                      on repeat patterns but generally a normal forward byte scanner
                      will outperform it.

                      The site http://www-igm.univ-mlv.fr/~lecroq/string/index.html
                      has some interesting stuff on it but unfortunately very little
                      in terms of the theory of how the different algorithms work.

                      I chose the Boyer Moore because it had very clean logic and could
                      be implemented to do the dual heuristic at the expense of an extra
                      jump. While there are a number of variations of it, most seem to
                      be based on character match count theory rather than how fast the
                      algorithm can be implemented.

                      Getting a theoretical improvement in sublinearity is of little use
                      if the overhead to do it makes it slower than a basic byte scanner
                      and it is a matter of fact that x86 processors do forward scans at
                      competitive rate.

                      I had a number of fast byte scanners to test against when I did this
                      version and I found that 1 register stall made the Boyer Moore slower
                      than a byte scanner.

                      The real problem with Boyer Moore type algos is that they must construct
                      a shift table which makes them unsuitable for short sources or patterns.

                      Of the variants, the Horsepool and turbo boyer moore types are apparently
                      reasonable performers but unless they are very well coded, a byte scanner
                      will be faster.

                      Regards,

                      [email protected]


                      ------------------
                      hutch at movsd dot com
                      The MASM Forum

                      www.masm32.com

                      Comment

                      Working...
                      X