Announcement

Collapse
No announcement yet.

New Boyer Moore search algorithm

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Steve Hutchesson
    replied
    Semen,

    Interesting translation but I think the results show that the logic
    of a KPM algo is not up to the task. It apparently has some advantages
    on repeat patterns but generally a normal forward byte scanner
    will outperform it.

    The site http://www-igm.univ-mlv.fr/~lecroq/string/index.html
    has some interesting stuff on it but unfortunately very little
    in terms of the theory of how the different algorithms work.

    I chose the Boyer Moore because it had very clean logic and could
    be implemented to do the dual heuristic at the expense of an extra
    jump. While there are a number of variations of it, most seem to
    be based on character match count theory rather than how fast the
    algorithm can be implemented.

    Getting a theoretical improvement in sublinearity is of little use
    if the overhead to do it makes it slower than a basic byte scanner
    and it is a matter of fact that x86 processors do forward scans at
    competitive rate.

    I had a number of fast byte scanners to test against when I did this
    version and I found that 1 register stall made the Boyer Moore slower
    than a byte scanner.

    The real problem with Boyer Moore type algos is that they must construct
    a shift table which makes them unsuitable for short sources or patterns.

    Of the variants, the Horsepool and turbo boyer moore types are apparently
    reasonable performers but unless they are very well coded, a byte scanner
    will be faster.

    Regards,

    [email protected]


    ------------------

    Leave a comment:


  • Semen Matusovski
    replied
    Steve --
    Unf. I can give exact link to russian article only http://dox.sbnet.ru:8082/publications/findstr.ru.txt
    (at the bottom you can see C code)

    BTW, there is not bad English description of different algorithms http://www-igm.univ-mlv.fr/~lecroq/string/node1.html

    Yesterdsy I tried Knuth-Morris-Pratt algorithm.
    Code:
        #Compile Exe
        #Register None
        #Dim All
        #Include "Win32Api.Inc"
    
        Function BM (StartPos As Long, MainString As String, MatchString As String) As Long
    
           Register i As Long, j As Long
           Static m As Long, n As Long, x As Byte Ptr, y As Byte Ptr
    
           If StartPos < 1 Then ' Preprocessing
    
              m = Len(MatchString): x = StrPtr(MatchString)
              ReDim kmp_next(0 To m - 1) As Static Long
              i = 0: j = -1: kmp_next(0)= -1
              While i < m
                 While (j > -1) And (@x[i] <> @x[j]): j = kmp_next(j): Wend
                 Incr i: Incr j
                 If @x[i] = @x[j] Then kmp_next(i)= kmp_next(j) Else kmp_next(i)= j
              Wend
              n = Len(MainString): y = StrPtr(MainString)
    
           Else ' Searching
    
              i = StartPos - 1: j = 0
              While i < n
                 While (j > -1) And (@x[j] <> @y[i]): j = kmp_next(j): Wend
                 Incr i: Incr j
                 If (j >= m) Then Function = i-j + 1: Exit Function ' j=kmp_next(j)
              Wend
    
           End If
        End Function
    
        Function PbMain
           Dim i As Long, Txt As String, Pattern As String, StartPos As Long
           Dim t1 As Single, t2 As Single, k As Long
           Txt = String$(1000, " S Matus Semen Matus 12lsalkdklsdlslsllds3456")
           Pattern = "Matus Semen"
    
           t1 = Timer
           For k = 1 To 10000
              BM 0, Txt, Pattern ' Initialization
              StartPos = 1
              Do
                 i = BM (StartPos, "", ""): If i = 0 Then Exit Do
                 ' MsgBox Mid$(Txt, i, Len(Pattern)),, Str$(i)
                 StartPos = i + Len(Pattern) + 1
              Loop
           Next
           t2 = Timer
           MsgBox Format$(t2 - t1, "#.####"),, "My"
    
           t1 = Timer
           For k = 1 To 10000
              StartPos = 1
              Do
                 i = Instr (StartPos, Txt, Pattern): If i = 0 Then Exit Do
                 StartPos = i + Len(Pattern) + 1
              Loop
           Next
           t2 = Timer
           MsgBox Format$(t2 - t1, "#.####"),, "Instr"
    
        End Function
    Instr won 5-6 : 1. Even if to rewrite to asm, very doubt that it will work faster than Instr.

    ------------------
    E-MAIL: [email protected]

    Leave a comment:


  • Steve Hutchesson
    replied
    Semen,

    I tried the URL you posted but could not find any data on the search
    algo at all. I put the converted asm version into a test piece I have
    here but it much slower than the ones I am working with, The Boyer
    Moore I posted with a 17 character pattern on a 52 meg sample is
    running at about 260 milliseconds wer this one is runing at over 2
    seconds, do you have some available technical data to try and code it
    from scratch ?

    I have been playing with it and reduced the instruction count some
    but it needs a full recode to try and get the bottlenecks out of it.

    ==============================================================
    LATER: I have it down under 1 second but it has a long way to go to
    be competitive.
    ==============================================================

    Regards,

    [email protected]

    [This message has been edited by Steve Hutchesson (edited May 15, 2001).]

    Leave a comment:


  • Semen Matusovski
    replied
    Second variant of algorithm (difference one letter maximum).
    Code:
       #Compile Exe
       #Dim All
       #Register None
       #Include "Win32Api.Inc"
       
       Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
          Dim i As Dword, j As Dword, k As Dword, r As Dword, m As Dword, CVTbl(255) As Dword
          m = Len(Pattern)
          j = 1
          For i = 1 To m
             CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
             If i <> m Then j = j + j
          Next
          For i = StartPos To Len(Txt)
             k = CVTbl(Asc(Txt, i))
             r = (r + r + 1) And k
             If (r And j) Then Function = i - m + 1: Exit For
          Next
       End Function
    
       Function Fuzzy1ShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
          Dim i As Dword, j As Dword, k As Dword, r1 As Dword, r2 As Dword, m As Dword, CVTbl(255) As Dword
          m = Len(Pattern)
          j = 1
          For i = 1 To m
             CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
             If i <> m Then j = j + j
          Next
          For i = StartPos To Len(Txt)
             k = CVTbl(Asc(Txt, i))
             r1 = r1 + r1 + 1
             r2 = ((r2 + r2) And k) Or r1
             r1 = r1 And k
             If (r2 And j) Then Function = i - m + 1: Exit For
          Next
       End Function
       
       $Text1 = "Something Semen Matusovski Something"
       $Text2 = "Something Semen Masusovski Something"
       $Pattern = "Matusov"
       %StartPos = 7
       
       Function PbMain
          Dim i As Long
          If Instr(%Startpos, $Text1, $Pattern) = ExactShiftAND(%StartPos, $Text1, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal"
          i = Fuzzy1ShiftAND(%StartPos, $Text2, $Pattern)
          If i > 0 Then MsgBox $Pattern,, Mid$($Text2, i, Len($Pattern))
       End Function
    Fuzzy1ShiftAND is easy to convert to FuzzyNShiftAnd (additional variables R3, R4 ... Rn+1)

    ------------------
    E-MAIL: [email protected]

    Leave a comment:


  • Semen Matusovski
    replied
    I rewrote to inline assembler and found that Instr works very strange.
    If to change String$(1000, "M 1234567890") to something else (not M as first letter), Instr will be faster in some times.
    Actually, it's necessary to take real data, but it looks that in general Instr works better.

    Probably, my assembler code is in-effective. But I think that main reason is algorithm as it is.
    Like I wrpte, I "translated" first variant of this algorithm.
    Second (main) variant is designed to find non-exact string (one, two ... letters difference).
    Code (and time) is very similar to first variant.
    Hope that in this case ShiftAnd will demonstrate own advantages.

    Code:
       #Compile Exe
       #Dim All
       #Register None
       #Include "Win32Api.Inc"
    
       Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
          #Register None
          Dim sCVTbl As Static String * 1024, Tmp1 As Static Dword, Tmp2 As Static Dword
    
          ! MOV EDX, Pattern ' VarPtr(Pattern)
          ! MOV EDI, [EDX]   ' StrPtr(Pattern)
          ! MOV ECX, EDI
          ! SUB ECX, 4
          ! MOV ECX, [ECX]   ' Len(Pattern)
          ! SUB ECX, 1
          ! MOV Tmp2, ECX
          ! ADD ECX, EDI
    
          ! LEA EBX, sCVTbl
          ! MOV Tmp1, EBX
          ! MOV EAX, 1
          
     ExactShiftAND_Lb1:
          ! MOV EDX, 0
          ! MOV DL, [EDI]
          ! ADD EDX, EDX
          ! ADD EDX, EDX
          ! ADD EDX, Tmp1
          ! MOV EBX, [EDX]
          ! OR  EBX, EAX
          ! MOV [EDX], EBX
          ! CMP EDI, ECX
          ! JGE ExactShiftAND_Lb2
          ! INC EDI
          ! ADD EAX, EAX
          ! JMP ExactShiftAND_Lb1
         
     ExactShiftAND_Lb2:
          ! MOV EDX, Txt     ' VarPtr(Txt)
          ! MOV EDI, [EDX]   ' StrPtr(Txt)
          ! MOV ECX, EDI
          ! SUB ECX, 4
          ! MOV ECX, [ECX]   ' Len(Txt)
          ! DEC EDI
          ! ADD Tmp2, EDI
          ! ADD ECX, EDI
    
          ! MOV EBX, StartPos
          ! ADD EDI, [EBX]
     
          ! MOV ESI, 0
     ExactShiftAND_Lb3:
          ! ADD ESI, ESI
          ! INC ESI
         
          ! MOV EDX, 0
          ! MOV DL, [EDI]
          ! ADD EDX, EDX
          ! ADD EDX, EDX
          ! ADD EDX, Tmp1
          ! AND ESI, [EDX]
          ! MOV EBX, ESI
          ! AND EBX, EAX
          ! JE  ExactShiftAND_Lb4
          ! SUB EDI, Tmp2
          ! JMP ExactShiftAND_Lb6
    
     ExactShiftAND_Lb4:
          ! CMP EDI, ECX
          ! JGE ExactShiftAND_Lb5
          ! INC EDI
          ! JMP ExactShiftAND_Lb3
         
    ExactShiftAND_Lb5:
          ! MOV EDI, 0
         
    ExactShiftAND_Lb6:
          ! MOV Function, EDI
       End Function
    
       Function PbMain
          Dim Txt As String, Pattern As String, StartPos As Long
          Txt = String$(1000, "M 1234567890") + "Semen Matusovski"
          Pattern = "Matusov"
          StartPos = 4
          
          If Instr(Startpos, Txt, Pattern) <> ExactShiftAND(StartPos, Txt, Pattern) Then MsgBox "Not equal": Exit Function
          
          Dim i As Long, j As Long, t1 As Single, t2 As Single
    
          t1 = Timer
          For i = 1 To 10000
             j = Instr (StartPos, Txt, Pattern)
          Next
          t2 = Timer
          MsgBox Str$(t2 - t1)
    
          t1 = Timer
          For i = 1 To 10000
             j = ExactShiftAND (StartPos, Txt, Pattern)
          Next
          t2 = Timer
          MsgBox Str$(t2 - t1)
          
       End Function
    ------------------
    E-MAIL: [email protected]

    Leave a comment:


  • Steve Hutchesson
    replied
    Semen,

    It is an interesting looking algorithm, I have not digested how
    it works but there are many specialised applications for search
    algorithms, the dictionary search appears to be one of them.

    I would be interested to see what it performed like when coded
    in assembler, there should be some very good speed gains if it
    was written carefully.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Steve Hutchesson
    replied
    Wayne,

    The algorithm is primarily for searching for BYTE sequences within larger bodies
    of information, it can perform text search with no problems but because of its
    primary binary usage, it requires the address of both the source memory and the
    memory of the pattern. It also requires the length of both the source and the
    pattern.

    To use it you need the BYTE or STRING data in memory and have its address,

    Code:
        LOCAL slen as LONG
        LOCAL src  as LONG
        LOCAL ln   as LONG
        LOCAL pat  as LONG
        LOCAL pln  as LONG
        LOCAL spos as LONG
    
        src  = StrPtr(txt$)  ; source BYTE or text data address
        ln   = len(txt$)     ; this would usually be supplied when you open the file
        pat  = StrPtr(pattern$)  ; pattern address
        pln  = len(pattern$) ; pattern length
        spos = 0             ; where to start in the file (0 based offset)
    
        slen = BMBinSearch(spos,src,ln,pat,pln)
    Typically it is used for things like searching a full drive for a virus pattern,
    searching a very large data base file for text or index keys, I have heard of
    some exotic uses of searching for prime numbers in big files but its strength
    is in searching very large bodies of data for a specific BYTE or STRING pattern.

    In performance terms, it starts to be faster than normal byte scanner type
    search algorithms at about 6 characters in length and if you are searching for
    long patterns, it gets much faster as it skips larger blocks of BYTE data on
    a mismatch. The other factor is that it constructs a 256 entry table at startup
    so it pays a penalty on short strings.

    For most of my normal search requirements I use the INSTR function because it
    is faster on short patterns and on shorter strings. A Boyer Moore search algo
    has its advantage on long linear searches.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Semen Matusovski
    replied
    There is very interesting Shift-And algorithm (Udi Manber & Sun Wu; published in 1992 - see www.arizona.edu).

    I translated first part of it (exact comparasion) to PB.
    Should be very fast (of course, it's necessary to convert to inline Asm).
    Limitation: length of pattern is 1 .. 32 (enough for search in dictionaries).

    Code:
       #Compile Exe
       #Dim All
       #Register None
       #Include "Win32Api.Inc"
    
       Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long
          
         Dim i As Dword, j As Dword, r As Dword, m As Dword, CVTbl(255) As Dword
    
         m = Len(Pattern)
         j = 1
         For i = 1 To m
            CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j
            If i <> m Then j = j + j
         Next
         
         For i = StartPos To Len(Txt)
            r = (r + r + 1) And CVTbl(Asc(Txt, i))
            If (r And j) Then Function = i - m + 1: Exit For
         Next
    
       End Function
    
       $Text = "Something Semen Matusovski Something"
       $Pattern = "Matusov"
       %StartPos = 7
       
       Function PbMain
          If Instr(%Startpos, $Text, $Pattern) = ExactShiftAND(%StartPos, $Text, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal"
       End Function
    ------------------
    E-MAIL: [email protected]

    Leave a comment:


  • Wayne Diamond
    replied
    Steve could you (or somebody) post demo source on how to call your BM function? And when would you use it instead of INSTR? You said it can search for patterns... what is a pattern - anything like this - "*.exe" ?
    Thanks! I did a bit of Boyer-Moore stuff in my old pascal days at school but I've forgotten it all


    ------------------

    Leave a comment:


  • Paul Noble
    replied
    This is a gem Steve, thanks very much -

    Leave a comment:


  • Steve Hutchesson
    started a topic New Boyer Moore search algorithm

    New Boyer Moore search algorithm

    The last Boyer Moore algo I had a go at failed on repeat sequences of
    characters and it stemmed from trying to write the algorithm with only
    the "good suffix" shift table. I hunted up enough information to write
    a dual heuristic version that does both the "bad character" shift and
    the "good suffix" shift and this one has been testing for weeks with no
    problems and it is faster as well.

    It is an algorithm with a reasonably narrow usage, it was originally
    designed back in 1977 for very high speed pattern matching where the
    overhead of constructing the shift table did not matter. With normal
    string search requirements, INSTR is easier to use and more flexible.

    If you have the need to search very large blocks of information, this
    algorithm is well suited to the task. It has the characteristic of
    getting faster as the search pattern gets longer. It starts to get
    faster than a normal search algorithm once the pattern to be searched
    for is over 6 characters long. With long search patterns, (70 - 80 characters)
    I am getting over 300 meg/sec on my PIII 600 so it is reasonably fast.

    A little piece of pure PowerBASIC for your pleasure.

    [email protected]

    Code:
      ' #########################################################################
      
      FUNCTION BMBinSearch(ByVal startpos as LONG, _
                           ByVal lpSource as LONG,ByVal srcLngth as LONG, _
                           ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG
      
          #REGISTER NONE
      
          LOCAL cval    as DWORD
          LOCAL ExitLen as DWORD
          LOCAL shift_table as STRING * 1024  ' 256 DWORDs
      
          ! cmp subLngth, 1
          ! jg tooShort
          ! mov eax, -2                 ; string too short, must be > 1
          ! jmp Cleanup
        tooShort:
      
          ! mov esi, lpSource
          ! add esi, srcLngth
          ! sub esi, subLngth
          ! mov ExitLen, esi            ; set Exit Length
      
        ' ----------------------------------------
        ' load shift table with value in subLngth
        ' ----------------------------------------
          ! mov ecx, 256
          ! mov eax, subLngth
          ! lea edi, shift_table
          ! rep stosd
      
        ' ----------------------------------------------
        ' load decending count values into shift table
        ' ----------------------------------------------
          ! mov ecx, subLngth           ; SubString length in ECX
          ! dec ecx                     ; correct for zero based index
          ! mov esi, lpSubStr           ; address of SubString in ESI
          ! lea edi, shift_table
      
          ! xor eax, eax
      
        Write_Shift_Chars:
          ! mov al, [esi]               ; get the character
          ! inc esi
          ! mov [edi+eax*4], ecx        ; write shift for each character
          ! dec ecx                     ; to ascii location in table
          ! jnz Write_Shift_Chars
      
        ' -----------------------------
        ' set up for main compare loop
        ' -----------------------------
          ! mov ecx, subLngth
          ! dec ecx
          ! mov cval, ecx
      
          ! mov esi, lpSource
          ! mov edi, lpSubStr
          ! mov ebx, esi                ; EBX as location pointer
          ! add esi, startpos           ; add starting position
      
        Cmp_Loop:
          ! xor eax, eax                ; zero EAX
          ! mov al, [esi+ecx]
          ! cmp al, [edi+ecx]           ; cmp characters in ESI / EDI
          ! jne Set_Shift               ; if not equal, get next shift
          ! dec ecx
          ! jns Cmp_Loop
      
        Match:
          ! sub esi, lpSource           ; sub source from ESI
          ! mov eax, esi                ; put length in eax
          ! jmp Cleanup
      
        Set_Shift:
          ! mov edi, lpSubStr           ; restore sub string address
          ! mov edx, shift_table[eax*4] ; get char shift value
          ! cmp edx, subLngth           ; is it pattern length ?
          ! jne Suffix_Shift            ; if not, jump to Suffix_Shift
      
        Bad_Char_Shift:
          ! lea ebx, [ebx+ecx+1]        ; add bad char shift
          ! mov ecx, cval               ; reset counter in compare loop
          ! mov esi, ebx
          ! cmp esi, ExitLen
          ! jle Cmp_Loop
      
          ! mov eax, -1
          ! jmp Cleanup
      
        Suffix_Shift:
          ! add ebx, edx                ; add suffix shift
          ! mov ecx, cval               ; reset counter in compare loop
          ! mov esi, ebx
          ! cmp esi, ExitLen
          ! jle Cmp_Loop
      
          ! mov eax, -1                 ; set value for no match
      
        Cleanup:
      
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' ########################################################################
    ------------------
Working...
X