Semen,
Interesting translation but I think the results show that the logic
of a KPM algo is not up to the task. It apparently has some advantages
on repeat patterns but generally a normal forward byte scanner
will outperform it.
The site http://www-igm.univ-mlv.fr/~lecroq/string/index.html
has some interesting stuff on it but unfortunately very little
in terms of the theory of how the different algorithms work.
I chose the Boyer Moore because it had very clean logic and could
be implemented to do the dual heuristic at the expense of an extra
jump. While there are a number of variations of it, most seem to
be based on character match count theory rather than how fast the
algorithm can be implemented.
Getting a theoretical improvement in sublinearity is of little use
if the overhead to do it makes it slower than a basic byte scanner
and it is a matter of fact that x86 processors do forward scans at
competitive rate.
I had a number of fast byte scanners to test against when I did this
version and I found that 1 register stall made the Boyer Moore slower
than a byte scanner.
The real problem with Boyer Moore type algos is that they must construct
a shift table which makes them unsuitable for short sources or patterns.
Of the variants, the Horsepool and turbo boyer moore types are apparently
reasonable performers but unless they are very well coded, a byte scanner
will be faster.
Regards,
[email protected]
------------------
Announcement
Collapse
No announcement yet.
New Boyer Moore search algorithm
Collapse
X
-
Steve --
Unf. I can give exact link to russian article only http://dox.sbnet.ru:8082/publications/findstr.ru.txt
(at the bottom you can see C code)
BTW, there is not bad English description of different algorithms http://www-igm.univ-mlv.fr/~lecroq/string/node1.html
Yesterdsy I tried Knuth-Morris-Pratt algorithm.
Code:#Compile Exe #Register None #Dim All #Include "Win32Api.Inc" Function BM (StartPos As Long, MainString As String, MatchString As String) As Long Register i As Long, j As Long Static m As Long, n As Long, x As Byte Ptr, y As Byte Ptr If StartPos < 1 Then ' Preprocessing m = Len(MatchString): x = StrPtr(MatchString) ReDim kmp_next(0 To m - 1) As Static Long i = 0: j = -1: kmp_next(0)= -1 While i < m While (j > -1) And (@x[i] <> @x[j]): j = kmp_next(j): Wend Incr i: Incr j If @x[i] = @x[j] Then kmp_next(i)= kmp_next(j) Else kmp_next(i)= j Wend n = Len(MainString): y = StrPtr(MainString) Else ' Searching i = StartPos - 1: j = 0 While i < n While (j > -1) And (@x[j] <> @y[i]): j = kmp_next(j): Wend Incr i: Incr j If (j >= m) Then Function = i-j + 1: Exit Function ' j=kmp_next(j) Wend End If End Function Function PbMain Dim i As Long, Txt As String, Pattern As String, StartPos As Long Dim t1 As Single, t2 As Single, k As Long Txt = String$(1000, " S Matus Semen Matus 12lsalkdklsdlslsllds3456") Pattern = "Matus Semen" t1 = Timer For k = 1 To 10000 BM 0, Txt, Pattern ' Initialization StartPos = 1 Do i = BM (StartPos, "", ""): If i = 0 Then Exit Do ' MsgBox Mid$(Txt, i, Len(Pattern)),, Str$(i) StartPos = i + Len(Pattern) + 1 Loop Next t2 = Timer MsgBox Format$(t2 - t1, "#.####"),, "My" t1 = Timer For k = 1 To 10000 StartPos = 1 Do i = Instr (StartPos, Txt, Pattern): If i = 0 Then Exit Do StartPos = i + Len(Pattern) + 1 Loop Next t2 = Timer MsgBox Format$(t2 - t1, "#.####"),, "Instr" End Function
------------------
E-MAIL: [email protected]
Leave a comment:
-
Semen,
I tried the URL you posted but could not find any data on the search
algo at all. I put the converted asm version into a test piece I have
here but it much slower than the ones I am working with, The Boyer
Moore I posted with a 17 character pattern on a 52 meg sample is
running at about 260 milliseconds wer this one is runing at over 2
seconds, do you have some available technical data to try and code it
from scratch ?
I have been playing with it and reduced the instruction count some
but it needs a full recode to try and get the bottlenecks out of it.
==============================================================
LATER: I have it down under 1 second but it has a long way to go to
be competitive.
==============================================================
Regards,
[email protected]
[This message has been edited by Steve Hutchesson (edited May 15, 2001).]
Leave a comment:
-
Second variant of algorithm (difference one letter maximum).
Code:#Compile Exe #Dim All #Register None #Include "Win32Api.Inc" Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long Dim i As Dword, j As Dword, k As Dword, r As Dword, m As Dword, CVTbl(255) As Dword m = Len(Pattern) j = 1 For i = 1 To m CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j If i <> m Then j = j + j Next For i = StartPos To Len(Txt) k = CVTbl(Asc(Txt, i)) r = (r + r + 1) And k If (r And j) Then Function = i - m + 1: Exit For Next End Function Function Fuzzy1ShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long Dim i As Dword, j As Dword, k As Dword, r1 As Dword, r2 As Dword, m As Dword, CVTbl(255) As Dword m = Len(Pattern) j = 1 For i = 1 To m CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j If i <> m Then j = j + j Next For i = StartPos To Len(Txt) k = CVTbl(Asc(Txt, i)) r1 = r1 + r1 + 1 r2 = ((r2 + r2) And k) Or r1 r1 = r1 And k If (r2 And j) Then Function = i - m + 1: Exit For Next End Function $Text1 = "Something Semen Matusovski Something" $Text2 = "Something Semen Masusovski Something" $Pattern = "Matusov" %StartPos = 7 Function PbMain Dim i As Long If Instr(%Startpos, $Text1, $Pattern) = ExactShiftAND(%StartPos, $Text1, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal" i = Fuzzy1ShiftAND(%StartPos, $Text2, $Pattern) If i > 0 Then MsgBox $Pattern,, Mid$($Text2, i, Len($Pattern)) End Function
------------------
E-MAIL: [email protected]
Leave a comment:
-
I rewrote to inline assembler and found that Instr works very strange.
If to change String$(1000, "M 1234567890") to something else (not M as first letter), Instr will be faster in some times.
Actually, it's necessary to take real data, but it looks that in general Instr works better.
Probably, my assembler code is in-effective. But I think that main reason is algorithm as it is.
Like I wrpte, I "translated" first variant of this algorithm.
Second (main) variant is designed to find non-exact string (one, two ... letters difference).
Code (and time) is very similar to first variant.
Hope that in this case ShiftAnd will demonstrate own advantages.
Code:#Compile Exe #Dim All #Register None #Include "Win32Api.Inc" Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long #Register None Dim sCVTbl As Static String * 1024, Tmp1 As Static Dword, Tmp2 As Static Dword ! MOV EDX, Pattern ' VarPtr(Pattern) ! MOV EDI, [EDX] ' StrPtr(Pattern) ! MOV ECX, EDI ! SUB ECX, 4 ! MOV ECX, [ECX] ' Len(Pattern) ! SUB ECX, 1 ! MOV Tmp2, ECX ! ADD ECX, EDI ! LEA EBX, sCVTbl ! MOV Tmp1, EBX ! MOV EAX, 1 ExactShiftAND_Lb1: ! MOV EDX, 0 ! MOV DL, [EDI] ! ADD EDX, EDX ! ADD EDX, EDX ! ADD EDX, Tmp1 ! MOV EBX, [EDX] ! OR EBX, EAX ! MOV [EDX], EBX ! CMP EDI, ECX ! JGE ExactShiftAND_Lb2 ! INC EDI ! ADD EAX, EAX ! JMP ExactShiftAND_Lb1 ExactShiftAND_Lb2: ! MOV EDX, Txt ' VarPtr(Txt) ! MOV EDI, [EDX] ' StrPtr(Txt) ! MOV ECX, EDI ! SUB ECX, 4 ! MOV ECX, [ECX] ' Len(Txt) ! DEC EDI ! ADD Tmp2, EDI ! ADD ECX, EDI ! MOV EBX, StartPos ! ADD EDI, [EBX] ! MOV ESI, 0 ExactShiftAND_Lb3: ! ADD ESI, ESI ! INC ESI ! MOV EDX, 0 ! MOV DL, [EDI] ! ADD EDX, EDX ! ADD EDX, EDX ! ADD EDX, Tmp1 ! AND ESI, [EDX] ! MOV EBX, ESI ! AND EBX, EAX ! JE ExactShiftAND_Lb4 ! SUB EDI, Tmp2 ! JMP ExactShiftAND_Lb6 ExactShiftAND_Lb4: ! CMP EDI, ECX ! JGE ExactShiftAND_Lb5 ! INC EDI ! JMP ExactShiftAND_Lb3 ExactShiftAND_Lb5: ! MOV EDI, 0 ExactShiftAND_Lb6: ! MOV Function, EDI End Function Function PbMain Dim Txt As String, Pattern As String, StartPos As Long Txt = String$(1000, "M 1234567890") + "Semen Matusovski" Pattern = "Matusov" StartPos = 4 If Instr(Startpos, Txt, Pattern) <> ExactShiftAND(StartPos, Txt, Pattern) Then MsgBox "Not equal": Exit Function Dim i As Long, j As Long, t1 As Single, t2 As Single t1 = Timer For i = 1 To 10000 j = Instr (StartPos, Txt, Pattern) Next t2 = Timer MsgBox Str$(t2 - t1) t1 = Timer For i = 1 To 10000 j = ExactShiftAND (StartPos, Txt, Pattern) Next t2 = Timer MsgBox Str$(t2 - t1) End Function
E-MAIL: [email protected]
Leave a comment:
-
Semen,
It is an interesting looking algorithm, I have not digested how
it works but there are many specialised applications for search
algorithms, the dictionary search appears to be one of them.
I would be interested to see what it performed like when coded
in assembler, there should be some very good speed gains if it
was written carefully.
Regards,
[email protected]
------------------
Leave a comment:
-
Wayne,
The algorithm is primarily for searching for BYTE sequences within larger bodies
of information, it can perform text search with no problems but because of its
primary binary usage, it requires the address of both the source memory and the
memory of the pattern. It also requires the length of both the source and the
pattern.
To use it you need the BYTE or STRING data in memory and have its address,
Code:LOCAL slen as LONG LOCAL src as LONG LOCAL ln as LONG LOCAL pat as LONG LOCAL pln as LONG LOCAL spos as LONG src = StrPtr(txt$) ; source BYTE or text data address ln = len(txt$) ; this would usually be supplied when you open the file pat = StrPtr(pattern$) ; pattern address pln = len(pattern$) ; pattern length spos = 0 ; where to start in the file (0 based offset) slen = BMBinSearch(spos,src,ln,pat,pln)
searching a very large data base file for text or index keys, I have heard of
some exotic uses of searching for prime numbers in big files but its strength
is in searching very large bodies of data for a specific BYTE or STRING pattern.
In performance terms, it starts to be faster than normal byte scanner type
search algorithms at about 6 characters in length and if you are searching for
long patterns, it gets much faster as it skips larger blocks of BYTE data on
a mismatch. The other factor is that it constructs a 256 entry table at startup
so it pays a penalty on short strings.
For most of my normal search requirements I use the INSTR function because it
is faster on short patterns and on shorter strings. A Boyer Moore search algo
has its advantage on long linear searches.
Regards,
[email protected]
------------------
Leave a comment:
-
There is very interesting Shift-And algorithm (Udi Manber & Sun Wu; published in 1992 - see www.arizona.edu).
I translated first part of it (exact comparasion) to PB.
Should be very fast (of course, it's necessary to convert to inline Asm).
Limitation: length of pattern is 1 .. 32 (enough for search in dictionaries).
Code:#Compile Exe #Dim All #Register None #Include "Win32Api.Inc" Function ExactShiftAND (StartPos As Long, Txt As String, Pattern As String) As Long Dim i As Dword, j As Dword, r As Dword, m As Dword, CVTbl(255) As Dword m = Len(Pattern) j = 1 For i = 1 To m CVTbl(Asc(Pattern, i)) = CVTbl(Asc(Pattern, i)) Or j If i <> m Then j = j + j Next For i = StartPos To Len(Txt) r = (r + r + 1) And CVTbl(Asc(Txt, i)) If (r And j) Then Function = i - m + 1: Exit For Next End Function $Text = "Something Semen Matusovski Something" $Pattern = "Matusov" %StartPos = 7 Function PbMain If Instr(%Startpos, $Text, $Pattern) = ExactShiftAND(%StartPos, $Text, $Pattern) Then MsgBox "Equal" Else MsgBox "Not equal" End Function
E-MAIL: [email protected]
Leave a comment:
-
Steve could you (or somebody) post demo source on how to call your BM function? And when would you use it instead of INSTR? You said it can search for patterns... what is a pattern - anything like this - "*.exe" ?
Thanks! I did a bit of Boyer-Moore stuff in my old pascal days at school but I've forgotten it all
------------------
Leave a comment:
-
New Boyer Moore search algorithm
The last Boyer Moore algo I had a go at failed on repeat sequences of
characters and it stemmed from trying to write the algorithm with only
the "good suffix" shift table. I hunted up enough information to write
a dual heuristic version that does both the "bad character" shift and
the "good suffix" shift and this one has been testing for weeks with no
problems and it is faster as well.
It is an algorithm with a reasonably narrow usage, it was originally
designed back in 1977 for very high speed pattern matching where the
overhead of constructing the shift table did not matter. With normal
string search requirements, INSTR is easier to use and more flexible.
If you have the need to search very large blocks of information, this
algorithm is well suited to the task. It has the characteristic of
getting faster as the search pattern gets longer. It starts to get
faster than a normal search algorithm once the pattern to be searched
for is over 6 characters long. With long search patterns, (70 - 80 characters)
I am getting over 300 meg/sec on my PIII 600 so it is reasonably fast.
A little piece of pure PowerBASIC for your pleasure.
[email protected]
Code:' ######################################################################### FUNCTION BMBinSearch(ByVal startpos as LONG, _ ByVal lpSource as LONG,ByVal srcLngth as LONG, _ ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG #REGISTER NONE LOCAL cval as DWORD LOCAL ExitLen as DWORD LOCAL shift_table as STRING * 1024 ' 256 DWORDs ! cmp subLngth, 1 ! jg tooShort ! mov eax, -2 ; string too short, must be > 1 ! jmp Cleanup tooShort: ! mov esi, lpSource ! add esi, srcLngth ! sub esi, subLngth ! mov ExitLen, esi ; set Exit Length ' ---------------------------------------- ' load shift table with value in subLngth ' ---------------------------------------- ! mov ecx, 256 ! mov eax, subLngth ! lea edi, shift_table ! rep stosd ' ---------------------------------------------- ' load decending count values into shift table ' ---------------------------------------------- ! mov ecx, subLngth ; SubString length in ECX ! dec ecx ; correct for zero based index ! mov esi, lpSubStr ; address of SubString in ESI ! lea edi, shift_table ! xor eax, eax Write_Shift_Chars: ! mov al, [esi] ; get the character ! inc esi ! mov [edi+eax*4], ecx ; write shift for each character ! dec ecx ; to ascii location in table ! jnz Write_Shift_Chars ' ----------------------------- ' set up for main compare loop ' ----------------------------- ! mov ecx, subLngth ! dec ecx ! mov cval, ecx ! mov esi, lpSource ! mov edi, lpSubStr ! mov ebx, esi ; EBX as location pointer ! add esi, startpos ; add starting position Cmp_Loop: ! xor eax, eax ; zero EAX ! mov al, [esi+ecx] ! cmp al, [edi+ecx] ; cmp characters in ESI / EDI ! jne Set_Shift ; if not equal, get next shift ! dec ecx ! jns Cmp_Loop Match: ! sub esi, lpSource ; sub source from ESI ! mov eax, esi ; put length in eax ! jmp Cleanup Set_Shift: ! mov edi, lpSubStr ; restore sub string address ! mov edx, shift_table[eax*4] ; get char shift value ! cmp edx, subLngth ; is it pattern length ? ! jne Suffix_Shift ; if not, jump to Suffix_Shift Bad_Char_Shift: ! lea ebx, [ebx+ecx+1] ; add bad char shift ! mov ecx, cval ; reset counter in compare loop ! mov esi, ebx ! cmp esi, ExitLen ! jle Cmp_Loop ! mov eax, -1 ! jmp Cleanup Suffix_Shift: ! add ebx, edx ; add suffix shift ! mov ecx, cval ; reset counter in compare loop ! mov esi, ebx ! cmp esi, ExitLen ! jle Cmp_Loop ! mov eax, -1 ; set value for no match Cleanup: ! mov FUNCTION, eax END FUNCTION ' ########################################################################
Tags: None
Leave a comment: