The last Boyer Moore algo I had a go at failed on repeat sequences of
characters and it stemmed from trying to write the algorithm with only
the "good suffix" shift table. I hunted up enough information to write
a dual heuristic version that does both the "bad character" shift and
the "good suffix" shift and this one has been testing for weeks with no
problems and it is faster as well.
It is an algorithm with a reasonably narrow usage, it was originally
designed back in 1977 for very high speed pattern matching where the
overhead of constructing the shift table did not matter. With normal
string search requirements, INSTR is easier to use and more flexible.
If you have the need to search very large blocks of information, this
algorithm is well suited to the task. It has the characteristic of
getting faster as the search pattern gets longer. It starts to get
faster than a normal search algorithm once the pattern to be searched
for is over 6 characters long. With long search patterns, (70 - 80 characters)
I am getting over 300 meg/sec on my PIII 600 so it is reasonably fast.
A little piece of pure PowerBASIC for your pleasure.
[email protected]
------------------
characters and it stemmed from trying to write the algorithm with only
the "good suffix" shift table. I hunted up enough information to write
a dual heuristic version that does both the "bad character" shift and
the "good suffix" shift and this one has been testing for weeks with no
problems and it is faster as well.
It is an algorithm with a reasonably narrow usage, it was originally
designed back in 1977 for very high speed pattern matching where the
overhead of constructing the shift table did not matter. With normal
string search requirements, INSTR is easier to use and more flexible.
If you have the need to search very large blocks of information, this
algorithm is well suited to the task. It has the characteristic of
getting faster as the search pattern gets longer. It starts to get
faster than a normal search algorithm once the pattern to be searched
for is over 6 characters long. With long search patterns, (70 - 80 characters)
I am getting over 300 meg/sec on my PIII 600 so it is reasonably fast.
A little piece of pure PowerBASIC for your pleasure.
[email protected]
Code:
' ######################################################################### FUNCTION BMBinSearch(ByVal startpos as LONG, _ ByVal lpSource as LONG,ByVal srcLngth as LONG, _ ByVal lpSubStr as LONG,ByVal subLngth as LONG) as LONG #REGISTER NONE LOCAL cval as DWORD LOCAL ExitLen as DWORD LOCAL shift_table as STRING * 1024 ' 256 DWORDs ! cmp subLngth, 1 ! jg tooShort ! mov eax, -2 ; string too short, must be > 1 ! jmp Cleanup tooShort: ! mov esi, lpSource ! add esi, srcLngth ! sub esi, subLngth ! mov ExitLen, esi ; set Exit Length ' ---------------------------------------- ' load shift table with value in subLngth ' ---------------------------------------- ! mov ecx, 256 ! mov eax, subLngth ! lea edi, shift_table ! rep stosd ' ---------------------------------------------- ' load decending count values into shift table ' ---------------------------------------------- ! mov ecx, subLngth ; SubString length in ECX ! dec ecx ; correct for zero based index ! mov esi, lpSubStr ; address of SubString in ESI ! lea edi, shift_table ! xor eax, eax Write_Shift_Chars: ! mov al, [esi] ; get the character ! inc esi ! mov [edi+eax*4], ecx ; write shift for each character ! dec ecx ; to ascii location in table ! jnz Write_Shift_Chars ' ----------------------------- ' set up for main compare loop ' ----------------------------- ! mov ecx, subLngth ! dec ecx ! mov cval, ecx ! mov esi, lpSource ! mov edi, lpSubStr ! mov ebx, esi ; EBX as location pointer ! add esi, startpos ; add starting position Cmp_Loop: ! xor eax, eax ; zero EAX ! mov al, [esi+ecx] ! cmp al, [edi+ecx] ; cmp characters in ESI / EDI ! jne Set_Shift ; if not equal, get next shift ! dec ecx ! jns Cmp_Loop Match: ! sub esi, lpSource ; sub source from ESI ! mov eax, esi ; put length in eax ! jmp Cleanup Set_Shift: ! mov edi, lpSubStr ; restore sub string address ! mov edx, shift_table[eax*4] ; get char shift value ! cmp edx, subLngth ; is it pattern length ? ! jne Suffix_Shift ; if not, jump to Suffix_Shift Bad_Char_Shift: ! lea ebx, [ebx+ecx+1] ; add bad char shift ! mov ecx, cval ; reset counter in compare loop ! mov esi, ebx ! cmp esi, ExitLen ! jle Cmp_Loop ! mov eax, -1 ! jmp Cleanup Suffix_Shift: ! add ebx, edx ; add suffix shift ! mov ecx, cval ; reset counter in compare loop ! mov esi, ebx ! cmp esi, ExitLen ! jle Cmp_Loop ! mov eax, -1 ; set value for no match Cleanup: ! mov FUNCTION, eax END FUNCTION ' ########################################################################
Comment