Florent,
Thanks for finding this one, failure to do last character compare
which was easy to fix but it caused another repeat error so I have some
more work to do in the testing and debugging.
Bern,
Thanks for doing the tests, I will try and get the next one a bit
more reliable.
Regards,
[email protected]
------------------
Announcement
Collapse
No announcement yet.
Test search algorithm
Collapse
X
-
Hi Steve
the search fails on similar len searches close to one another
as in the following test case:
Code:FUNCTION PBMAIN() AS LONG LOCAL sSearch AS STRING LOCAL sPattern AS STRING LOCAL sTable AS STRING LOCAL lPos AS LONG sSearch = "hello there1223 then what's that where1223 are you" sPattern = "where1223" sTable = create_shift_table( sPattern ) lPos = BMsearch( STRPTR( sTable), 0, STRPTR( sSearch), LEN( sSearch ), STRPTR( sPattern), LEN( sPattern )) PRINT "BMSearch found '" + sPattern + "' at Pos: " + FORMAT$( lPos ) PRINT "INSTR found '" + sPattern + "' at Pos: " + FORMAT$( INSTR(sSearch, sPattern) ) WAITKEY$ END FUNCTION
Leave a comment:
-
OK Steve, I reached a good stopping point for the day with my code
and I tested your code.
I used the following test program in conjuction with the Forum4.Dat
file which Borje made for POFFS V1. The DB is roughly 12MB. I only
tested the cases listed in the code.
First my code, results of tests below:
Code:#COMPILE EXE #INCLUDE "WIN32API.INC" 'For GetTickCount 'Steve's ASM Boyer-Moore functions saved to an .INC file #INCLUDE "C:\PBDLL60\DOWNLO~1\BOYERM~1\SEARCH.INC" FUNCTION PBMAIN() AS LONG 'I hard coded the path to the .DAT file, you would likely need to change to use this program.. OPEN "C:\PBDLL60\DOWNLOADS\POFFS\FORUM4.DAT" FOR BINARY AS 1 GET$ 1, LOF( 1), A$ CLOSE 1 AtStartOfDB$ = "Has anyone experienced this 'bug' too?" InMiddleOfDB$ = "PB/DLL and all other PowerBASIC compilers are compatible with Windows 2000." NearEndOfBD$ = "You are in the wrong place for this part of this discussion " NotInDB$ = "*&^@$*&%[email protected]!VH Y^WV$*#RT$GV FO*&H!MC()&EYM(@&TGDNI&^ASRao*)^*#$" ' B$ = AtStartOfDB$ ' B$ = InMiddleOfDB$ ' B$ = NearEndOfBD$ B$ = NotInDB$ Start??? = GetTickCount Table$ = create_shift_table( B$) Address& = BMsearch( STRPTR( Table$), 0, STRPTR( a$), LEN( A$), STRPTR( B$), LEN( B$)) Fin??? = GetTickCount Start2??? = GetTickCount Address2& = INSTR( 1, A$, B$) Fin2??? = GetTickCount MSGBOX "Boyer Moore found B$ in the database at position" + STR$( Address&+1) + " in" + STR$( Fin??? - Start???) + " ticks." + $CRLF + _ "INSTR found B$ in the database at position" + STR$( Address2&) + " in" + STR$( Fin2??? - Start2???) + " ticks." END FUNCTION
It appears to be working correctly. I would need to test searches
starting at offset values and using search strings with CHR$(0), etc.,
before saying it works 100% though.
On my Win98SE I got the following results:
Code:Boyer-Moore INSTR AtStartOfDB$ ---> 0 ticks 0 ticks InMiddleOfDB$ --> 26-30 ticks 89-91 ticks NearEndOfBD$ ---> 64-66 ticks 189-195 ticks NotInDB$ -------> 45-50 ticks 190-201 ticks
------------------
Bernard Ertl
Leave a comment:
-
Semen,
A reverse compare in many instances in the BinSearch algo would
make the subloop exit faster as it would have less comparison to
perform to find a mismatch and it is a variation I will try when
I have a bit more time but it does not address the area where
the real speed difference is occurring.
The Boyer Moore algo is faster in its forward read speed that a
normal byte scan and as the Boyer Moore search string gets longer
the forward read speed gets faster as it can step forward without
having to compare all bytes.
Paul,
We may just have a terminology difference here, I use binary search
to differentiate from text search which is a different type of
string matching algorithm in the historical data. The usage I have
is for a search algorithm to handle the full 256 byte range.
I could not get a copy of Bob Boyer's original paper from 1977 even
though it is referenced in many places on the net but a google search
on "boyer moore" will turn you up a lot of stuff on search algorithms
including some java animations of variations and different types.
From the info I could find, Bob Boyer's original idea had the best
logic and it seemed that the variations (horsepool etc ...) were
based on character count maths and the technology of ANSI C string
handling rather than how fast it could be run through a computer.
Bern,
I am glad someone knew what a Boyer Moore algo was, any testing
you have time to do will be most appreciated as I have given it
a hammering over a wide range of things but there is always the
risk that I have missed a circumstance where it may fail.
Regards,
[email protected]
------------------
Leave a comment:
-
Paul,
The Boyer-Moore search algorithm was part of my college curriculum
"ages" ago, you should be able to find a description of it fairly easily.
Steve, next time I run into a stumbling block and need some time out,
I'll give your algorithm a run on my Win98 SE...
------------------
Bernard Ertl
Leave a comment:
-
Binary searches don't search like that, they split in two and jump thus binary, they are not exaustive searches. you check the middle of the range and if higher then search the middle of the top half etc to you find your target
Binary Searches only work on sorted data, as do the slightly superior Interpolation searches (which guess at the distance of the jump).
My ASM is pretty poor but I am interested in this alogorithm and would like to hear more about how it works, do you have any links of spec you are coding from?
Cheers
------------------
Paul Dwyer
Network Engineer
Aussie in Tokyo
[This message has been edited by Paul Dwyer (edited April 03, 2001).]
Leave a comment:
-
Steve --
In BinSearch you compare bytes from left to right.
Did you try simply to change an order (from right to left) ?
------------------
E-MAIL: [email protected]
Leave a comment:
-
Mel,
I have a test setup for 3 search methods, the Boyer Moore, a
binary search algo I posted recently and the standard INSTR.
The test times for each on a single string of 34 meg with the
text being searched for at the end is as follows,
Boyer Moore = 315 milliseconds
BinSearch = 450 milliseconds
INSTR = 745 milliseconds
The Boyer Moore starts with the overhead of loading the shift
table and is still considerably faster. It is an algorithm
originally designed for binary search so the comparison is fair
as all 3 methods are capable of searching the full ascii character
range.
It is not well suited for recursive search with different strings
of 6 characters or under because of the overhead of creating the
table but if you have to search every file on a HDD for a virus
pattern, perform pattern matching on very large files or perform
text search on a very large number of files, it has generally
outperformed anything else.
What I need if anyone has the time and does this type of work where
speed truly matters is some independent testing. If you don't do
this type of patern matching, use INSTR, it works well in most
normal instances.
Regards,
[email protected]
------------------
Leave a comment:
-
Since I have no idea of what kind of data you are searching for,
I will give you an example of my particular kind of "search
engine" and you can take it from there if you want.
SearchString$ = {something$}
Open "somefile.txt" for binary as #1
get$ #1,16000,t1$
do
get$ #1,16000 t2$
test$ = t1$ + t2$
p = instr(SearchString$,test$)
if p > 0 then
REM Do something with it
end if
if lof(1) = loc(1) then exit loop
t1$ = t2$
loop
close #1
This search technique, to me at least, is very fast. It's also
short, sweet and to the point. This may give you some kind of
comparison. Hope it helps.
------------------
[This message has been edited by Mel Bishop (edited April 02, 2001).]
Leave a comment:
-
Test search algorithm
I have pasted in below a working version of a Boyer Moore string
matching algorithm that is testing OK here. A Boyer Moore string
matching algorithm differs from most search functions in that it
requires a table to be loaded dynamically before the algorithm is
run which reduces its suitability for short recursive searches.
Its advantage is that once the table is loaded, it has a substantial
speed advantage over more conventional search function if the
data being searched for is over about 6 characters in length. This
makes it suitable for searching very large bodies of information
such as complete disk partitions, very large files etc ...
What I need if anyone has the time is some independent testing as
it appears to be working OK here on the range of test material I
have run it on.
The technique requires 2 functions to be run, pass the string being searched
for to the "create_shift_table" function which creates the table
in OLE string memory as its return value.
The second function is the Boyer Moore search algorithm which takes the
following parameters,
Code:ByVal lp_shift as LONG, = StrPtr(shift_table$) ByVal startpos as LONG, = 0 based offset to start search from ByVal lpsource as LONG, = StrPtr(source$) ByVal lnsource as LONG, = len(source$) ByVal lptarget as LONG, = StrPtr(search$) ByVal lntarget as LONG = len(search$)
address, the function will search any address range, not just
string data. The design of two seperate function is so that different
types of searches can be performed, this table is set up for binary
search so it will only do case sensitive search in string data.
I am led to believe that J Moore coded Bob Boyer's original idea
in PDP-10 assembler back in 1977 so coding this algo in a high
performance language in x86 assembler is probably a reasonable
way of doing justice to its design.
Any feedback would be appreciated.
Regards,
[email protected]
Code:' ######################################################################### FUNCTION create_shift_table(srch$) as STRING #REGISTER NONE LOCAL chTable as STRING ' string to build table in LOCAL lpTable as LONG ' address of table LOCAL lpSrch as LONG ' address of search string LOCAL ssl as LONG ' search string length chTable = space$(1024) ' allocate 256 * 4 bytes ssl = len(srch$) ' get search string length lpSrch = StrPtr(srch$) ' get search string address lptable = StrPtr(chTable) ' get table address ' ------------------------------------------------------ ' fill DWORD table with the length of the search string ' ------------------------------------------------------ ! cld ; clear direction flag to read forward ! mov ecx, 256 ! mov edi, lpTable ! mov eax, ssl ; write search string length to table ! rep stosd ! mov ecx, ssl ; put search string length in ECX ! dec ecx ! mov esi, lpSrch ; address search string ! mov edi, lpTable ; table address ! xor eax, eax ; zero EAX ' ----------------------------------------------- ' write the shift vaue for each search character ' into the correct position in the shift table ' ----------------------------------------------- Write_Shift_Characters: ! mov al, [esi] ; get the character ! inc esi ! mov [edi+eax*4], ecx ; write shift for each character ! dec ecx ; to ascii location in table ! jnz Write_Shift_Characters FUNCTION = chTable END FUNCTION ' ######################################################################### FUNCTION BMsearch(ByVal lp_shift as LONG, _ ByVal startpos as LONG, _ ByVal lpsource as LONG, _ ByVal lnsource as LONG, _ ByVal lptarget as LONG, _ ByVal lntarget as LONG) as LONG #REGISTER NONE LOCAL resetpos as LONG LOCAL buffrend as LONG ! mov esi, lpsource ; source address ! mov edi, lpTarget ; target address ! mov ecx, lntarget ! dec ecx ; correct back one to read last character pair ! add esi, ecx ; set esi to first end compare ! add edi, ecx ; set edi to first end compare ! mov resetpos, edi ; set reset position for shift loop ! mov ebx, esi ; use EBX as shift location counter ' ------------------------------ ' set end of source in buffrend ' ------------------------------ ! mov edx, lpsource ; source address ! add edx, lnsource ; add source length ! inc edx ; end of buffer + 1 ! mov buffrend, edx ! mov edx, lp_shift ; use EDX as address of shift table ! dec lntarget ; dec target length for shift loop ! xor eax, eax ; clear EAX ! add esi, startpos ; move ESI forward by starting position bmMainLoop: ! mov al, [esi] ! dec esi ! dec edi ! cmp al, [edi+1] ! jne bmSetShift ! dec ecx ! jnz bmMainLoop ; if ECX = 0, match has occurred ! jmp bmCalcLength ; exit on match bmSetShift: ! add ebx, [edx+(eax*4)]; add shift to location counter ! mov esi, ebx ; copy location counter + shift to ESI ! mov ecx, lntarget ; reset ECX to target length ! mov edi, resetpos ; reset EDI to end char of target string ! cmp esi, buffrend ; test end of buffer condition ! jbe bmMainLoop ; if not end of buffer, do next string compare ! jmp bmNoMatch ; else set no match and exit bmCalcLength: ! sub esi, lpsource ! mov eax, esi ! jmp bmexitpos bmNoMatch: ! mov eax, -1 bmexitpos: ! mov FUNCTION, eax END FUNCTION ' #########################################################################
[This message has been edited by Steve Hutchesson (edited April 02, 2001).]Tags: None
Leave a comment: