Announcement

Collapse
No announcement yet.

Test search algorithm

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Steve Hutchesson
    replied
    Florent,

    Thanks for finding this one, failure to do last character compare
    which was easy to fix but it caused another repeat error so I have some
    more work to do in the testing and debugging.

    Bern,

    Thanks for doing the tests, I will try and get the next one a bit
    more reliable.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Florent Heyworth
    replied
    Hi Steve

    the search fails on similar len searches close to one another
    as in the following test case:

    Code:
    FUNCTION PBMAIN() AS LONG
       LOCAL sSearch AS STRING
       LOCAL sPattern AS STRING
       LOCAL sTable AS STRING
       LOCAL lPos AS LONG
     
       sSearch = "hello there1223 then what's that where1223 are you"
        
       sPattern = "where1223"
        
       sTable = create_shift_table( sPattern )
       lPos = BMsearch( STRPTR( sTable), 0, STRPTR( sSearch), LEN( sSearch ), STRPTR( sPattern), LEN( sPattern ))
       PRINT "BMSearch found '" + sPattern + "' at Pos: " + FORMAT$( lPos )
        
       PRINT "INSTR found '" + sPattern + "' at Pos: " + FORMAT$( INSTR(sSearch, sPattern) )
     
       WAITKEY$
    END FUNCTION
    ------------------

    Leave a comment:


  • Bern Ertl
    replied
    OK Steve, I reached a good stopping point for the day with my code
    and I tested your code.

    I used the following test program in conjuction with the Forum4.Dat
    file which Borje made for POFFS V1. The DB is roughly 12MB. I only
    tested the cases listed in the code.

    First my code, results of tests below:

    Code:
    #COMPILE EXE
    #INCLUDE "WIN32API.INC"  'For GetTickCount
    
    
    'Steve's ASM Boyer-Moore functions saved to an .INC file
    #INCLUDE "C:\PBDLL60\DOWNLO~1\BOYERM~1\SEARCH.INC"
    
    
    FUNCTION PBMAIN() AS LONG
    
    
       'I hard coded the path to the .DAT file, you would likely need to change to use this program..
       OPEN "C:\PBDLL60\DOWNLOADS\POFFS\FORUM4.DAT" FOR BINARY AS 1
       GET$ 1, LOF( 1), A$
       CLOSE 1
    
    
       AtStartOfDB$ = "Has anyone experienced this 'bug' too?"
       InMiddleOfDB$ = "PB/DLL and all other PowerBASIC compilers are compatible with Windows 2000."
       NearEndOfBD$ = "You are in the wrong place for this part of this discussion  "
       NotInDB$ = "*&^@$*&%[email protected]!VH Y^WV$*#RT$GV FO*&H!MC()&EYM(@&TGDNI&^ASRao*)^*#$"
    
    
    '   B$ = AtStartOfDB$
    '   B$ = InMiddleOfDB$
    '   B$ = NearEndOfBD$
       B$ = NotInDB$
    
    
       Start??? = GetTickCount
       Table$ = create_shift_table( B$)
       Address& = BMsearch( STRPTR( Table$), 0, STRPTR( a$), LEN( A$), STRPTR( B$), LEN( B$))
       Fin??? = GetTickCount
    
    
       Start2??? = GetTickCount
       Address2& = INSTR( 1, A$, B$)
       Fin2??? = GetTickCount
    
    
       MSGBOX "Boyer Moore found B$ in the database at position" + STR$( Address&+1) + " in" + STR$( Fin??? - Start???) + " ticks." + $CRLF + _
              "INSTR found B$ in the database at position" + STR$( Address2&) + " in" + STR$( Fin2??? - Start2???) + " ticks."
    
    
    END FUNCTION
    In all cases, your BMSearch returned the same result as INSTR.
    It appears to be working correctly. I would need to test searches
    starting at offset values and using search strings with CHR$(0), etc.,
    before saying it works 100% though.

    On my Win98SE I got the following results:

    Code:
                      Boyer-Moore      INSTR
    AtStartOfDB$ ---> 0 ticks          0 ticks
    
    
    InMiddleOfDB$ --> 26-30 ticks      89-91 ticks
    
    
    NearEndOfBD$ ---> 64-66 ticks      189-195 ticks
    
    
    NotInDB$ -------> 45-50 ticks      190-201 ticks
    Thanks for the code Steve!




    ------------------
    Bernard Ertl

    Leave a comment:


  • Steve Hutchesson
    replied
    Semen,

    A reverse compare in many instances in the BinSearch algo would
    make the subloop exit faster as it would have less comparison to
    perform to find a mismatch and it is a variation I will try when
    I have a bit more time but it does not address the area where
    the real speed difference is occurring.

    The Boyer Moore algo is faster in its forward read speed that a
    normal byte scan and as the Boyer Moore search string gets longer
    the forward read speed gets faster as it can step forward without
    having to compare all bytes.

    Paul,

    We may just have a terminology difference here, I use binary search
    to differentiate from text search which is a different type of
    string matching algorithm in the historical data. The usage I have
    is for a search algorithm to handle the full 256 byte range.

    I could not get a copy of Bob Boyer's original paper from 1977 even
    though it is referenced in many places on the net but a google search
    on "boyer moore" will turn you up a lot of stuff on search algorithms
    including some java animations of variations and different types.

    From the info I could find, Bob Boyer's original idea had the best
    logic and it seemed that the variations (horsepool etc ...) were
    based on character count maths and the technology of ANSI C string
    handling rather than how fast it could be run through a computer.

    Bern,

    I am glad someone knew what a Boyer Moore algo was, any testing
    you have time to do will be most appreciated as I have given it
    a hammering over a wide range of things but there is always the
    risk that I have missed a circumstance where it may fail.

    Regards,

    [email protected]



    ------------------

    Leave a comment:


  • Bern Ertl
    replied
    Paul,

    The Boyer-Moore search algorithm was part of my college curriculum
    "ages" ago, you should be able to find a description of it fairly easily.

    Steve, next time I run into a stumbling block and need some time out,
    I'll give your algorithm a run on my Win98 SE...



    ------------------
    Bernard Ertl

    Leave a comment:


  • Paul Dwyer
    replied
    Binary searches don't search like that, they split in two and jump thus binary, they are not exaustive searches. you check the middle of the range and if higher then search the middle of the top half etc to you find your target

    Binary Searches only work on sorted data, as do the slightly superior Interpolation searches (which guess at the distance of the jump).

    My ASM is pretty poor but I am interested in this alogorithm and would like to hear more about how it works, do you have any links of spec you are coding from?

    Cheers


    ------------------
    Paul Dwyer
    Network Engineer
    Aussie in Tokyo

    [This message has been edited by Paul Dwyer (edited April 03, 2001).]

    Leave a comment:


  • Semen Matusovski
    replied
    Steve --
    In BinSearch you compare bytes from left to right.
    Did you try simply to change an order (from right to left) ?


    ------------------
    E-MAIL: [email protected]

    Leave a comment:


  • Steve Hutchesson
    replied
    Mel,

    I have a test setup for 3 search methods, the Boyer Moore, a
    binary search algo I posted recently and the standard INSTR.

    The test times for each on a single string of 34 meg with the
    text being searched for at the end is as follows,

    Boyer Moore = 315 milliseconds
    BinSearch = 450 milliseconds
    INSTR = 745 milliseconds

    The Boyer Moore starts with the overhead of loading the shift
    table and is still considerably faster. It is an algorithm
    originally designed for binary search so the comparison is fair
    as all 3 methods are capable of searching the full ascii character
    range.

    It is not well suited for recursive search with different strings
    of 6 characters or under because of the overhead of creating the
    table but if you have to search every file on a HDD for a virus
    pattern, perform pattern matching on very large files or perform
    text search on a very large number of files, it has generally
    outperformed anything else.

    What I need if anyone has the time and does this type of work where
    speed truly matters is some independent testing. If you don't do
    this type of patern matching, use INSTR, it works well in most
    normal instances.

    Regards,

    [email protected]

    ------------------

    Leave a comment:


  • Mel Bishop
    replied
    Since I have no idea of what kind of data you are searching for,
    I will give you an example of my particular kind of "search
    engine" and you can take it from there if you want.

    SearchString$ = {something$}
    Open "somefile.txt" for binary as #1
    get$ #1,16000,t1$

    do
    get$ #1,16000 t2$

    test$ = t1$ + t2$
    p = instr(SearchString$,test$)
    if p > 0 then
    REM Do something with it
    end if

    if lof(1) = loc(1) then exit loop

    t1$ = t2$
    loop
    close #1

    This search technique, to me at least, is very fast. It's also
    short, sweet and to the point. This may give you some kind of
    comparison. Hope it helps.

    ------------------


    [This message has been edited by Mel Bishop (edited April 02, 2001).]

    Leave a comment:


  • Steve Hutchesson
    started a topic Test search algorithm

    Test search algorithm

    I have pasted in below a working version of a Boyer Moore string
    matching algorithm that is testing OK here. A Boyer Moore string
    matching algorithm differs from most search functions in that it
    requires a table to be loaded dynamically before the algorithm is
    run which reduces its suitability for short recursive searches.

    Its advantage is that once the table is loaded, it has a substantial
    speed advantage over more conventional search function if the
    data being searched for is over about 6 characters in length. This
    makes it suitable for searching very large bodies of information
    such as complete disk partitions, very large files etc ...

    What I need if anyone has the time is some independent testing as
    it appears to be working OK here on the range of test material I
    have run it on.

    The technique requires 2 functions to be run, pass the string being searched
    for to the "create_shift_table" function which creates the table
    in OLE string memory as its return value.

    The second function is the Boyer Moore search algorithm which takes the
    following parameters,

    Code:
    ByVal lp_shift as LONG, = StrPtr(shift_table$)
    ByVal startpos as LONG, = 0 based offset to start search from
    ByVal lpsource as LONG, = StrPtr(source$)
    ByVal lnsource as LONG, = len(source$)
    ByVal lptarget as LONG, = StrPtr(search$)
    ByVal lntarget as LONG  = len(search$)
    Apart from the address of the shift table which must be a string
    address, the function will search any address range, not just
    string data. The design of two seperate function is so that different
    types of searches can be performed, this table is set up for binary
    search so it will only do case sensitive search in string data.

    I am led to believe that J Moore coded Bob Boyer's original idea
    in PDP-10 assembler back in 1977 so coding this algo in a high
    performance language in x86 assembler is probably a reasonable
    way of doing justice to its design.

    Any feedback would be appreciated.

    Regards,

    [email protected]

    Code:
      ' #########################################################################
      
      FUNCTION create_shift_table(srch$) as STRING
      
          #REGISTER NONE
      
          LOCAL chTable as STRING     ' string to build table in
          LOCAL lpTable as LONG       ' address of table
          LOCAL lpSrch  as LONG       ' address of search string
          LOCAL ssl     as LONG       ' search string length
      
          chTable = space$(1024)      ' allocate 256 * 4 bytes
          ssl     = len(srch$)        ' get search string length
          lpSrch  = StrPtr(srch$)     ' get search string address
          lptable = StrPtr(chTable)   ' get table address
      
        ' ------------------------------------------------------
        ' fill DWORD table with the length of the search string
        ' ------------------------------------------------------
      
          ! cld                   ; clear direction flag to read forward
      
          ! mov ecx, 256
          ! mov edi, lpTable
          ! mov eax, ssl          ; write search string length to table
          ! rep stosd
      
          ! mov ecx, ssl          ; put search string length in ECX
          ! dec ecx
          ! mov esi, lpSrch       ; address search string
          ! mov edi, lpTable      ; table address
      
          ! xor eax, eax          ; zero EAX
      
        ' -----------------------------------------------
        ' write the shift vaue for each search character
        ' into the correct position in the shift table
        ' -----------------------------------------------
        Write_Shift_Characters:
          ! mov al, [esi]         ; get the character
          ! inc esi
          ! mov [edi+eax*4], ecx  ; write shift for each character
          ! dec ecx               ; to ascii location in table
          ! jnz Write_Shift_Characters
      
          FUNCTION = chTable
      
      END FUNCTION
      
      ' #########################################################################
      
      FUNCTION BMsearch(ByVal lp_shift as LONG, _
                        ByVal startpos as LONG, _
                        ByVal lpsource as LONG, _
                        ByVal lnsource as LONG, _
                        ByVal lptarget as LONG, _
                        ByVal lntarget as LONG) as LONG
      
          #REGISTER NONE
      
          LOCAL resetpos as LONG
          LOCAL buffrend as LONG
      
          ! mov esi, lpsource     ; source address
          ! mov edi, lpTarget     ; target address
      
          ! mov ecx, lntarget
          ! dec ecx               ; correct back one to read last character pair
      
          ! add esi, ecx          ; set esi to first end compare
          ! add edi, ecx          ; set edi to first end compare
      
          ! mov resetpos, edi     ; set reset position for shift loop
      
          ! mov ebx, esi          ; use EBX as shift location counter
      
        ' ------------------------------
        ' set end of source in buffrend
        ' ------------------------------
          ! mov edx, lpsource     ; source address
          ! add edx, lnsource     ; add source length
          ! inc edx               ; end of buffer + 1
          ! mov buffrend, edx
      
          ! mov edx, lp_shift     ; use EDX as address of shift table
      
          ! dec lntarget          ; dec target length for shift loop
          ! xor eax, eax          ; clear EAX
      
          ! add esi, startpos     ; move ESI forward by starting position
      
        bmMainLoop:
          ! mov al, [esi]
          ! dec esi
          ! dec edi
          ! cmp al, [edi+1]
          ! jne bmSetShift
          ! dec ecx
          ! jnz bmMainLoop        ; if ECX = 0, match has occurred
      
          ! jmp bmCalcLength      ; exit on match
      
        bmSetShift:
          ! add ebx, [edx+(eax*4)]; add shift to location counter
          ! mov esi, ebx          ; copy location counter + shift to ESI      
          ! mov ecx, lntarget     ; reset ECX to target length
          ! mov edi, resetpos     ; reset EDI to end char of target string
          ! cmp esi, buffrend     ; test end of buffer condition
          ! jbe bmMainLoop        ; if not end of buffer, do next string compare
      
          ! jmp bmNoMatch         ; else set no match and exit
      
        bmCalcLength:
          ! sub esi, lpsource
          ! mov eax, esi
          ! jmp bmexitpos
      
        bmNoMatch:
          ! mov eax, -1
      
        bmexitpos:
          ! mov FUNCTION, eax
      
      END FUNCTION
      
      ' #########################################################################
    ------------------


    [This message has been edited by Steve Hutchesson (edited April 02, 2001).]
Working...
X