Announcement

Collapse
No announcement yet.

Soundex search routine for PB/DLL

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Soundex search routine for PB/DLL

    Code:
    '¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    ' Soundex search is used for searches of similar sounding words/names.
    ' bSoundEx for PB/DLL - returns a 4-byte Soundex string for passed string.
    ' Saw some SoundEx samples here, but unfortunately, none of them fulfills
    ' all "rules" given at http://www.myatt.demon.co.uk/sxalg.htm 
    ' Wrote my own, to fulfill those rules, and to fulfill the need for high
    ' speed. Tested against all names given on above web-page, it returns correct
    ' results, so it seems to be accurate. Public Domain by Borje Hagsten, Sept 2001
    '¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    FUNCTION bSoundEx (BYVAL source AS STRING) AS STRING
      LOCAL J AS BYTE, tmp AS BYTE, _
            Letter AS BYTE PTR, Letter2 AS BYTE PTR, sResult AS STRING
     
      source = UCASE$(source)     'source string must be upper case
      sResult  = "0000"           'initiate zero-padded result string
      Letter2  = STRPTR(sResult)  'pointer to result string
      J = 1                       'position in result string
     
      FOR Letter = STRPTR(source) TO STRPTR(source) + LEN(source)
         IF @Letter > 64 AND @Letter < 91 THEN        'only A-Z are regarded (ASC 65 - 90)
            IF J = 1 THEN                             'if first character in result string
               @Letter2 = @Letter                     'just add what we got
               INCR Letter2 : INCR J                  'increase pointer and pos holder
               SELECT CASE @Letter                    'in case it is a consonant, we must store it because
                                                      'rules says next should not be from same group
                  CASE 66, 70, 80, 86                 : tmp = 49  '1 - BFPV
                  CASE 67, 71, 74, 75, 81, 83, 88, 90 : tmp = 50  '2 - CGJKQSXZ
                  CASE 68, 84                         : tmp = 51  '3 - DT
                  CASE 76                             : tmp = 52  '4 - L
                  CASE 77, 78                         : tmp = 53  '5 - MN
                  CASE 82                             : tmp = 54  '6 - R
               END SELECT
     
            ELSE
               SELECT CASE @Letter                                'else, look at the rest
                  CASE 65, 69, 73, 79, 85, 89 : tmp = 0 : ITERATE 'AEIOUY ignore these and reset tmp
                  CASE 72, 87                           : ITERATE 'HW - ignore these too, but don't reset tmp
                  CASE ELSE                                       '     (HW between consonants are ignored)
                     SELECT CASE @Letter
                        CASE 66, 70, 80, 86                 : @Letter = 49  '1 - BFPV
                        CASE 67, 71, 74, 75, 81, 83, 88, 90 : @Letter = 50  '2 - CGJKQSXZ
                        CASE 68, 84                         : @Letter = 51  '3 - DT
                        CASE 76                             : @Letter = 52  '4 - L
                        CASE 77, 78                         : @Letter = 53  '5 - MN
                        CASE 82                             : @Letter = 54  '6 - R
                        CASE ELSE                           : ITERATE
                     END SELECT
               END SELECT
     
               IF @Letter <> tmp THEN          'no doublettes, please
                  @Letter2 = @Letter : INCR J  'set character value in result string, increase pos
                  IF J > 4 THEN EXIT FOR       'result string is filled, no need to continue
                  INCR Letter2 : tmp = @Letter 'point to next result string character and store value
               END IF
            END IF
         END IF
      NEXT
     
      FUNCTION = sResult 'return the result
    END FUNCTION

    ------------------

  • #2
    Thanks Borje nice find and nice port, that's a clever little algorithm Will be especially useful for databases with usernames, to "Find Similar" rather than "Find" , very nice


    ------------------
    -

    Comment


    • #3
      Thank you, yes - just did a little test against a database with +10,000
      names here and found it to work very well, and fast. Results indicates
      it should be well suited for things like spell checkers too. Have some
      nice wordlists here, so I'll see if I can wrap something up later on,
      to show how powerful/useful this routine can be.


      ------------------

      Comment


      • #4
        Borje --
        Soundex is enough strange and unclear algorithm.
        I prefer Ratcliff/Obershelp algorithm, which was published in 1988, found and translated to PB/DLL by Mike Tora.
        I rewrote a subroutine to inline assembler and made it non-recoursive.
        Not very, but enough fast. Plus results are in 'classic form'.

        Code:
           #Compile Exe
           #Dim All
           #Register None
        
           Function Similarity(String1 As String, String2 As String) As Double
              #Register None
        
              Dim startstr1 As Long, endstr1 As Long, lenstr1 As Long, startstr2 As Long, endstr2 As Long, lenstr2 As Long
              Dim ns1 As Long, ns2 As Long, Maximum As Long, kMaximum As Long, nRef As Long
        
              lenstr1 = Len(String1) - 1
              lenstr2 = Len(String2) - 1
              startstr1 = StrPtr(String1)
              endstr1 = startstr1 + lenstr1
              startstr2 = StrPtr(String2)
              endstr2 = startstr2 + lenstr2
        
              '--- Preserve registers ---
        
              ! PUSH EBX
              ! PUSH EDI
              ! PUSH ESI
        
              '--- Put first element to query ---
              ! PUSH startstr1       ' Address (not index) of first byte in String1
              ! PUSH endstr1         ' Address of last byte in String1
              ! PUSH startstr2       ' Address of first byte in String2
              ! PUSH endstr2         ' Address of last byte in String2
        
         Label001:
              ' Query is empty ?
              ! CMP nRef, 0          ' If nRef < 0 Then _
              ! JL Label009          '    Goto Label009
        
              ' Take a combination from query
              ! POP endstr2
              ! POP startstr2
              ! POP endstr1
              ! POP startstr1
              ! DEC nRef
        
              ! MOV kMaximum, 0      ' kMaximum = 0
        
              ! MOV ESI, startstr1   ' c1 (ESI) = startstr1      ' Replacement of
         Label002:                                               '
              ! CMP ESI, endstr1     ' If c1 > endstr1 Then _    '
              ! JG Label008          '    Goto Label008          ' For c1 = startstr1 To endstr1
        
              ! MOV EDI, startstr2   ' c2 (EDI)  = startstr1     ' Replacement of
         Label003:                                               '
              ! CMP EDI, endstr2     ' If c2 > endstr2 Then _    '
              ! JG Label007          '    Goto Label007          ' For c2 = startstr2 To endstr2
        
              ! MOV EAX, ESI         ' Temporary variables k1 (EAX), k2 (EBX) (are used instead of i)
              ! MOV EBX, EDI         ' k1 = c1: k2 = c2
        
         Label004:
              ! MOV CL, [EAX]        ' Eq. arr1(c1 + i)
              ! MOV CH, [EBX]        ' Eq. arr2(c2 + i)
              ! CMP CL, CH           ' If arr1(c1 + i) <> arr2(c2 + i) Then _
              ! JNE Label006         '    Goto Label006
        
              ! INC EAX              ' Incr k1  (Instead of Incr i)
              ! INC EBX              ' Incr k2
        
              ! MOV ECX, EAX         ' ECX (eq. i) = k1 - c1
              ! SUB ECX, ESI         '
        
              ! CMP ECX, kMaximum    ' If i <= kMaximum Then _
              ! JLE Label005         '    Goto Label005
        
              ! MOV ns1, ESI         ' ns1 = c1
              ! MOV ns2, EDI         ' ns2 = c2
              ! MOV kMaximum, ECX    ' kMaximum = i
        
         Label005:
              ! CMP EAX, endstr1     ' If c1 > endstr1 Then _
              ! JG Label006          '    Goto Label006 (Exit Do)
        
              ! CMP EBX, endstr2     ' If c2 > endstr2 Then _
              ! JLE Label004         '    Goto Label004 (Exit Do)
        
         Label006:
              ! INC EDI              ' Incr c2
              ! JMP Label003         ' Goto Label003 (eq. Next c2)
        
         Label007:
              ! INC ESI              ' Incr c1
              ! JMP Label002         ' Goto Label003 (eq. Next c1)
        
         Label008:
              ! CMP kMaximum, 0      ' If kMaximum = 0 Then _
              ! JE Label001          '    Goto Label001
        
              ' Instead of SubStrSim(ns1 + Maximum, endstr1, ns2 + Maximum, endstr2)
              ! INC nRef             ' Next combination in query
        
              ! MOV ESI, ns1
              ! ADD ESI, kMaximum    ' ns1 + Maximum
              ! MOV EDI, ns2
              ! ADD EDI, kMaximum    ' ns2 + Maximum
        
              ! PUSH ESI
              ! PUSH endstr1
              ! PUSH EDI
              ! PUSH endstr2
        
              ' Instead of SubStrSim(startstr1, ns1 - 1, startstr2, ns2 - 1)
              ! DEC ns1              ' ns1 - 1
              ! DEC ns2              ' ns2 - 1
        
              ! INC nRef             ' Next combination in query
              ! PUSH startstr1
              ! PUSH ns1
              ! PUSH startstr2
              ! PUSH ns2
        
              ! MOV ECX, kMaximum    ' Maximum =
              ! ADD Maximum, ECX     '    Maximum + kMaximum
              ! JMP Label001         ' Continue
        
              ' Restore registers
         Label009:
              ! POP ESI
              ! POP EDI
              ! POP EBX
        
              Similarity = Maximum / (lenstr1 + lenstr2 + 2) * 2
        
           End Function
           
           Function PbMain
              Local simindex As Double, str1 As String, str2 As String, result As String, _
                 i As Long, start As Single, finish As Single
        
              str1 = UCase$("Borje Hagsten")
              str2 = UCase$("Boje Hasgten") ' With errors
        
              start = Timer
              For i = 1 To 100000
                 simindex = Similarity(str1, str2)
              Next
              finish = Timer
        
              result = Format$(simindex * 100, "##.##")
              MsgBox "String Similarity Index = " & result & "%. Time =" & Format$(finish - start, "#,.#####") & " Secs / 100000 calls"
           End Function
        ------------------
        E-MAIL: matus@perevozki.ru

        Comment


        • #5
          Thanks Semen, highly interesting piece of code. For my own word processor,
          Tolken99, I wrote something like it for the spell checker, calculating a
          similarity percentage instead of a weighted value, since that is far more
          reliable when working with other languages than English.

          That is backside of the soundex algo - it almost only works with English
          names/words and yes, it is a bit strange algo..


          ------------------

          Comment


          • #6
            Many thanks for the mention Semen, but you did most of the work!

            I have used the Soundex algorithm and some variants I have developed for matching words and concatenated and sorted phrases on many projects.
            They have all worked extremely well. Semen mentions the Ratcliff Obershelp algorithm which has layed unused in the research material on the web since the early eighties.
            This algorithm is one of a group of approaches used for approximate string matching. It is very very useful in the right circumstances for calculating the 'quality' of match between two strings.
            As a measure of 'edit distance' it is one of a number of approaches which include the calculation of Hamming Distance and Levenshtein Distance. The OR algorithm is the least expensive to calculate.
            Its one weakness is that, because it in effect calculates the longest common substring(s) to the two strings to be compared, the measure does not implicitly include any information about the 'order' of the common characters.
            I have addressed this in a variant of the algorithm and it makes the approach even more accurate in approximate string matching tasks.

            Hutch has done brilliant work in providing a fast implementation of the Boyer Moore exact string matching algorithm. Search the forums for this excellent work.

            Whilst trying to develop these approaches I found it difficult to separate the wood from the trees when reviewing the research material on the net. So if anyone wants any pointers on this stuff, I would be happy to steer them in the right direction.
            These forums have been a godsend to me on many a dark night. IMHO the best resource on the net!:-)

            Regards

            Mike Letchford - (tora)



            ------------------
            Mike Letchford

            Comment

            Working...
            X