Announcement

Collapse

Forum Guidelines

This forum is for finished source code that is working properly. If you have questions about this or any other source code, please post it in one of the Discussion Forums, not here.
See more
See less

Levenshtein string distance algorithm (string similarity test)

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Levenshtein string distance algorithm (string similarity test)

    'Updated6th June 2001, cheers Lance for the optimisations

    'LEVENSHTEIN.BAS - Ported by Wayne Diamond to PB from the VB sample at http://www.merriampark.com/ld.htm
    'Quote from ld.htm:
    ' Levenshtein distance (LD) is a measure of the similarity between two strings,
    ' which we will refer to as the source string (s) and the target string (t). The
    ' distance is the number of deletions, insertions, or substitutions required to
    ' transform s into t. For example, If s is "test" and t is "test", then LD(s,t) = 0,
    ' because no transformations are needed. The strings are already identical. If s is
    ' "test" and t is "tent", then LD(s,t) = 1, because one substitution
    ' (change "s" to "n") is sufficient to transform s into t. The greater
    ' the Levenshtein distance, the more different the strings are.
    ' Levenshtein distance is named after the Russian scientist Vladimir
    ' Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce
    ' Levenshtein, the metric is also sometimes called edit distance.

    Code:
    #COMPILE EXE "lsdist.exe"
      
    '********************************
    '*** Compute Levenshtein Distance
    '********************************
    FUNCTION LD(BYVAL s AS STRING, BYVAL t AS STRING) AS LONG
    DIM d() AS LONG  ' matrix
    DIM m AS LONG    ' length of t
    DIM n AS LONG    ' length of s
    DIM i AS LONG    ' iterates through s
    DIM j AS LONG    ' iterates through t
    DIM s_i AS STRING   ' ith character of s
    DIM t_j AS STRING   ' jth character of t
    DIM cost AS LONG ' cost
      ' Step 1
      n = LEN(s)
      m = LEN(t)
      IF n = 0 THEN
        LD = m
        EXIT FUNCTION
      END IF
      IF m = 0 THEN
        LD = n
        EXIT FUNCTION
      END IF
      REDIM d(0 TO n, 0 TO m) AS LONG
      ' Step 2
      FOR i = 0 TO n
        d(i, 0) = i
      NEXT i
      FOR j = 0 TO m
        d(0, j) = j
      NEXT j
      ' Step 3
      FOR i = 1 TO n
        s_i = MID$(s, i, 1)
        ' Step 4
        FOR j = 1 TO m
          t_j = MID$(t, j, 1)
          ' Step 5
          IF s_i = t_j THEN
            cost = 0
          ELSE
            cost = 1
          END IF
          ' Step 6
          d(i, j) = MIN%(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
        NEXT j
      NEXT i
      ' Step 7
      LD = d(n, m)
      ERASE d
    END FUNCTION
     
    FUNCTION PBMAIN() AS LONG
    DIM LevenshteinDistance AS LONG
    String1$ = "test"
    String2$ = "test"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
     
    String1$ = "pie"
    String2$ = "pies"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
     
    String1$ = "doom"
    String2$ = "wolfenstein"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
    END FUNCTION

    [This message has been edited by Wayne Diamond (edited June 05, 2001).]
    -

  • #2
    You can replace:
    Code:
    d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    with:
    Code:
    d(i, j) = MIN%(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    and remove the redundant function. Also, using LONG integers in PB/CC and PB/DLL is much more efficient than integers - this also ensures that very long strings can be analysed.

    ------------------
    Lance
    PowerBASIC Support
    mailto:support@powerbasic.comsupport@powerbasic.com</A>
    Lance
    mailto:lanceedmonds@xtra.co.nz

    Comment


    • #3
      What if you wanted to compare

      "Test String" &
      "est String"

      Is the difference only 1 ??



      ------------------
      Paul Dwyer
      Network Engineer
      Aussie in Tokyo

      Comment


      • #4
        Paul, yes
        Heres more sample output, enclosed with "" to show space chars:
        Code:
        The Levenshtein Distance between "Test string" and " Test string" is:  1
        The Levenshtein Distance between " Test string" and "Test string" is:  1
        The Levenshtein Distance between "Test string" and "est string" is:  1
        The Levenshtein Distance between "Test string" and "Test strin" is:  1
        The Levenshtein Distance between "Test strin" and " Test string" is:  2

        ------------------
        -

        Comment


        • #5
          Is it possible to get the actual changes (insertions, deleteions needed to make string A equal to string B?

          Thank you

          ------------------
          Pedro Camargo
          Pedro Camargo

          Comment


          • #6
            Wayne, thanks for your code! I am sure this can be sped up even more but I was able to get almost an order of magnitude (~8.5x) increase in speed from your original code (circa 2001)..

            I also added the capability to disable case sensitivity.

            Code:
            FUNCTION fn_LevenshteinDistance( BYVAL v_strText1 AS STRING, BYVAL v_strText2 AS STRING, OPT v_lngCaseMatters AS LONG ) AS LONG
              LOCAL v_lngText1Idx   AS LONG
              LOCAL v_lngText1Len   AS LONG
              LOCAL v_lngText2Idx   AS LONG
              LOCAL v_lngText2Len   AS LONG
              LOCAL v_pbytText1Char AS BYTE PTR
              LOCAL v_pbytText2Char AS BYTE PTR
              v_lngText1Len = LEN( v_strText1 )
              v_lngText2Len = LEN( v_strText2 )
              IF ( v_lngText1Len = 0 ) OR ( v_lngText2Len = 0 ) THEN FUNCTION = MAX%( v_lngText1Len, v_lngText2Len ) : EXIT FUNCTION
              IF ISFALSE( ISMISSING( v_lngCaseMatters )) AND ISFALSE( v_lngCaseMatters ) THEN
                v_strText1 = UCASE$( v_strText1 )
                v_strText2 = UCASE$( v_strText2 )
              END IF
              DIM v_lngMatrix( v_lngText1Len, v_lngText2Len ) AS LONG
              FOR v_lngText1Idx = 0 TO v_lngText1Len
                v_lngMatrix( v_lngText1Idx, 0 ) = v_lngText1Idx
              NEXT v_lngText1Idx
              FOR v_lngText2Idx = 0 TO v_lngText2Len
                v_lngMatrix( 0, v_lngText2Idx ) = v_lngText2Idx
              NEXT v_lngText2Idx
              v_pbytText1Char = STRPTR( v_strText1 )
              v_pbytText2Char = STRPTR( v_strText2 )
              FOR v_lngText1Idx = 1 TO v_lngText1Len
                FOR v_lngText2Idx = 1 TO v_lngText2Len
                  v_lngMatrix( v_lngText1Idx, v_lngText2Idx ) = _
                        MIN%( _
                              v_lngMatrix( v_lngText1Idx - 1, v_lngText2Idx     ) + 1, _
                              v_lngMatrix( v_lngText1Idx,     v_lngText2Idx - 1 ) + 1, _
                              v_lngMatrix( v_lngText1Idx - 1, v_lngText2Idx - 1 ) + ABS( @v_pbytText1Char[ v_lngText1Idx - 1 ] <> @v_pbytText2Char[ v_lngText2Idx - 1 ] ) _
                            )
                NEXT v_lngText2Idx
              NEXT v_lngText1Idx
              FUNCTION = v_lngMatrix( v_lngText1Len, v_lngText2Len )
            END FUNCTION 
            Last edited by George Bleck; 11 Dec 2015, 02:36 PM.
            <b>George W. Bleck</b>
            <img src='http://www.blecktech.com/myemail.gif'>

            Comment

            Working...
            X