Announcement

Collapse

Forum Guidelines

This forum is for finished source code that is working properly. If you have questions about this or any other source code, please post it in one of the Discussion Forums, not here.
See more
See less

Levenshtein string distance algorithm (string similarity test)

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Levenshtein string distance algorithm (string similarity test)

    'Updated6th June 2001, cheers Lance for the optimisations

    'LEVENSHTEIN.BAS - Ported by Wayne Diamond to PB from the VB sample at http://www.merriampark.com/ld.htm
    'Quote from ld.htm:
    ' Levenshtein distance (LD) is a measure of the similarity between two strings,
    ' which we will refer to as the source string (s) and the target string (t). The
    ' distance is the number of deletions, insertions, or substitutions required to
    ' transform s into t. For example, If s is "test" and t is "test", then LD(s,t) = 0,
    ' because no transformations are needed. The strings are already identical. If s is
    ' "test" and t is "tent", then LD(s,t) = 1, because one substitution
    ' (change "s" to "n") is sufficient to transform s into t. The greater
    ' the Levenshtein distance, the more different the strings are.
    ' Levenshtein distance is named after the Russian scientist Vladimir
    ' Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce
    ' Levenshtein, the metric is also sometimes called edit distance.

    Code:
    #COMPILE EXE "lsdist.exe"
      
    '********************************
    '*** Compute Levenshtein Distance
    '********************************
    FUNCTION LD(BYVAL s AS STRING, BYVAL t AS STRING) AS LONG
    DIM d() AS LONG  ' matrix
    DIM m AS LONG    ' length of t
    DIM n AS LONG    ' length of s
    DIM i AS LONG    ' iterates through s
    DIM j AS LONG    ' iterates through t
    DIM s_i AS STRING   ' ith character of s
    DIM t_j AS STRING   ' jth character of t
    DIM cost AS LONG ' cost
      ' Step 1
      n = LEN(s)
      m = LEN(t)
      IF n = 0 THEN
        LD = m
        EXIT FUNCTION
      END IF
      IF m = 0 THEN
        LD = n
        EXIT FUNCTION
      END IF
      REDIM d(0 TO n, 0 TO m) AS LONG
      ' Step 2
      FOR i = 0 TO n
        d(i, 0) = i
      NEXT i
      FOR j = 0 TO m
        d(0, j) = j
      NEXT j
      ' Step 3
      FOR i = 1 TO n
        s_i = MID$(s, i, 1)
        ' Step 4
        FOR j = 1 TO m
          t_j = MID$(t, j, 1)
          ' Step 5
          IF s_i = t_j THEN
            cost = 0
          ELSE
            cost = 1
          END IF
          ' Step 6
          d(i, j) = MIN%(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
        NEXT j
      NEXT i
      ' Step 7
      LD = d(n, m)
      ERASE d
    END FUNCTION
     
    FUNCTION PBMAIN() AS LONG
    DIM LevenshteinDistance AS LONG
    String1$ = "test"
    String2$ = "test"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
     
    String1$ = "pie"
    String2$ = "pies"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
     
    String1$ = "doom"
    String2$ = "wolfenstein"
    LevenshteinDistance = LD(String1$, String2$)
    STDOUT "The Levenshtein Distance between " & String1$ & " and " & String2$ & " is: " & STR$(LevenshteinDistance)
    END FUNCTION

    [This message has been edited by Wayne Diamond (edited June 05, 2001).]

  • George Bleck
    replied
    Wayne, thanks for your code! I am sure this can be sped up even more but I was able to get almost an order of magnitude (~8.5x) increase in speed from your original code (circa 2001)..

    I also added the capability to disable case sensitivity.

    Code:
    FUNCTION fn_LevenshteinDistance( BYVAL v_strText1 AS STRING, BYVAL v_strText2 AS STRING, OPT v_lngCaseMatters AS LONG ) AS LONG
      LOCAL v_lngText1Idx   AS LONG
      LOCAL v_lngText1Len   AS LONG
      LOCAL v_lngText2Idx   AS LONG
      LOCAL v_lngText2Len   AS LONG
      LOCAL v_pbytText1Char AS BYTE PTR
      LOCAL v_pbytText2Char AS BYTE PTR
      v_lngText1Len = LEN( v_strText1 )
      v_lngText2Len = LEN( v_strText2 )
      IF ( v_lngText1Len = 0 ) OR ( v_lngText2Len = 0 ) THEN FUNCTION = MAX%( v_lngText1Len, v_lngText2Len ) : EXIT FUNCTION
      IF ISFALSE( ISMISSING( v_lngCaseMatters )) AND ISFALSE( v_lngCaseMatters ) THEN
        v_strText1 = UCASE$( v_strText1 )
        v_strText2 = UCASE$( v_strText2 )
      END IF
      DIM v_lngMatrix( v_lngText1Len, v_lngText2Len ) AS LONG
      FOR v_lngText1Idx = 0 TO v_lngText1Len
        v_lngMatrix( v_lngText1Idx, 0 ) = v_lngText1Idx
      NEXT v_lngText1Idx
      FOR v_lngText2Idx = 0 TO v_lngText2Len
        v_lngMatrix( 0, v_lngText2Idx ) = v_lngText2Idx
      NEXT v_lngText2Idx
      v_pbytText1Char = STRPTR( v_strText1 )
      v_pbytText2Char = STRPTR( v_strText2 )
      FOR v_lngText1Idx = 1 TO v_lngText1Len
        FOR v_lngText2Idx = 1 TO v_lngText2Len
          v_lngMatrix( v_lngText1Idx, v_lngText2Idx ) = _
                MIN%( _
                      v_lngMatrix( v_lngText1Idx - 1, v_lngText2Idx     ) + 1, _
                      v_lngMatrix( v_lngText1Idx,     v_lngText2Idx - 1 ) + 1, _
                      v_lngMatrix( v_lngText1Idx - 1, v_lngText2Idx - 1 ) + ABS( @v_pbytText1Char[ v_lngText1Idx - 1 ] <> @v_pbytText2Char[ v_lngText2Idx - 1 ] ) _
                    )
        NEXT v_lngText2Idx
      NEXT v_lngText1Idx
      FUNCTION = v_lngMatrix( v_lngText1Len, v_lngText2Len )
    END FUNCTION 
    Last edited by George Bleck; 11 Dec 2015, 02:36 PM.

    Leave a comment:


  • pedro c camargo
    replied
    Is it possible to get the actual changes (insertions, deleteions needed to make string A equal to string B?

    Thank you

    ------------------
    Pedro Camargo

    Leave a comment:


  • Wayne Diamond
    replied
    Paul, yes
    Heres more sample output, enclosed with "" to show space chars:
    Code:
    The Levenshtein Distance between "Test string" and " Test string" is:  1
    The Levenshtein Distance between " Test string" and "Test string" is:  1
    The Levenshtein Distance between "Test string" and "est string" is:  1
    The Levenshtein Distance between "Test string" and "Test strin" is:  1
    The Levenshtein Distance between "Test strin" and " Test string" is:  2

    ------------------

    Leave a comment:


  • Paul Dwyer
    replied
    What if you wanted to compare

    "Test String" &
    "est String"

    Is the difference only 1 ??



    ------------------
    Paul Dwyer
    Network Engineer
    Aussie in Tokyo

    Leave a comment:


  • Lance Edmonds
    replied
    You can replace:
    Code:
    d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    with:
    Code:
    d(i, j) = MIN%(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    and remove the redundant function. Also, using LONG integers in PB/CC and PB/DLL is much more efficient than integers - this also ensures that very long strings can be analysed.

    ------------------
    Lance
    PowerBASIC Support
    mailto:support@powerbasic.comsupport@powerbasic.com</A>

    Leave a comment:

Working...
X