Levenshtein Distance
Computing edit distance between two strings with dynamic programming — insertion, deletion, and substitution.
The levenshtein distance between 2 strings represent the minimal differences that must be applied to the first string to get it to be equivalent to the seconds string.
The 3 operations that be applied are (1) Insertion, (2) Deletion, (3) Swap. An insertion and deletion can be at any point in the string and a swap is just changing an existing letter. There is a recursive equation that is used in a dynamic programming fashion to determines the levenshtein distance of 2 strings.
Let’s breakdown this equation and how it works.
- The variables and are the inclusive terminating indices of a string. You must consider these to be 1-indexed where the first character has an index of 1. If it is not clear why we are doing this, I think it will make more sense when we do an example as the reason is highly motivated by the recursive nature of levenshtein distance.
- The can be considered the base case. The condition necessary for this is that the minimal index of either or is 0; or in otherwords, comparing 2 strings where at least 1 string is the empty string. In the case that one string is an empty string, the levenshtein distance will just be the length of the other string. This also holds true when both strings are empty string.
- The other base case is when the character is the same, . In this case, the levenshtein distance does not increase.
- If the condition in (2) was not met, we are then given the choice of 3 additional functions, where we will take the minimum of the 3.
- is insertion/deletion
- is insertion/deletion w.r.t the other string
- is a swap
Example
The classical example for levenshtein distance is finding the distance between “kitten” and “sitting”. There are several good examples laid out online. I will go over the example using “france” and “strange”.
Note that # represents an empty string
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 #
1 f
2 r
3 a
4 n
5 c
6 e
- Lets compute the base cases. The base cases include the first column and the first row, where . Consider and . The minimum length is and the max is . The intuition behind this is because in order to transform an empty string
""to"str"is 3 insertions ofs,t, andr.
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 # 0 1 2 3 4 5 6 7
1 f 1
2 r 2
3 a 3
4 n 4
5 c 5
6 e 6
- Next we must apply the recursive case to the subsequent missing spots.
Consider , . Since neither index is and the characters at those indices don’t match, we are left with the following.
- represents a table lookup at where we store the number of moves necessary to get the substring with terminating indices of and to match. We then add 1 to represent an insertion.
- represents a table lookup at where again, the necessary moves are stored. This time the insertion can be thought of with respect to the other substring.
- , in the case where the two characters are different, represents swapping out a letter to be another. This can be with respect to either word.
We will then repeat this process for all the missing entries.
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 # 0 1 2 3 4 5 6 7
1 f 1 1 2 3 4 5 6 7
2 r 2 1 2 2 3 4 5 6
3 a 3 2 2 3 2 3 4 5
4 n 4 3 3 3 3 2 3 4
5 c 5 4 4 4 4 3 3 4
6 e 6 5 5 5 5 5 5 3
The levenshtein distance will be the number in the bottom right of the table. In this case its 3. We can confirm this by inspection: (1) swap f and s, (2) insert t after the first swap then (3) swap c for g.
Implementation
def lev(s1, s2):
m = len(s1)
n = len(s2)
opt = [[-1 for _ in range(m + 1)] for _ in range(n + 1)]
#initialize base cases
for i in range(m + 1):
opt[0][i] = i
for j in range(n + 1):
opt[j][0] = j
# fill in table with recursive case
for i in range(1,n+1):
for j in range(1,m+1):
# offset the index for 1-indexness of the table iteration
if s1[j-1] == s2[i-1]:
opt[i][j] = opt[i-1][j-1]
else:
minimum = min(opt[i-1][j], opt[i][j-1], opt[i-1][j-1])
opt[i][j] = minimum + 1
return opt[n][m]
This happens to be a leetcode problem.