Methods of string distances
1. Methods of string distances
You already got to know the most famous string distance, which was named after its inventor: the Levenshtein distance. The Soviet mathematician Vladimir Levenshtein first documented it in 1965. Since then, some people have come up with slight modifications to it and others have introduced entirely different distance calculation methods. Let's compare the most important ones.
2. Damerau-Levenshtein
One alternative to the classical Levenshtein distance is the so-called Damerau-Levenshtein distance. It differs from the original in one way: it allows one more edit operation - swapping adjacent letters. Consider the name "Rick Caplan" mistyped as "Rcik Caplan". As humans, we quickly spot that the user by accident hit the letter "c" first and the "i" second. With the regular edit distance, this takes 2 edits to correct - one deletion and one insertion. For the Damerau-Levenshtein distance, it is only one edit: one transposition. If the text input you work with is typed by humans, this is very helpful. If the text input is a scanned document, however, transpositions are not that important.
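A minimal sketch with the stringdist package illustrates the difference, assuming the mistyped form "Rcik Caplan" (the method abbreviations "lv" and "dl" are explained in a moment):

    library(stringdist)
    # Regular Levenshtein: undoing the swapped "ci" costs two edits
    stringdist("Rcik Caplan", "Rick Caplan", method = "lv")  # 2
    # Damerau-Levenshtein: the swap counts as a single transposition
    stringdist("Rcik Caplan", "Rick Caplan", method = "dl")  # 1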
3. Method abbreviations
When using the stringdist package, we can use the following abbreviations to choose between these distance calculation methods: the regular Levenshtein distance is "lv" and the Damerau-Levenshtein distance is "dl". The default method that the stringdist package uses is the so-called "optimal string alignment" distance, in short: "osa". It is equal to the Damerau-Levenshtein distance with one exception: once letters have been swapped, they must not be modified again. This very slight modification has proven to be helpful in some edge cases.
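One such edge case is the textbook pair "ca" and "abc"; here is a hedged sketch of how the two methods diverge there:

    library(stringdist)
    # Damerau-Levenshtein: "ca" -> "ac" (transposition) -> "abc" (insertion) = 2 edits
    stringdist("ca", "abc", method = "dl")   # 2
    # Optimal string alignment: the swapped letters may not be edited again, so 3 edits
    stringdist("ca", "abc", method = "osa")  # 3
    # "osa" is the package default
    stringdist("ca", "abc")                  # 3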
4. Q-Grams (or n-grams)
A second approach that is very different from the Levenshtein distance is the family of methods involving "qgrams". Qgrams are overlapping substrings of a certain length, but this sounds more complicated than it is. Let's look at an example: take the name of the capital of Hawaii, "Honolulu", and fragment it into its qgrams of length 2. The qgrams are all substrings of two consecutive letters that occur in the string, together with the number of their occurrences. At the end of the word, the letters "lu" appear twice, so in the qgrams "lu" gets a count of two; all the other 2-letter substrings occur once. This concept is sometimes also called "ngrams".
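As a small sketch of the idea in base R (the dedicated stringdist helper follows below), the 2-grams of "Honolulu" can be extracted and counted like this:

    word <- "Honolulu"
    # All overlapping substrings of length 2: "Ho", "on", "no", "ol", "lu", "ul", "lu"
    grams <- substring(word, 1:(nchar(word) - 1), 2:nchar(word))
    # Count the occurrences: "lu" appears twice, every other 2-gram once
    table(grams)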
5. Q-Grams (or n-grams)
So when we compare, for example, the two strings "Honolulu" and its mistyped version "Hanolulu", we have certain qgrams that appear in both strings and some that do not. In total there are 8 different qgrams: the four blue ones that we find in both of the words, plus "Ho", "on", "Ha" and "an", which we find only in one of the two. A very simple string distance calculation is the "qgram" method. It is just the sum of all qgrams that are not shared - the orange ones - so in our case: 4.
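A quick sketch of that count with stringdist; q = 2 sets the qgram length (the default is 1):

    library(stringdist)
    # "Ho", "on", "Ha" and "an" are the four 2-grams not shared by the two strings
    stringdist("Honolulu", "Hanolulu", method = "qgram", q = 2)  # 4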
6. Inspecting q-grams
If you want to have a look at these qgrams, you can run the function qgrams that is part of the stringdist package. When you pass multiple words to the function, it splits each of them into substrings of the length "q" that you define. The qgrams of all the words are used as column names in the resulting matrix; in the rows you'll find the occurrences of these substrings per word.
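For example, applied to our two spellings (the argument names "correct" and "mistyped" are just illustrative labels, and the exact column order of the printed matrix may differ):

    library(stringdist)
    # One row per input, one column per 2-gram that occurs in either word
    qgrams(correct = "Honolulu", mistyped = "Hanolulu", q = 2)
    # roughly:
    #          Ho on Ha an no ol lu ul
    # correct   1  1  0  0  1  1  2  1
    # mistyped  0  0  1  1  1  1  2  1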
7. Method abbreviations
You can calculate the qgram distance by passing "qgram" as the method. There are also more elaborate methods that work with these qgrams, for example the "jaccard" method. It divides the number of qgrams that are not shared by the total number of unique qgrams. So the Jaccard distance between "Honolulu" and "Hanolulu" is 0.5, as there are four qgrams that are not shared and 8 unique qgrams in total. A third method is "cosine". It treats the qgram counts of the two words as vectors in an n-dimensional space and calculates the cosine distance between the two vectors. You see, these methods yield very different values that cannot be compared to each other, nor to Levenshtein distance results. But they do all share one thing: the smaller the number, the higher the similarity between the strings.
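As a sketch, the three qgram-based methods applied to the same pair (the cosine value works out to 1 - 7/9, shown rounded in the comment):

    library(stringdist)
    stringdist("Honolulu", "Hanolulu", method = "qgram",   q = 2)  # 4
    stringdist("Honolulu", "Hanolulu", method = "jaccard", q = 2)  # 0.5
    stringdist("Honolulu", "Hanolulu", method = "cosine",  q = 2)  # about 0.222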
8. Let's practice!
I know, this is all still pretty abstract, but now comes the fun part: let's practice these methods on some real-world examples.