Understanding string distances

1. Understanding string distances

In the previous chapter, you learned how to extract data from text into a tabular form. In this chapter, we'll slightly shift our focus away from regular expressions to a new topic that will also help you work with text input, but in a different way: string distances.

2. What is a string distance?

A string distance is an integer number that indicates how different two strings are from each other. Let's look a simple example: How alike are the words rain and run? One of the most widely used string distances, is the Levenshtein edit distance. The concept is: For every edit that you need to do to convert the first word into the second, the edit distance increases by one. So to convert "rain" into "run" we first remove the letter "i" and then substitute the letter "a" with "u". It took us two edits to convert "rain" into "run", so the edit distance of the two words is two.

3. What is a string distance?

Let's have a look at a second example: Let's calculate the edit distance between the words "sunday" and "saturday". In the first step, we replace the letter "n" with "r". Then we add the letter "a" and the letter "t". These were three edits, so the Levenshtein distance between "sunday" and "saturday" is three.

4. Real world applications

You might wonder: What is this good for? Well, imagine you have to work with user input. A person has to enter their name in a form field. Later, you match the name that was entered with your existing user database. So for example when the person "Emilie Brown" enters her name in the form, but misspells the first name and forgets the letter "i". The string distance to her correct name is one. If we search our user database for a perfect match, we won't be able to match the correct entry, as there probably is no "Emile Brown" in our table. But if we calculate the edit distance between the user input and all our user names and take the one with the smallest distance, we are now able to match the user input and our user database correctly.

5. Maybe the most famous example

Or, maybe the most famous application of string similarities: auto correct. When you mistype a word in a Google search, Google will ask you whether you really meant what you typed and offers an alternative version of the word. It could very well be, that the engine internally flagged the word as unknown and calculated the string distances to every entry of a dictionary to come up with this recommendation.

6. String distances in R

To calculate string distances in R, we will work with the package "stringdist". It offers multiple methods to calculate string distances - we will have a closer look at them in the next lesson. For now, we will specify the method as Levenshtein distance, in short: "lv". When we call the stringdist function with the words "saturday" and "sunday" it returns us the number 3 - as expected. This is the edit distance. Note that for the Levenshtein distance it does not matter which of the words we pass first. The number of edits will always be the same.

7. Finding a match

The stringdist package also contains the function "amatch", that tells us if a search input "x" can be found in a vector "table". This is useful when we want to search a list for a given input, that contains errors. Imagine we have the user input "Sonday" and want to find out, which day of the week the user might have meant. We specify the maximum edit distance for the search as 1. When we run this code, the return value will be 3, as the matched item is at the position 3 of the vector that we searched in. One detail: "amatch" always returns one single number. If multiple entries have the same edit distance to the search term, it will simply return the index of the first.

8. Let's practice!

Alright, let's practice.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.