
Comparing strings

1. Comparing strings

Welcome to the final chapter of the course! In this chapter, we'll be focusing on string matching and record linkage.

2. Measuring distance between values

If we have two values, a 3 and a 10,

3. Measuring distance between values

we can measure the distance between them

4. Measuring distance between values

using subtraction. The distance between them is 10 minus 3, which is 7. Comparing numbers is easy, but how do we compare strings? Before we dive into record linkage, we need to learn about how to measure distance between strings.

5. Minimum edit distance

Edit distance is a way of measuring how different two strings are from each other, based on the four basic kinds of typos, which are inserting a character,

6. Minimum edit distance

deleting a character,

7. Minimum edit distance

substituting one character for another,

8. Minimum edit distance

and transposing, or swapping the positions of, two adjacent characters. Minimum edit distance is the smallest number of these edits you'd need to convert one string into another.

9. Edit distance = 1

Here are some examples. To turn "dog" into "dogs", we insert an "s".

10. Edit distance = 1

To turn "bath" into "bat", we remove the "h".

11. Edit distance = 1

To turn "cats" into "rats", we substitute an "r" for the "c".

12. Edit distance = 1

To turn "sing" into "sign", we swap the positions of the "n" and the "g". Since all of these pairs require only one edit, they all have an edit distance of 1.
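As a quick check, here's a minimal sketch that computes these four distances in R. It assumes the stringdist package is installed and uses its "dl" (Damerau-Levenshtein) method, which counts a swap as a single edit. The variable names are only for illustration.

```r
library(stringdist)

# The four example pairs: insertion, deletion, substitution, transposition
from <- c("dog", "bath", "cats", "sing")
to   <- c("dogs", "bat", "rats", "sign")

# stringdist() is vectorized, so each element of `from` is compared
# with the element of `to` in the same position
distances <- stringdist(from, to, method = "dl")
distances
```

Each pair should come back with a distance of 1.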

13. A more complex example

Let's look at a more complex example and calculate the edit distance between "baboon" and "typhoon".

14. A more complex example

First, we'll need to insert an "h".

15. A more complex example

Then, we'll need to substitute "t" for "b",

16. A more complex example

substitute "y" for "a",

17. A more complex example

and substitute "p" for "b". This gives us a total of 4 actions, or an edit distance of 4.
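To see how a computer can find the smallest number of edits without guessing, here's a rough sketch in base R that fills in a table of distances between every pair of prefixes. The function name edit_distance and its transpose argument are hypothetical, made up for this illustration. When transpose is TRUE it allows swaps of adjacent characters only, which is a simplified variant (often called optimal string alignment) rather than the full Damerau-Levenshtein algorithm.

```r
# Hypothetical helper: minimum edit distance via dynamic programming
edit_distance <- function(a, b, transpose = TRUE) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n  # turning a prefix of a into "" takes i deletions
  d[1, ] <- 0:m  # turning "" into a prefix of b takes j insertions

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      best <- min(
        d[i, j + 1] + 1L,  # delete x[i]
        d[i + 1, j] + 1L,  # insert y[j]
        d[i, j] + cost     # substitute, or keep a matching character
      )
      # Optionally count a swap of two adjacent characters as one edit
      if (transpose && i > 1 && j > 1 &&
          x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1L)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

edit_distance("baboon", "typhoon")                # 4
edit_distance("sing", "sign")                     # 1
edit_distance("sing", "sign", transpose = FALSE)  # 2
```

In practice you'd use a tested package rather than a hand-written function like this, but the table-filling idea is the same.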

18. Types of edit distance

There are multiple types of edit distance, each of which calculates the distance between strings a little differently. The Damerau-Levenshtein distance is what you just learned. The Levenshtein distance does not count a transposition as a single action; instead, a swap costs 2 actions, such as a deletion and an insertion. The Longest Common Subsequence, or LCS, distance considers only insertion and deletion as actions. There are also other, more complex ways of calculating string distance, such as Jaro-Winkler or Jaccard distance. Each method has relative advantages in different circumstances, but going into the details of when to use each one is outside the scope of this course. However, feel free to experiment with the different methods and use the one that gives you the best results.
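A small sketch can make the difference between these types concrete. Assuming the stringdist package is installed, the code below compares "sing" and "sign" with three of its methods: "dl" (Damerau-Levenshtein), "lv" (Levenshtein), and "lcs" (Longest Common Subsequence). The variable names are only for illustration.

```r
library(stringdist)

# One swap of "n" and "g" separates these two words
dl  <- stringdist("sing", "sign", method = "dl")   # swap counts as 1 action
lv  <- stringdist("sing", "sign", method = "lv")   # swap costs 2 actions
lcs <- stringdist("sing", "sign", method = "lcs")  # insertions and deletions only

c(dl = dl, lv = lv, lcs = lcs)
```

Here the Damerau-Levenshtein distance is 1, while the Levenshtein and LCS distances are both 2, because neither of them treats the swap as a single action.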

19. String distance in R

Let's go back to our baboon-typhoon example. To calculate edit distances in R, you can use the stringdist function from the stringdist package, passing it the two strings to compare and the method you want to use. In this case, we're using "dl", which stands for Damerau-Levenshtein.
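The call described here might look like the following sketch, assuming the stringdist package is installed. The variable name baboon_typhoon is only for illustration.

```r
library(stringdist)

# Damerau-Levenshtein distance between the two strings
baboon_typhoon <- stringdist("baboon", "typhoon", method = "dl")
baboon_typhoon
```

The result is 4, matching the one insertion and three substitutions we counted by hand.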

20. Other methods

Using the method argument, we can also calculate different types of distances that we've discussed. The output of Jaccard is on a scale of 0 to 1, where numbers closer to 0 indicate that the strings are more similar.
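As a sketch of how the method argument changes the result, the code below reuses the same two strings with "lcs" and "jaccard". It assumes the stringdist package is installed. For Jaccard, q = 1 is passed explicitly so that the strings are compared as sets of single characters; the variable names are only for illustration.

```r
library(stringdist)

# LCS distance: counts only insertions and deletions
lcs_dist <- stringdist("baboon", "typhoon", method = "lcs")

# Jaccard distance on single characters: 1 minus (shared / total distinct)
jaccard_dist <- stringdist("baboon", "typhoon", method = "jaccard", q = 1)

lcs_dist      # 7
jaccard_dist  # 0.75
```

The two words share 2 distinct letters ("o" and "n") out of 8 distinct letters overall, which gives a Jaccard distance of 1 - 2/8 = 0.75.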

21. Comparing strings to clean data

In chapter 2, you collapsed multiple categories into one using manually defined categories. But if there are too many variations to type out, we can use string distance to map them to the correct category.

22. Comparing strings to clean data

Here's a survey where participants in New York, Chicago, Los Angeles, and Seattle were asked where they currently live, and how likely they are to consider moving away on a scale of 1 to 5. The survey had free text entry, so the city column is riddled with typos. To map them to the correct spelling, we can compare the distance between each survey response and the set of possible answers, and choose the one that's closest.
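The idea of choosing the closest answer can be sketched by hand before reaching for a join. The responses below are made-up examples, not the actual survey data, and the code assumes the stringdist package is installed.

```r
library(stringdist)

# Hypothetical survey responses and the set of valid answers
responses <- c("chicgo", "los angles", "nw york", "seattle")
cities    <- c("new york", "chicago", "los angeles", "seattle")

# One row per response, one column per city
dist_matrix <- stringdistmatrix(responses, cities, method = "dl")

# For each response, pick the city with the smallest distance
closest <- cities[apply(dist_matrix, 1, which.min)]

data.frame(response = responses, closest = closest)
```

Note that this simple version always picks some city, even for a response that isn't close to any of them, so it has no way of flagging answers that can't be matched.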

23. Remapping using string distance

The fuzzyjoin package allows us to do joins based on string distance. We can use stringdist_left_join to join the survey data to cities. Just like the stringdist function, we can pass in the method we want to use for string distance.
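Here's a minimal sketch of such a join, assuming the fuzzyjoin package is installed. The survey and cities data frames are small made-up stand-ins for the real data, and the column names city and move_score are assumptions.

```r
library(fuzzyjoin)

# Hypothetical survey responses with typos in the city column
survey <- data.frame(
  city = c("chicgo", "los angles", "seattle"),
  move_score = c(4, 2, 3),
  stringsAsFactors = FALSE
)

# The correctly spelled cities to map onto
cities <- data.frame(
  city = c("new york", "chicago", "los angeles", "seattle"),
  stringsAsFactors = FALSE
)

# Join each response to the cities that are close enough in spelling
joined <- stringdist_left_join(survey, cities, by = "city", method = "dl")
joined
```

Because both data frames have a city column, the result has city.x (the original response) and city.y (the matched spelling).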

24. Remapping using string distance

We can also use the max_dist argument to adjust how close we want the strings to be in order to consider them a match. Notice how we get an NA, since the typo in row 8 wasn't close enough to any of the cities to be assigned to one.
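To illustrate max_dist, here's a sketch with made-up data, assuming the fuzzyjoin package is installed. The response "nw yrk" is two insertions away from "new york", so with max_dist = 1 it isn't matched to any city.

```r
library(fuzzyjoin)

# Hypothetical responses: the second one is missing two letters
survey <- data.frame(
  city = c("chicgo", "nw yrk"),
  move_score = c(4, 1),
  stringsAsFactors = FALSE
)

cities <- data.frame(
  city = c("new york", "chicago", "los angeles", "seattle"),
  stringsAsFactors = FALSE
)

# Only accept matches that are at most 1 edit away
strict <- stringdist_left_join(
  survey, cities,
  by = "city", method = "dl", max_dist = 1
)
strict
```

The row for "chicgo" is matched to "chicago", while the row for "nw yrk" keeps its survey values and gets NA in city.y.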

25. Let's practice!

It's time to take a swing at comparing strings!