1. Comparing strings
Awesome work on chapter 3! Welcome to the final chapter of this course,
2. In this chapter
where we'll discover the world of record linkage. But before we dive deep into record linkage, let's sharpen our understanding of string similarity and minimum edit distance.
3. Minimum edit distance
Minimum edit distance is a systematic way to identify how close two strings are. For example, let's take a look at the following two words: intention and execution.
The minimum edit distance between them is the least possible number of steps that could get us from the word intention to execution, with the available operations being
4. Minimum edit distance
inserting new characters, deleting them, substituting them, and transposing consecutive characters.
5. Minimum edit distance
To get from intention to execution,
6. Minimum edit distance
we first start off by deleting I from intention and adding C between E and N. Our minimum edit distance so far is 2, since these are two operations.
7. Minimum edit distance
Then we substitute the first N with E, T with X, and N with U, leading us to execution, with a total minimum edit distance of 5.
8. Minimum edit distance
The lower the edit distance, the closer two words are. For example, two different typos of the word reading each have a minimum edit distance of 1 from reading.
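As a quick check, here's a minimal sketch of computing these distances in Python, assuming the python-Levenshtein package is available and using a hypothetical substitution typo of reading:

    import Levenshtein

    # Five operations separate intention from execution
    Levenshtein.distance('intention', 'execution')  # returns 5

    # A one-character substitution typo is a single operation away
    Levenshtein.distance('reading', 'reeding')  # returns 1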
9. Minimum edit distance algorithms
There's a variety of algorithms based on edit distance that differ in which operations they use, how much weight is attributed to each operation, which types of strings they're suited for, and more, with a variety of packages implementing each of them.
10. Minimum edit distance algorithms
For this lesson, we'll be comparing strings with Levenshtein distance, since it's the most general form of string matching, using the thefuzz package.
11. Simple string comparison
thefuzz is a package to perform string comparison.
We first import fuzz from thefuzz, which allows us to compare single strings.
Here we use fuzz's WRatio function to compute the similarity between reading and its typo, inputting each string as an argument.
For any comparison function using thefuzz, our output is a score from 0 to 100, with 0 being not similar at all and 100 being an exact match.
Do not confuse this with the minimum edit distance score from earlier, where a lower minimum edit distance means a closer match.
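As a minimal sketch, the comparison looks like this, with a hypothetical typo string and an illustrative score:

    from thefuzz import fuzz

    # WRatio returns a score from 0 (not similar at all)
    # to 100 (an exact match)
    fuzz.WRatio('Reeding', 'Reading')  # a high score, e.g. around 86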
12. Partial strings and different orderings
The WRatio function is highly robust to partial strings with different orderings. For example, here we compare the strings Houston Rockets and Rockets, and still receive a high similarity score.
The same can be said for the strings Houston Rockets vs Los Angeles Lakers and Lakers vs Rockets, where the team names are only partial and differently ordered.
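In code, these comparisons would look roughly like this, with the exact scores depending on your version of thefuzz:

    from thefuzz import fuzz

    # Partial string comparison
    fuzz.WRatio('Houston Rockets', 'Rockets')  # a high score

    # Partial strings with different orderings
    fuzz.WRatio('Houston Rockets vs Los Angeles Lakers',
                'Lakers vs Rockets')  # still a high score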
13. Comparison with arrays
We can also compare a string with an array of strings by using the extract function from thefuzz's process module. extract takes in a string, an array of strings, and the number of possible matches to return, ranked from highest to lowest.
It returns a list of tuples with 3 elements: the first one being the matching string returned, the second one being its similarity score, and the third one being its index in the array.
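Here's a minimal sketch, using hypothetical choices stored in a pandas Series so that each returned tuple carries the row index as its third element:

    from thefuzz import process
    import pandas as pd

    string = 'Houston Rockets vs Los Angeles Lakers'
    choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                         'Houston vs Los Angeles', 'Heat vs Bulls'])

    # Return the top 2 matches, ranked from highest to lowest score
    process.extract(string, choices, limit=2)
    # e.g. [('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]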
14. Collapsing categories with string similarity
In chapter 2, we learned that collapsing data into categories is an essential aspect of working with categorical and text data, and we saw how to manually replace categories in a column of a DataFrame.
But what if we had so many inconsistent categories that manual replacement is simply not feasible? We can handle that with string similarity!
15. Collapsing categories with string matching
Say we have a DataFrame named survey containing answers from respondents in the states of New York and California, asking how likely they are to move on a scale of 0 to 5.
The state field was free text, so it contains hundreds of typos. Remapping them manually would take a huge amount of time. Instead, we'll use string similarity.
We also have a categories DataFrame containing the correct categories for each state. Let's collapse the incorrect categories with string matching!
16. Collapsing all of the states
We first create a for loop iterating over each correctly typed state in the categories DataFrame.
For each state, we find its matches in the state column of the survey DataFrame, returning all possible matches by setting the limit argument of extract to the length of the survey DataFrame.
Then we iterate over each potential match, isolating only the ones with a similarity score higher than or equal to 80 with an if statement.
Then for each of those returned strings, we replace them with the correct state using the loc method.
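Putting these steps together, here's a sketch of the full loop, assuming both DataFrames store the state names in a column named state:

    from thefuzz import process

    # Iterate over each correctly typed state in the categories DataFrame
    for state in categories['state']:
        # Find all potential matches in the survey's state column
        matches = process.extract(state, survey['state'],
                                  limit=len(survey['state']))
        # Iterate over each potential match returned by extract
        for potential_match in matches:
            # Isolate matches with a similarity score of 80 or higher
            if potential_match[1] >= 80:
                # Replace the typo with the correct state
                survey.loc[survey['state'] == potential_match[0], 'state'] = state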
17. Record linkage
Record linkage uses string similarity to join data sources that contain fuzzy duplicate values, so that we end up with a final DataFrame with no duplicates. We'll cover record linkage in more detail in the next couple of lessons.
18. Let's practice!
But for now, let's clean some data using string similarity!