Custom Fuzzy Matching

1. Custom Fuzzy Matching

With the stringdist join we were able to join two tables on one text column, but what if our text column contains entries that are very similar to each other? In this lesson we are going to learn to fuzzily match two data frames based on multiple conditions, using our own custom made matching functions.

2. Combining two fuzzy matches

Often, we have more than one field in our datasets that we want to merge our data on. As strings are quite hard to match, if available, it's always a good idea to also take numbers into account. And that's exactly what we're going to do with our two lists of movie titles here. In both tables we have a title and a year. But they do not align perfectly - neither the titles nor the years.

3. Combining two fuzzy matches

Our goal is to have our list a with the external IDs from our list b. We will try to achieve this by matching the titles based on their string distances and the years by their closeness to each other. In the previous lesson, we've learned about the concept of fuzzy joins. Now we are going to apply this concept to not just one column in tables a and b, but on multiple columns.

4. Fuzzy matches: Helper functions

For the fuzzy join that we want to do, we need two helper functions. The first function will check whether the strings passed to it are alike. It does that using the stringdist function, that we've used before. It returns true if the two strings have a distance smaller or equal to five. It will get passed two columns with movie titles, we'll call them left and right. The second function will be passed the two year columns. Here we write a simple function that checks whether the numbers in the left column are equal to the ones in the right column or have a difference smaller than three years.

5. The fuzzy join

So now we are ready: we have the two dataset a and b, as well as our helper functions ready and can now do the fuzzy join. We call the fuzzy left join function from the fuzzy join package. As arguments we pass it our two datasets. As with a regular dplyr join we call it with a vector of two members to define which columns we want to match against each other. But with the fuzzy join we need to define one more vector: the matching functions, called match_fun here. Here we pass a vector of the same length as the by vector. In it we pass our two helper functions.

6. The fuzzy join: The result

When we execute the function we get a new, joined dataset: It contains all the columns from our datasets a plus the values from our dataset b for the rows that were successfully matched. We see that neither the titles nor the years are an exact match - but that's exactly the behaviour we wanted to achieve. We can now assign our original dataset a the external IDs from dataset b. We also recognize, that not all of the rows could be matched. Some of the string distances where larger than five, so the first helper function returned false in these cases. In this case, we would have to fine tune our helper function a bit more, until we manage to match all the cases. How you do this fine tuning will depend on the data that you are working with.

7. Let's practice!

Alright, you're almost done, this is the last chapter of the course. Let's put your newly gained knowledge to practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.