1. Generating and comparing pairs
Now that we know how to compare strings to each other, let's talk about record linkage.
2. When joins won't work
Here, we have two tables showing basketball game schedules from different television networks. If we want to get a full list of the basketball games being televised, we need to combine these two tables.
3. When joins won't work
However, some of the games in the second table are duplicates of games already listed in the first table. Since there's no consistent identifier between the two tables, a regular join won't work. This is where record linkage comes in.
4. What is record linkage?
Record linkage involves linking data together that comes from multiple sources that don't share a common identifier, but contain data on the same entity.
5. What is record linkage?
Generally, we start by cleaning the different datasets.
6. What is record linkage?
Then, we find pairs of records, or rows, that we want to compare.
7. What is record linkage?
We compare those records to each other,
8. What is record linkage?
and score them based on how similar they are.
9. What is record linkage?
Finally, we link the most similar pairs together.
Our string comparison skills will come in handy for the third step, which is comparing the pairs.
10. What is record linkage?
But before we can do this, we need to generate pairs.
11. Pairs of records
Here, we have two data frames, A and B, and we want to use record linkage to identify if any of the records in Table A are referring to the same person as any records in Table B.
12. Generating pairs
In order to figure out whether any of the rows are matches, we'll need to compare every single row in table A with every single row in table B.
13. Generating pairs in R
To generate pairs of rows in R, we can use the reclin package. The pair_blocking function takes in our two data frames and returns an object that contains every possible pair of records.
14. Too many pairs
But imagine if we had a thousand rows in each table. That means we'd have a million possible pairs of records, so this approach isn't scalable.
15. Blocking
To solve this problem, what if we only choose pairs that have a matching state? This drastically reduces the number of possible pairs - now we only have 8 instead of 25.
This technique is called blocking: we only consider pairs when they agree on the blocking variable. In this case, the blocking variable is state.
16. Pair blocking in R
To use blocking, we can use the same pair_blocking function, but this time use the blocking_var argument, which takes in a column name as a string. Just as we expected, we get 8 pairs instead of 25.
17. Comparing pairs
Now that we have our pairs, we'll need to compare them to see which pairs are the most similar.
18. Comparing pairs
We can do this by piping the output of pair_blocking to the compare_pairs function. We use the by argument to indicate that the name column of each data frame should be compared, and the default_comparator argument to indicate how strings should be compared. Here, we're using the longest common subsequence method. Notice that there's a set of parentheses after lcs since we're calling a function.
The output shows a table with each pair and the corresponding score between the name of each pair.
19. Comparing multiple columns
We can also compare multiple columns of the data frames by passing a vector of columns to the by argument.
Now, we get a score for the name column and the zip column.
20. Different comparators
We can also use different methods to compare strings, such as Jaccard or Jaro-Winkler.
21. Let's practice!
It's time to compare and generate some pairs!