1. Scoring and linking
Now that we've compared our pairs, it's time to score them and link the data together.
2. Last lesson
In the last lesson, we worked with these two data frames, df_A and df_B,
3. Where we left off
and learned how to create and compare pairs of records. Remember that the x and y columns contain the row numbers of each pair: x is the row in df_A and y is the row in df_B. However, each comparison result sits in its own column, so we'll need to combine them into a single score per pair.
4. Scoring pairs
That's where scoring comes in.
5. Scoring with sums
One way that we could combine the separate scores is by adding them together.
6. Summing
We can do this using score_simsum, which will create a new column called simsum that holds the total score for each row.
7. Summing
We can see that the highest score is between row 2 in df_A and row 3 in df_B, which both refer to someone named Keaton Snyder.
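To make the idea of summing concrete, here is a minimal sketch in plain Python rather than the course's own functions. The pairs, the compared columns (name and sex), and every score value are made-up assumptions; only the x, y, and simsum column names come from the lesson.

```python
# Hypothetical comparison results: one dict per candidate pair.
# x and y are row numbers in df_A and df_B; all score values are invented.
pairs = [
    {"x": 1, "y": 1, "name": 0.35, "sex": 1.0},
    {"x": 2, "y": 3, "name": 0.88, "sex": 1.0},
    {"x": 3, "y": 2, "name": 0.20, "sex": 0.0},
]

# Columns that hold comparison scores (assumed for this illustration).
COMPARED = ("name", "sex")


def add_simsum(rows, columns=COMPARED):
    """Return copies of the rows with a 'simsum' key: the sum of the scores."""
    return [{**row, "simsum": sum(row[c] for c in columns)} for row in rows]


scored = add_simsum(pairs)

# The pair with the largest total is the strongest candidate match.
best = max(scored, key=lambda row: row["simsum"])
print(best["x"], best["y"], round(best["simsum"], 2))
```

Every column contributes equally to the total here, which is exactly the limitation of summing.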
8. Disadvantages of summing
However, summing doesn't account for the fact that having a very similar name is a stronger indicator that the records refer to the same person, while having the same sex doesn't tell us as much.
Instead of summing, we can use a probabilistic way of scoring that accounts for these differences between variables.
9. Scoring probabilistically
We can use the score_problink function, which gives us a weight for each row. The higher the weight, the more similar the pair is.
The highest weight is again between Keaton Z Snyder and Keaton Snyder.
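One common way to weight variables differently is to give each one a log-likelihood weight. The plain-Python sketch below illustrates that idea and is not a description of how score_problink works internally. The probabilities, the 0.8 agreement cutoff, and the pair scores are all hand-picked assumptions; in practice such values are typically estimated from the data.

```python
import math

# Assumed probabilities for each variable (invented for illustration):
#   m = chance the variable agrees when two records are the same person
#   u = chance the variable agrees when they are different people
PROBS = {
    "name": {"m": 0.90, "u": 0.01},  # names rarely agree by accident
    "sex": {"m": 0.95, "u": 0.50},   # sex agrees by accident half the time
}

AGREE_CUTOFF = 0.8  # assumed: similarity at or above this counts as agreement


def variable_weight(variable, similarity):
    """Log-likelihood weight contributed by one compared variable."""
    m, u = PROBS[variable]["m"], PROBS[variable]["u"]
    if similarity >= AGREE_CUTOFF:
        return math.log(m / u)
    return math.log((1 - m) / (1 - u))


def pair_weight(pair):
    """Total weight of a pair: the sum of its per-variable weights."""
    return sum(variable_weight(v, pair[v]) for v in PROBS)


# Hypothetical comparison results (x, y are row numbers in df_A and df_B).
pairs = [
    {"x": 1, "y": 1, "name": 0.35, "sex": 1.0},
    {"x": 2, "y": 3, "name": 0.88, "sex": 1.0},
    {"x": 3, "y": 2, "name": 0.20, "sex": 0.0},
]

weights = {(p["x"], p["y"]): pair_weight(p) for p in pairs}
best_pair = max(weights, key=weights.get)
print(best_pair, round(weights[best_pair], 2))
```

With these assumed probabilities, agreeing on name adds far more to the weight than agreeing on sex, which is the behavior that plain summing could not give us.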
10. Linking pairs
Now that we've scored each pair, how do we pick which ones are matches?
11. Selecting matches
We can select the pairs that we consider matches using select_n_to_m. This will select the matches with the highest scores, ensuring that each record in one data frame is linked to at most one record in the other data frame.
Here, the only pair considered a match is row 2 of df_A and row 3 of df_B.
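The one-to-one constraint can be sketched with a simple greedy pass in plain Python. This is a simplification for illustration and is not claimed to be how select_n_to_m chooses its matches; a method that optimizes the total score over all pairs could pick differently. The weights, the extra (2, 1) pair, and the threshold of 0 are assumptions.

```python
def select_one_to_one(scored_pairs, threshold=0.0):
    """Greedily keep the best-scoring pairs, using each row at most once.

    scored_pairs: iterable of (x, y, weight) tuples.
    Pairs with a weight at or below the threshold are never selected.
    """
    chosen, used_x, used_y = [], set(), set()
    for x, y, weight in sorted(scored_pairs, key=lambda t: t[2], reverse=True):
        if weight <= threshold:
            break  # all remaining pairs score too low to count as matches
        if x in used_x or y in used_y:
            continue  # one of these rows is already linked
        chosen.append((x, y))
        used_x.add(x)
        used_y.add(y)
    return chosen


# Hypothetical scored pairs: (row in df_A, row in df_B, weight).
scored_pairs = [
    (1, 1, -1.65),
    (2, 3, 5.14),
    (3, 2, -4.60),
    (2, 1, 0.40),  # positive weight, but row 2 of df_A is already taken
]

matches = select_one_to_one(scored_pairs)
print(matches)
```

The pair (2, 1) clears the threshold but is skipped, because row 2 of df_A has already been linked to row 3 of df_B.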
12. Linking the data
Now that we've selected which pairs are matches, we can finally link the two data frames together using the link function.
13. Linked data
The left side holds the data from df_A, and the right side holds the data from df_B. The first row has data for both sides, since that's the match we found, while the rest of the people were found only in one of the two data frames.
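The shape of the linked result can be sketched as a full outer join in plain Python. The helper name link_frames is hypothetical, the placeholder names other than Keaton's are invented, and which data frame holds which spelling of his name is an assumption.

```python
def link_frames(left, right, matches):
    """Combine two row-number -> record mappings into (left, right) rows.

    Matched pairs come first; unmatched records follow with None on the
    side where no counterpart was found.
    """
    matched_left = {x for x, _ in matches}
    matched_right = {y for _, y in matches}

    rows = [(left[x], right[y]) for x, y in matches]
    rows += [(left[x], None) for x in sorted(left) if x not in matched_left]
    rows += [(None, right[y]) for y in sorted(right) if y not in matched_right]
    return rows


# Hypothetical data frames, keyed by row number.
df_A = {1: "Person A1", 2: "Keaton Z Snyder", 3: "Person A3"}
df_B = {1: "Person B1", 2: "Person B2", 3: "Keaton Snyder"}

linked = link_frames(df_A, df_B, matches=[(2, 3)])
for left_record, right_record in linked:
    print(left_record, "|", right_record)
```

The first row carries data on both sides, and every other row is filled on one side only, mirroring the linked data described above.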
14. Let's practice!
Now that you've learned how to link data, let's practice!