1. Linking DataFrames
Awesome work on the first 2 lessons! You've made it to the last lesson of this course!
2. Record linkage
At this point, you've generated your pairs, compared them, and scored them.
3. Record linkage
Now it's time to link your data!
4. Our DataFrames
Remember our census DataFrames from the video of the previous lesson?
5. What we've already done
We've already generated pairs between them, compared four of their columns, two for exact matches and two for string similarity alongside a 0.85 threshold, and found potential matches.
6. What we're doing now
Now it's time to link both census DataFrames.
7. Our potential matches
Let's look closely at our potential matches. It is a multi-index DataFrame, where we have two index columns, record id 1, and record id 2.
8. Our potential matches
The first index column, stores indices from census A.
9. Our potential matches
The second index column, stores all possible indices from census_B, for each row index of census_A.
10. Our potential matches
The columns of our potential matches are the columns we chose to link both DataFrames on, where the value is 1 for a match, and 0 otherwise.
11. Probable matches
The first step in linking DataFrames, is to isolate the potentially matching pairs to the ones we're pretty sure of.
We saw how to do this in the previous lesson, by subsetting the rows where the row sum is above a certain number of columns, in this case 3.
The output is row indices between census A and census B that are most likely duplicates.
Our next step is to extract the one of the index columns, and subsetting its associated DataFrame to filter for duplicates.
12. Probable matches
Here we choose the second index column, which represents row indices of census B. We want to extract those indices, and subset census_B on them to remove duplicates with census_A before appending them together.
13. Get the indices
We can access a DataFrame's index using the index attribute. Since this is a multi index DataFrame, it returns a multi index object containing pairs of row indices from census_A and census_B respectively.
We want to extract all census_B indices, so we chain it with the get_level_values method, which takes in which column index we want to extract its values. We can either input the index column's name, or its order, which is in this case 1.
14. Linking DataFrames
To find the duplicates in census B, we simply subset on all indices of census_B, with the ones found through record linkage.
You can choose to examine them further for similarity with their duplicates in census_A, but if you're sure of your analysis, you can go ahead and find the non duplicates by repeating the exact same line of code, except by adding a tilde at the beginning of your subset.
Now that you have your non duplicates, all you need is a simple append using the DataFrame append method of census A, and you have your linked Data!
15. Linking DataFrames
To recap, what we did was build on top of our previous work in generating pairs, comparing across columns and finding potential matches.
We then isolated all possible matches, where there are matches across 3 columns or more, ensuring we tightened our search for duplicates across both DataFrames before we link them.
Extracted the row indices of census_B where there are duplicates.
Found rows of census_B where they are not duplicated with census_A by using the tilde symbol.
And linked both DataFrames for full census results!
16. Let's practice!
Now that you know how to link DataFrames, let's put those skills to action!