Linking them together!
In the last lesson, you finished the bulk of the work on linking restaurants and restaurants_new. You generated the pairs of potentially matching rows, searched for exact matches between the cuisine_type and city columns, and compared for similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.
Now it's finally time to link both DataFrames. You will first extract from potential_matches the row indices of restaurants_new that match across all the columns mentioned above. Then you will subset restaurants_new to the rows that are not among these indices, and finally concatenate those non-duplicate rows with restaurants. All DataFrames are in your environment, alongside pandas imported as pd.
This exercise is part of the course
Cleaning Data in Python
Exercise instructions
- Isolate instances of potential_matches where the row sum is above or equal to 3 by using the .sum() method.
- Extract the second column index from matches, which represents row indices of matching records from restaurants_new, by using the .get_level_values() method.
- Subset restaurants_new for rows that are not in matching_indices.
- Concatenate restaurants and non_dup.
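To see why the second index level is the one to extract, here is a minimal sketch on toy data. The names toy_scores, toy_matches, and toy_matching_indices, and all the values, are made up for illustration and are not the exercise's data. It assumes, as the instructions state, that the scores DataFrame is indexed by pairs whose second level holds the row labels of the second DataFrame.

```python
import pandas as pd

# Hypothetical comparison scores. Each index entry pairs a row label from
# the first DataFrame (level 0) with a row label from the second (level 1).
pairs = pd.MultiIndex.from_tuples([(0, 10), (1, 11), (2, 12)])
toy_scores = pd.DataFrame(
    {
        "city": [1, 1, 0],
        "cuisine_type": [1, 1, 1],
        "rest_name": [1.0, 0.0, 1.0],
    },
    index=pairs,
)

# Keep only the pairs that agree on all three columns (row sum of at least 3)
toy_matches = toy_scores[toy_scores.sum(axis=1) >= 3]

# Level 1 of the index holds the row labels from the second DataFrame
toy_matching_indices = toy_matches.index.get_level_values(1)
print(list(toy_matching_indices))
```

Only the pair (0, 10) scores 3 here, so the only label extracted from the second level is 10.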
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Isolate potential matches with row sum >=3
matches = ____[____.___(____) >= ____]
# Get values of second column index of matches
matching_indices = matches.____.____(____)
# Subset restaurants_new based on non-duplicate values
non_dup = ____[~restaurants_new.index.____(____)]
# Concatenate restaurants and non_dup
full_restaurants = pd.____([____, ____])
print(full_restaurants)
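The last two steps can be illustrated separately on toy data. This is a sketch under stated assumptions, not the exercise solution: the DataFrames toy_restaurants and toy_restaurants_new, the restaurant names, and the matched label 10 are all hypothetical.

```python
import pandas as pd

# Hypothetical existing and incoming records with distinct row labels
toy_restaurants = pd.DataFrame(
    {"rest_name": ["cafe a", "cafe b"], "city": ["x", "y"]},
    index=[0, 1],
)
toy_restaurants_new = pd.DataFrame(
    {"rest_name": ["cafe a.", "cafe c"], "city": ["x", "z"]},
    index=[10, 11],
)

# Assume row 10 of the new data was flagged as a duplicate of an existing row
toy_matching_indices = pd.Index([10])

# ~ inverts the boolean mask, keeping rows whose label is NOT a duplicate
toy_non_dup = toy_restaurants_new[
    ~toy_restaurants_new.index.isin(toy_matching_indices)
]

# Stack the existing rows and the non-duplicate new rows
toy_full = pd.concat([toy_restaurants, toy_non_dup])
print(toy_full)
```

pd.concat keeps the original row labels by default, so the result here has labels 0, 1, and 11; passing ignore_index=True would renumber them instead.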