Performing a string distance join
Bringing together two different data sources is a very common task in data analysis. Whenever possible, you should join two tables on a clearly identifiable value such as an email address. But what if a user only entered their name and you have to look it up in a user database? The difficulty: people might abbreviate their first or last name, mistype something, or leave out parts entirely.
In your workspace there are two data frames: user_input and database. The first contains the flawed user input and the second the correct names, but both data sources contain the same 100 names. How many of them can you match with a string distance join? By the way: no distance method is defined, so the default, Optimal String Alignment distance ("osa"), will be used.
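To get a feel for what the "osa" distance measures, the following is a minimal base R sketch of Optimal String Alignment: it counts insertions, deletions, substitutions, and swaps of two adjacent characters, each at a cost of 1. The function name osa_distance is hypothetical and only for illustration; in the exercise the distance is computed for you by the join itself.

```r
# Minimal sketch of the Optimal String Alignment (OSA) distance.
# osa_distance() is a hypothetical helper, not part of any package.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,   # deletion
        d[i + 1, j] + 1L,   # insertion
        d[i, j] + cost      # match or substitution
      )
      # Swap of two adjacent characters counts as a single edit
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

osa_distance("jonh", "john")                          # 1: "n" and "h" are swapped
osa_distance(tolower("Jon Smith"), "john smith")      # 1: one missing "h"
osa_distance("kitten", "sitting")                     # 3
```

A swapped pair of letters costs 1 here, whereas a plain Levenshtein distance would count it as 2. That is why a typo like "jonh" still matches at a small maximum distance.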
This exercise is part of the course Intermediate Regular Expressions in R.
Exercise instructions
- Join user_input and database with a maximum string distance max_dist so that exactly eighty names are matched successfully. Experiment until you find the right maximum distance.
- Use the newly created table joined to print a human-friendly report sentence.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the packages that provide stringdist_join() and glue()
library(fuzzyjoin)
library(glue)

# Join the data frames on a maximum string distance (start at 2, then adjust)
joined <- stringdist_join(
  user_input,
  database,
  by = c("user_input" = "name"),
  ___ = ___,
  distance_col = "distance",
  ignore_case = TRUE
)

# Print the number of rows of the newly created data frame
print(glue(
  "{n} out of 100 names were matched successfully",
  n = nrow(___)
))
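To see how the maximum distance changes the number of matches, here is a small base R sketch of the idea behind a string distance join. It relies on assumptions: the four names are made-up toy data, not the exercise's data frames; fuzzy_match is a hypothetical helper; and base R's adist() computes a plain Levenshtein distance, used here as a stand-in for "osa" (so a swapped pair of letters costs 2 rather than 1).

```r
# Toy data (hypothetical, not the data frames from the exercise)
user_input <- data.frame(
  user_input = c("jon smith", "Mary Jonse", "P. Miller", "bob")
)
database <- data.frame(
  name = c("John Smith", "Mary Jones", "Peter Miller", "Robert Brown")
)

# Pairwise edit distances, ignoring case
# (rows: user input, columns: database names)
d <- adist(user_input$user_input, database$name, ignore.case = TRUE)

# Keep every pair whose distance is at most max_dist, like an inner join
fuzzy_match <- function(max_dist) {
  hits <- which(d <= max_dist, arr.ind = TRUE)
  data.frame(
    user_input = user_input$user_input[hits[, "row"]],
    name       = database$name[hits[, "col"]],
    distance   = d[hits]
  )
}

# Count the matched pairs for a range of thresholds
matches <- sapply(1:5, function(k) nrow(fuzzy_match(k)))
names(matches) <- 1:5
print(matches)
```

On this toy data the typo-level errors match at small thresholds, the abbreviated "P. Miller" needs a larger one, and "bob" never matches within the tested range. Note that a join of this kind keeps every pair within the threshold, so with a generous maximum distance one input can match several database rows, and the row count then overstates the number of distinct names matched.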