Performing a string distance join
Combining two data sources is a very common task in data analysis. Whenever possible, you should join two tables on a clearly identifiable value such as an email address. But what if a user entered only their name and you have to look it up in a user database? The difficulty: people might abbreviate their first or last name, mistype something, or leave out parts entirely.
Two data frames are available in scope: user_input and database. The first contains the flawed user input and the second the correct names; both data sources contain the same 100 names. How many of them can you match with a string distance join? Note that no distance method is specified, so the default, the Optimal String Alignment distance ("osa"), will be used.
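To make the default distance more concrete, here is a minimal sketch using the stringdist package, which is assumed to be installed; the example names are made up and are not taken from the exercise data. Optimal String Alignment counts insertions, deletions, substitutions, and swaps of two adjacent characters as one edit each, whereas plain Levenshtein ("lv") needs two edits for a swap.

```r
library(stringdist)

# One missing letter: a single insertion turns "Jon Smith" into "John Smith"
d_insert <- stringdist("Jon Smith", "John Smith", method = "osa")

# Two swapped neighbours: OSA treats the transposition as a single edit
d_swap_osa <- stringdist("Jonh", "John", method = "osa")

# Plain Levenshtein has no transposition step, so it needs two substitutions
d_swap_lv <- stringdist("Jonh", "John", method = "lv")

print(c(insert = d_insert, swap_osa = d_swap_osa, swap_lv = d_swap_lv))
```

A typical typo such as two swapped letters therefore costs only one unit of distance under the default method.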
This exercise is part of the course
Intermediate Regular Expressions in R
Exercise instructions
- Join user_input and database with a maximum string distance max_dist so that exactly eighty names are matched successfully. Experiment until you find the right maximum distance.
- Use the newly created table joined to print a human-friendly report sentence.
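One way to experiment with the maximum distance is to try several values and count the matched rows each time. The sketch below does this on a small made-up data set; toy_input and toy_db are hypothetical stand-ins, not the exercise's user_input and database, and the fuzzyjoin package is assumed to be installed.

```r
library(fuzzyjoin)

# Hypothetical toy data (not the 100 names from the exercise)
toy_input <- data.frame(
  user_input = c("Jon Smith", "M. Curie", "alan turing"),
  stringsAsFactors = FALSE
)
toy_db <- data.frame(
  name = c("John Smith", "Marie Curie", "Alan Turing"),
  stringsAsFactors = FALSE
)

# Count the matched rows for each candidate maximum distance
candidates <- 0:4
counts <- sapply(candidates, function(d) {
  matched <- stringdist_join(
    toy_input,
    toy_db,
    by = c("user_input" = "name"),
    max_dist = d,
    ignore_case = TRUE
  )
  nrow(matched)
})
names(counts) <- paste0("max_dist_", candidates)
print(counts)
```

Raising the maximum distance admits more matches, but a value that is too generous can also pair names that do not belong together, so the smallest value that reaches the target count is usually the safer choice.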
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Packages used below (stringdist_join() comes from fuzzyjoin, glue() from glue)
library(fuzzyjoin)
library(glue)

# Join the data frames on a maximum string distance of 2
joined <- stringdist_join(
  user_input,
  database,
  by = c("user_input" = "name"),
  ___ = ___,
  distance_col = "distance",
  ignore_case = TRUE
)

# Print the number of rows of the newly created data frame
print(glue(
  "{n} out of 100 names were matched successfully",
  n = nrow(___)
))
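The report sentence relies on glue() evaluating named arguments inside the curly braces. A minimal sketch of that mechanism, using a hypothetical toy_joined table and a made-up total rather than the exercise data:

```r
library(glue)

# Hypothetical stand-in for a joined table with three matched rows
toy_joined <- data.frame(
  user_input = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Named arguments are available as temporary variables inside {}
report <- glue(
  "{n} out of {total} names were matched successfully",
  n = nrow(toy_joined),
  total = 5
)
print(report)
```

Passing the values as named arguments keeps the template string readable and avoids creating extra variables in the workspace.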