Performing a string distance join
Bringing together two different data sources is a very common task in data analysis. Whenever possible, you should join two tables on a clearly identifiable value such as an email address. But what if a user only entered their name and you have to look it up in a user database? The difficulty: people might abbreviate their first or last name, mistype something, or leave out parts entirely.
Two data frames are available in your workspace: user_input and database. The first contains the flawed user input and the second the correct names; both contain the same 100 names. How many of them can you match with a string distance join? Note that no distance method is specified, so the default, Optimal String Alignment distance ("osa"), will be used.
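To get a feel for what the default method measures, here is a minimal sketch that calls the stringdist package directly. The example names are made up for illustration and do not come from the exercise data.

```r
library(stringdist)

# OSA counts insertions, deletions, substitutions and swaps of two
# adjacent characters. A swapped pair ("ei" -> "ie") costs 1 under OSA,
# whereas plain Levenshtein ("lv") needs 2 substitutions for it.
osa_swap <- stringdist("Keith", "Kieth", method = "osa")
lv_swap  <- stringdist("Keith", "Kieth", method = "lv")
print(c(osa = osa_swap, lv = lv_swap))

# An abbreviated first name is much more expensive than a single typo:
# every dropped or replaced character adds one edit.
abbrev <- stringdist("Emily Brown", "E. Brown", method = "osa")
print(abbrev)
```

Small typos therefore stay within a low maximum distance, while abbreviations and omitted name parts only match once the threshold is raised.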
This exercise is part of the course
Intermediate Regular Expressions in R
Exercise instructions
- Join user_input and database with a maximum string distance max_dist so that exactly eighty names are matched successfully. Experiment until you find the right maximum distance.
- Use the newly created table joined to print a human-friendly report sentence.
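The effect of the maximum distance can be explored on a small scale first. The sketch below uses two hypothetical toy data frames (typed and known are invented names, not the exercise's data) and counts how many rows survive the join at several thresholds; it assumes the fuzzyjoin package is installed.

```r
library(fuzzyjoin)

# Hypothetical toy data, not the data frames from the exercise
typed <- data.frame(user_input = c("Jon Smith", "maria garcia", "Andy Lee"))
known <- data.frame(name = c("John Smith", "Maria Garcia", "Andrew Lee"))

# Try increasing thresholds and record how many rows the inner join keeps
counts <- integer(0)
for (d in 1:3) {
  matched <- stringdist_join(
    typed,
    known,
    by = c("user_input" = "name"),
    max_dist = d,               # largest string distance still counted as a match
    ignore_case = TRUE,         # compare names without regard to letter case
    distance_col = "distance"   # keep the computed distance for inspection
  )
  counts[d] <- nrow(matched)
  cat("max_dist =", d, "->", counts[d], "rows\n")
}
```

A threshold that is too low misses abbreviated names, while one that is too high can start pairing unrelated names, so the row count is worth checking at each step.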
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Join the data frames on a maximum string distance of 2
joined <- stringdist_join(
  user_input,
  database,
  by = c("user_input" = "name"),
  ___ = ___,
  distance_col = "distance",
  ignore_case = TRUE
)
# Print the number of rows of the newly created data frame
print(glue(
  "{n} out of 100 names were matched successfully",
  n = nrow(___)
))