Exercise

Performing a string distance join

Bringing together two different data sources is a very common task in data analysis. Whenever possible, you should join two tables on a clearly identifiable value such as an email address. But what if a user entered only their name and you have to look it up in a user database? The difficulty is that people might abbreviate their first or last name, mistype something, or leave out parts entirely.

Two data frames are available in the workspace: user_input and database. The first contains the flawed user input and the second contains the correct names; both data sources cover the same 100 names. How many of them can you match with a string distance join? Note that because no distance method is specified, the default, Optimal String Alignment distance ("osa"), will be used.
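To get a feel for what the "osa" distance counts, here is a minimal sketch using the stringdist package (an assumption: the exercise itself does not say which package computes the distance). The example names are made up for illustration. OSA counts insertions, deletions, substitutions, and swaps of two adjacent characters as one edit each, while plain Levenshtein distance ("lv") needs two edits for a swap.

```r
library(stringdist)

# One substituted character: "i" became "y"
d_sub <- stringdist("Smith", "Smyth", method = "osa")

# Two adjacent characters swapped: "nh" instead of "hn"
d_osa <- stringdist("Jonh", "John", method = "osa")  # swap counts as 1 edit
d_lv  <- stringdist("Jonh", "John", method = "lv")   # swap counts as 2 edits

print(c(substitution = d_sub, swap_osa = d_osa, swap_lv = d_lv))
```

A typical typo therefore costs a distance of 1 or 2, whereas a heavily abbreviated name needs a larger maximum distance to match.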

Instructions

  • Join user_input and database with a maximum string distance max_dist chosen so that exactly 80 names are matched successfully. Experiment until you find the right maximum distance.
  • Use the newly created table joined to print a human-friendly report sentence.
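The two steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the exercise solution: it assumes the fuzzyjoin package, and the data frames user_input_toy and database_toy, their column names, and the max_dist value are all hypothetical. The right max_dist for the real data still has to be found by experimenting.

```r
library(fuzzyjoin)

# Hypothetical stand-ins for user_input and database (column names assumed)
user_input_toy <- data.frame(
  user_input = c("Jon Smith", "Maria Garcia", "P. Jones"),
  stringsAsFactors = FALSE
)
database_toy <- data.frame(
  name = c("John Smith", "Maria Garcia", "Peter Jones"),
  stringsAsFactors = FALSE
)

# Inner join: keep only pairs whose distance is at most max_dist.
# No method is given, so the default "osa" is used.
joined_toy <- stringdist_inner_join(
  user_input_toy, database_toy,
  by = c("user_input" = "name"),
  max_dist = 1,               # illustrative value only
  distance_col = "distance"   # store the computed distance per matched pair
)

# Human-friendly report sentence
cat(nrow(joined_toy), "out of", nrow(user_input_toy),
    "names were matched successfully\n")
```

With a small max_dist, the abbreviated "P. Jones" stays unmatched; raising it recovers more names but also increases the chance that one input matches several database rows, in which case the row count of the joined table overstates the number of distinct names matched.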