1. Fuzzy joins
The concept of string distances is not an end in itself. It is only useful when applied to a real problem. Probably the most important application of string distances are so called fuzzy joins.
2. A regular join
A join operation is when we want to combine two tables and form one out of it. In this example we have Table "a" with the columns "user name" and "user id" and a table "b" with "user id" and "email". We would then join the two tables on the column "user id", as it is present in both of the tables. The result is a new table with all three columns "user name", "user id" and "email". The emails were added to that user where the "user id" from tables "a" and "b" matched up. This works well as the user id is a number that is unique and is perfectly identical in both tables.
3. A fuzzy join
But we are not always in the privileged situation where we have perfectly matching user ids. Imagine a situation where we have a list of names inputted by users. Most of them were able to type their name correctly but some users had typos or left out parts of their name so a regular join would not find the name in our database.
This is where a fuzzy join can help us. A fuzzy join is a join where the values of the first and the second table do not need to match perfectly but are allowed to have slight differences.
4. The fuzzyjoin package
To perform a fuzzy join we will use the "fuzzyjoin" package. It contains multiple functions, the first that we will use is the "stringdist join" function. As the name suggests it can join two tables based on a preset maximum string distance.
As the first two arguments we pass it the two tables that we want to join. Followed by the "by" argument that specifies the column names in both tables.
Internally the function uses the "stringdist" package that you have used in the last lesson. For this reason we can pass it the same methods as we passed to the "stringdist" function. We'll use "lv" for Levenshtein distance here. The "max dist" argument does the same thing as the "maxDist" argument in the "amatch" function, it is spelled differently though - with an underscore.
One additional argument that can come in handy later on is the "distance col". We can pass it a string that will be used as a column name. This column will then contain the string distances as numbers.
5. stringdist_join: Result
The stringdist join helps us to achieve the same result as if we had user ids: the faulty user input was correctly matched to the correct user names and email addresses. Luckily our three names were easily distinguishable and the user input and the real name had an edit distance of one. This process of combining multiple data sources into one is sometimes also called "record linkage".
6. Let's practice!
Alright, let's try this out!