Trying out different methods
Perfect, you have already learned several methods of calculating string distances. Which method to use depends on the circumstances, so it's a good idea to experiment with the different methods and their parameters to get to know them better. For this exercise you'll use the search term "Marya Carey", a mistyped version of the name "Mariah Carey". How similar is the mistyped name to the real one under different string distance methods?
The goal is to find parameters that yield a low distance between the two names described above while keeping a large distance to the names in the list that do not refer to the person being searched for.
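To make the q-gram idea concrete, here is a minimal base R sketch of how a q-gram distance can be computed: list every substring of length q in each string, count them, and sum the absolute differences of the counts. The helper names `qgram_list` and `qgram_distance` are hypothetical and written only for illustration. This is not the stringdist package's implementation, although it follows the same definition of the distance.

```r
# List all overlapping substrings of length q in x (hypothetical helper).
qgram_list <- function(x, q) {
  n <- nchar(x)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(x, starts, starts + q - 1)
}

# q-gram distance: sum of absolute differences between the q-gram counts
# of the two strings (hypothetical helper, not the package function).
qgram_distance <- function(a, b, q) {
  ga <- qgram_list(a, q)
  gb <- qgram_list(b, q)
  grams <- union(ga, gb)
  count_a <- tabulate(match(ga, grams), nbins = length(grams))
  count_b <- tabulate(match(gb, grams), nbins = length(grams))
  sum(abs(count_a - count_b))
}

qgram_distance("Marya Carey", "Mariah Carey", q = 1)  # 3
qgram_distance("Marya Carey", "Mariah Carey", q = 2)  # 7
```

With q = 1 only the counts of single characters matter, so the two spellings differ by just one "y", one "i" and one "h". With q = 2 the order of neighbouring characters also counts, which is why the same typo produces a larger distance.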
This exercise is part of the course Intermediate Regular Expressions in R.
Exercise instructions
- Generate the q-grams for substring length values of 1 and 2.
- Calculate the string distance between search and names using the q-gram method for substring length values of 1 and 2.
- Calculate the string distance between search and names by using the "osa" method.
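For comparison with the q-gram approach, the "osa" (optimal string alignment) distance counts the minimum number of insertions, deletions, substitutions and swaps of adjacent characters needed to turn one string into the other. The base R sketch below shows the usual dynamic-programming formulation. The function name `osa_distance` is hypothetical and uses unit costs for every operation; it is an illustration of the technique, not the stringdist package's code.

```r
# Optimal string alignment distance with unit costs (hypothetical helper).
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      best <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution or match
      )
      # Swap of two adjacent characters counts as a single edit.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1L)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

osa_distance("Marya Carey", "Mariah Carey")  # 2
osa_distance("ab", "ba")                     # 1
```

Here the typo costs two edits (substitute "y" with "i", insert "h"), so an edit-based method treats the misspelling as a small change regardless of how many q-grams it disturbs.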
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# qgrams() and stringdist() come from the stringdist package
library(stringdist)

search <- "Mariah Carey"
names <- c("M. Carey", "Mick Jagger", "Michael Jackson")
# Pass the values 1 and 2 as "q" and inspect the qgrams
qgrams("Mariah Carey", "M. Carey", q = ___)
qgrams("Mariah Carey", "M. Carey", q = ___)
# Try the qgram method on the variables search and names
stringdist(___, ___, method = "___", q = 1)
stringdist(___, ___, method = "___", q = 2)
# Try the default method (osa) on the same input and compare
stringdist(___, ___, method = "___")