Trying out different methods

Perfect, you have already learned about several methods of calculating string distances. Which method to use depends on the circumstances, so it's a good idea to experiment with the different methods and their parameters to get to know them better. For this exercise, you'll use the search term "Marya Carey", a mistyped version of the name "Mariah Carey". How similar is the mistyped name to the real one under different string distance methods?

The goal is to find parameters that yield a low distance between the two names above while keeping a large distance to the other names in the list, which do not refer to the person being searched for.
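
As a rough illustration of this goal, the sketch below uses base R's `adist()`, which computes a generalized Levenshtein distance. It is a stand-in here, not one of the stringdist methods used in the exercise, and the candidate list is an assumption made for illustration only.

```r
# Illustration only: the mistyped search term from the text and a
# hypothetical candidate list (not the exercise's `names` vector).
typo <- "Marya Carey"
candidates <- c("Mariah Carey", "Mick Jagger", "Michael Jackson")

# adist() from base R (utils) returns a matrix of generalized Levenshtein
# distances; as.vector() flattens the single row into a plain vector.
distances <- as.vector(adist(typo, candidates))
names(distances) <- candidates

# A good method/parameter choice gives the intended name the smallest
# distance, with a clear gap to every other candidate.
best_match <- candidates[which.min(distances)]
print(distances)
print(best_match)
```

The same comparison can be repeated with other distance methods to see which one separates the intended name from the rest most clearly.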

This exercise is part of the course

Intermediate Regular Expressions in R

Exercise instructions

  • Generate the q-grams for substring length values of 1 and 2.
  • Calculate the string distance between search and names using the q-gram method for substring length values of 1 and 2.
  • Calculate the string distance between search and names by using the "osa" method.
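
To make the q-gram steps above more concrete, here is a minimal base-R sketch of the idea: split each string into overlapping substrings of length q, count them, and sum the absolute differences of the counts. The helper names `qgram_list` and `qgram_distance` are hypothetical. This is not the stringdist implementation, only an illustration of the usual definition of q-gram distance.

```r
# Hypothetical helper: all overlapping substrings of length q in a string.
qgram_list <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: q-gram distance as the sum of absolute differences
# between the q-gram counts of the two strings.
qgram_distance <- function(a, b, q) {
  ga <- qgram_list(a, q)
  gb <- qgram_list(b, q)
  grams <- union(ga, gb)
  # Count how often each q-gram occurs in each string.
  count_a <- tabulate(match(ga, grams), nbins = length(grams))
  count_b <- tabulate(match(gb, grams), nbins = length(grams))
  sum(abs(count_a - count_b))
}

# Compare the mistyped name with the real one for q = 1 and q = 2.
d1 <- qgram_distance("Marya Carey", "Mariah Carey", q = 1)
d2 <- qgram_distance("Marya Carey", "Mariah Carey", q = 2)
print(c(q1 = d1, q2 = d2))
```

For these two strings the sketch gives 3 for q = 1 and 7 for q = 2: longer substrings are stricter, so a single typo disturbs more q-grams as q grows.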

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# qgrams() and stringdist() come from the stringdist package
library(stringdist)

search <- "Mariah Carey"
names <- c("M. Carey", "Mick Jagger", "Michael Jackson")

# Pass the values 1 and 2 as "q" and inspect the qgrams
qgrams("Mariah Carey", "M. Carey", q = ___)
qgrams("Mariah Carey", "M. Carey", q = ___)

# Try the qgram method on the variables search and names
stringdist(___, ___, method = "___", q = 1)
stringdist(___, ___, method = "___", q = 2)

# Try the default method (osa) on the same input and compare
stringdist(___, ___, method = "___")