The cutoff point
In this exercise, and throughout this chapter, you'll be working with the restaurants DataFrame, which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but first you need to clean your data.
This version of restaurants has been collected from many sources, and its cuisine_type column is riddled with typos. The column should contain only the italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn't scalable, so it's best to use string similarity instead.
Before doing so, you want to establish the cutoff point for the similarity score. You'll do this with thefuzz's process.extract() function, by finding the similarity score of the most distant typo of each category.
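The idea of a cutoff point can be sketched without the course data. The minimal example below uses Python's standard-library difflib as a stand-in for thefuzz, so the scores will not match what process.extract() returns. The list of cuisine values, the list of known typos, and the helper names similarity and rank_by_similarity are all hypothetical and exist only for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score (difflib ratio, not thefuzz's scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def rank_by_similarity(query, choices):
    """Score every choice against query and sort from most to least similar."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical unique values, standing in for the cuisine_type column
unique_types = ["asian", "asiane", "asiann", "asiat",
                "italian", "italiann", "american", "americann"]

# Rank every unique value against the clean category name
ranked = rank_by_similarity("asian", unique_types)
for choice, score in ranked:
    print(choice, score)

# Values judged by eye to be typos of 'asian' (a hypothetical, hand-picked list)
known_asian_typos = ["asiane", "asiann", "asiat"]

# The cutoff is the score of the most distant typo: anything scoring at or
# above it would be treated as 'asian' when remapping
cutoff = min(similarity("asian", typo) for typo in known_asian_typos)
print("cutoff:", cutoff)
```

In practice you would repeat this for each clean category and pick a single threshold that keeps every genuine typo while excluding values from the other categories.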
This exercise is part of the course Cleaning Data in Python.
Have a go at this exercise by completing this sample code.
# Import process from thefuzz
____
# Store the unique values of cuisine_type in unique_types
unique_types = ____
# Calculate similarity of 'asian' to all values of unique_types
print(process.____('____', ____, limit = len(____)))
# Calculate similarity of 'american' to all values of unique_types
print(____('____', ____, ____))
# Calculate similarity of 'italian' to all values of unique_types
print(____)