Correcting inconsistency
Now that you've identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you'll use the new tools at your disposal to fix the inconsistent values in sfo_survey. Fixing them is preferable to removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
dplyr and stringr are loaded and sfo_survey is available.
This exercise is part of the course
Cleaning Data in R
Exercise instructions
- Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed.
- Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase.
- Count the number of occurrences of each category in dest_size_trimmed.
- Count the number of occurrences of each category in cleanliness_lower.
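As a rough illustration of the two string operations named above, the sketch below applies stringr's str_trim() and str_to_lower() to small character vectors. The vectors sizes and ratings are hypothetical examples and are not taken from sfo_survey.

```r
library(stringr)

# Hypothetical values for illustration only (not from sfo_survey)
sizes   <- c("  Small  ", "Medium", "Large ")
ratings <- c("Clean", "CLEAN", "somewhat clean")

# str_trim() removes leading and trailing whitespace (side = "both" by default)
trimmed <- str_trim(sizes)

# str_to_lower() converts every character to lowercase
lowered <- str_to_lower(ratings)

print(trimmed)
print(lowered)
```

Both functions are vectorized, so they can be applied to a whole column inside mutate().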
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = ___,
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = ___)

# Count values of dest_size_trimmed
sfo_survey %>%
  ___

# Count values of cleanliness_lower
sfo_survey %>%
  ___
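
One possible way to approach the pattern in the template is sketched here on a small stand-in data frame. The name toy_survey and its values are assumptions made up for illustration; the real sfo_survey has other columns and contents, and the exercise may accept other solutions.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for sfo_survey (values invented for illustration)
toy_survey <- data.frame(
  dest_size   = c("Small", "  Small", "Large  ", "Large"),
  cleanliness = c("Clean", "CLEAN", "dirty", "Dirty")
)

# Add cleaned copies of both columns, keeping the originals untouched
toy_survey <- toy_survey %>%
  mutate(dest_size_trimmed = str_trim(dest_size),
         cleanliness_lower = str_to_lower(cleanliness))

# Count how often each cleaned category occurs
size_counts  <- toy_survey %>% count(dest_size_trimmed)
clean_counts <- toy_survey %>% count(cleanliness_lower)

print(size_counts)
print(clean_counts)
```

Writing the cleaned values to new columns rather than overwriting dest_size and cleanliness makes it easy to compare the counts before and after the fix.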