Correcting inconsistency
Now that you've identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you'll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
dplyr and stringr are loaded and sfo_survey is available.
This exercise is part of the course Cleaning Data in R.
Exercise instructions
- Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed.
- Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase.
- Count the number of occurrences of each category in dest_size_trimmed.
- Count the number of occurrences of each category in cleanliness_lower.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = ___,
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = ___)

# Count values of dest_size_trimmed
sfo_survey %>%
  ___

# Count values of cleanliness_lower
sfo_survey %>%
  ___
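
One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. It assumes stringr's str_trim() and str_to_lower() are the intended tools, and it builds a small, made-up stand-in for sfo_survey so the snippet runs on its own. The sample values and the dest_counts and clean_counts names are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for sfo_survey (the real survey data is not reproduced here)
sfo_survey <- data.frame(
  dest_size   = c("  Small  ", "Medium", "Large  ", "  Small"),
  cleanliness = c("Clean", "AVERAGE", "clean", "Average"),
  stringsAsFactors = FALSE
)

# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  mutate(
    # Remove leading and trailing whitespace
    dest_size_trimmed = str_trim(dest_size),
    # Convert every value to lowercase
    cleanliness_lower = str_to_lower(cleanliness)
  )

# Count values of dest_size_trimmed
dest_counts <- sfo_survey %>%
  count(dest_size_trimmed)

# Count values of cleanliness_lower
clean_counts <- sfo_survey %>%
  count(cleanliness_lower)

print(dest_counts)
print(clean_counts)
```

After cleaning, values that differed only by surrounding whitespace or by letter case fall into the same category, so the counts reflect the true categories without any rows being dropped.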