1. Uniformity
Great work on Chapter 2! In this chapter, we'll focus on more advanced data cleaning problems.
2. Uniformity
The first problem we'll tackle is uniformity. Uniformity issues occur when continuous data points are recorded in different units or different formats. For example, temperature can be written in degrees Celsius or Fahrenheit, weight can be measured in kilograms, grams, or pounds, money can be in US dollars, UK pounds, or even Japanese yen, and dates can be written in different orders.
3. Where do uniformity issues come from?
Uniformity issues can arise from combining multiple data sources that store data in different ways, or from unstructured data entry that doesn't require specific units or formatting.
4. Finding uniformity issues
Let's take a look at the nyc_temps dataset, which contains daily average temperatures in New York City during April of 2019.
5. Finding uniformity issues
Since outliers can be a sign of uniformity issues, it's usually helpful to do some basic plotting to identify any outliers.
Let's create a scatter plot of the dataset.
This doesn't look quite right - there are three unusually high temperatures. An outdoor temperature of over 50 degrees Celsius would be very concerning.
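As a rough sketch of this check, the code below builds a small, made-up stand-in for nyc_temps (the real dataset isn't reproduced here, and the column names date and temp are assumptions), draws a scatter plot with base R's plot function, and pulls out the rows above 50 for a closer look.

```r
# Hypothetical stand-in for nyc_temps; values and column names are illustrative only.
nyc_temps <- data.frame(
  date = seq(as.Date("2019-04-01"), by = "day", length.out = 10),
  temp = c(4.9, 6.2, 58.5, 8.0, 10.3, 60.8, 12.1, 9.4, 57.2, 14.6)
)

# Scatter plot of temperature over time; outliers stand out visually.
plot(
  nyc_temps$date, nyc_temps$temp,
  xlab = "Date", ylab = "Temperature",
  main = "Daily average temperature"
)

# List the suspicious rows: anything above 50 would be implausible in Celsius.
suspect <- nyc_temps[nyc_temps$temp > 50, ]
print(suspect)
```

The cutoff of 50 is a judgment call for this kind of data; a different dataset would need its own threshold.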
6. What to do?
There's no one best option for dealing with this data. In order to figure out what to do, we'll need to do some deeper research into the dataset.
There were no big climate events in New York during this month, so there's probably something else up with these values.
After speaking with the person in charge of temperature data collection, we learn that on these three days, the thermometer was broken and data needed to be pulled in from another source. However, this other source measured temperature in Fahrenheit instead of Celsius.
7. Unit conversion
Since we know exactly why these data points aren't uniform, we can adjust them to fit with the rest of the data points. The formula to convert a temperature from Fahrenheit to Celsius is C = (F - 32) × 5/9.
We only want to apply this formula to the values that are in Fahrenheit. To do this, we'll use the ifelse function.
ifelse takes in a condition, the value to use if the condition is true, and the value to use if the condition is false.
Let's add a column to nyc_temps called temp_c. ifelse will check if the original temperature is over 50, convert it to Celsius if it is, and keep the original temperature otherwise.
The first temperature was already in Celsius, so temp_c contains the same value, but 58.5 was converted to 14.7.
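A minimal sketch of this conversion in base R is shown here. The data frame is a made-up stand-in for nyc_temps, and the name of the original column (temp) is an assumption; only temp_c comes from the walkthrough above.

```r
# Hypothetical stand-in for nyc_temps; the column name `temp` is an assumption.
nyc_temps <- data.frame(
  date = as.Date(c("2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04")),
  temp = c(4.9, 58.5, 6.1, 60.8)
)

# Values above 50 are treated as Fahrenheit and converted with (F - 32) * 5 / 9;
# all other values are assumed to be Celsius already and are kept unchanged.
nyc_temps$temp_c <- ifelse(
  nyc_temps$temp > 50,
  (nyc_temps$temp - 32) * 5 / 9,
  nyc_temps$temp
)

print(nyc_temps)
```

ifelse is vectorized, so the condition is checked row by row and each row gets either the converted or the original value.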
8. Unit conversion
If we create the same scatterplot as before using the temp_c column, the temperatures all range between 4 and 20, which matches what we expected.
9. Date uniformity
Dates can also pose uniformity problems, since there are lots of different ways to write them. In this example, dates are written in three different ways.
We can use special formatting strings to convert them to uniform Date objects so they're all written in the same way. These are the ones we'll need for this dataset, but there are many others like these - you can always pull up a list of all available date formats from your R console.
10. Parsing multiple formats
To convert these all to Date objects, we'll use the parse_date_time function from the lubridate package. We pass in the vector of dates to convert, and a vector of format strings to the "orders" argument. This contains the three different formats that we saw in our data frame.
Just like that, all of the dates are in the exact same format!
If we try to parse a date that's not in one of the formats we specified, NA will be returned instead.
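The sketch below shows what this call might look like. The date strings are invented for illustration, and the three format strings are assumed to match them; the codes themselves (%Y, %m, %d, %y, %B) are the conversion specifications documented on R's ?strptime help page. Parsing the month name with %B assumes an English locale.

```r
library(lubridate)

# Hypothetical dates written in three different styles.
dates <- c("2019-04-17", "04/23/19", "April 5, 2019")

# One format string per style seen in the data.
formats <- c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y")

# parse_date_time tries the formats in `orders` and returns date-times in UTC.
parsed <- parse_date_time(dates, orders = formats)
print(parsed)

# A string that matches none of the formats comes back as NA (with a warning).
unparsed <- suppressWarnings(parse_date_time("Monday", orders = formats))
print(unparsed)
```

parse_date_time returns POSIXct date-times rather than Date objects; wrapping the result in as.Date drops the time component if only the date is needed.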
11. Ambiguous dates
Sometimes dates can be ambiguous and you won't be able to tell what format they follow.
For example, is this date in February or April?
As with other cleaning tasks, this is highly dependent on your data and where it came from. One option is to treat these dates as missing. If your data comes from multiple sources, you may notice that one source uses one format and another source uses a different format. From there, you'll be able to make an educated guess about the format of the date based on which source it came from. You can also try to figure out what the format is based on other dates in the dataset. For example, if you know there should be one data point per date, you can check which of the candidate dates is otherwise missing from the dataset.
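To make the ambiguity concrete, here is a small sketch using lubridate's mdy and dmy functions on a made-up date string (the string is an illustration, not taken from the dataset). The same text parses to two different dates depending on which order is assumed.

```r
library(lubridate)

# Hypothetical ambiguous date: both readings are valid calendar dates.
ambiguous <- "02/04/2019"

as_mdy <- mdy(ambiguous)  # month first: February 4, 2019
as_dmy <- dmy(ambiguous)  # day first: April 2, 2019

print(as_mdy)
print(as_dmy)
```

Neither result raises an error or a warning, which is why the source of the data has to decide which reading is correct.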
12. Let's practice!
Time to practice unit and date conversions!