1. Data type constraints
Hi and welcome to the course. My name is Maggie, and I'll be guiding you as you learn how to clean data in R.
2. Course outline
In this course, you'll learn how to diagnose and locate different problems in your data and how they can come up during your workflow.
3. Course outline
You'll also learn about what can go wrong if your data isn't properly cleaned,
4. Course outline
and how to address different types of dirty data.
5. Course outline
In this chapter, we'll discuss the most common problems you might encounter and how to address them. Let's get started!
6. Why do we need clean data?
To understand why we need clean data, let's take a look at the data science workflow.
In a typical data science workflow, we usually access raw data first, explore and process it, then develop insights. Finally, we report these insights.
7. Why do we need clean data?
Dirty data can appear before we even access the data, due to mistakes such as typos and misspellings.
8. Why do we need clean data?
If we don't address these mistakes early on, they'll follow us through our entire workflow, which means we could end up drawing incorrect conclusions.
9. Data type constraints
You've probably encountered different types of data before, such as text, numbers, categories, and dates.
Each of these data types is treated differently, so if a variable isn't stored as the correct data type, we risk compromising our analysis.
10. Glimpsing at data types
Let's look at an example. Here's a data frame containing revenue generated and quantity of items sold for different sales orders.
To look at the data types of each column in "sales", we can load the dplyr package and use glimpse. This gives us the data type for each column. The order ID and quantity have data type "dbl" or double, which is R's default numeric type - the name refers to double-precision floating point, the format used to store the number. However, the revenue column has the data type character, when it should be numeric.
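As a rough sketch of what this looks like in code, here's a small stand-in for the sales data frame. The column names order_id, revenue, and quantity and all of the values are assumptions made up for illustration, not the course's actual data.

```r
library(dplyr)

# Hypothetical stand-in for the course's sales data (made-up values)
sales <- data.frame(
  order_id = c(7432, 7433, 7434),
  revenue  = c("5,400", "5,600", "4,000"),  # stored as text because of the commas
  quantity = c(30, 28, 41),
  stringsAsFactors = FALSE
)

# glimpse() lists each column with its type, such as <dbl> or <chr>
glimpse(sales)
```

With this data, glimpse should report order_id and quantity as dbl and revenue as chr.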
11. Checking data types
We can use the is-dot-numeric function on the revenue column and see that it's not numeric.
Another way to do this check is to use the assert_is_numeric function from the assertive package. This provides extra protection because, when the check fails, it throws an error and stops our script from running, so we'll immediately know that something's amiss.
If we call assert_is_numeric on something that is numeric, nothing is returned.
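Here's a minimal sketch of both checks, using a made-up revenue vector. It assumes the assertive package may not be installed, so that call is guarded, and the error is caught with tryCatch only so the example can run to the end.

```r
# Hypothetical revenue values stored as text
revenue <- c("5,400", "5,600", "4,000")
quantity <- c(30, 28, 41)

revenue_is_numeric <- is.numeric(revenue)    # FALSE
quantity_is_numeric <- is.numeric(quantity)  # TRUE

# assert_is_numeric() raises an error for non-numeric input.
# We catch it here only to show the message instead of stopping.
if (requireNamespace("assertive", quietly = TRUE)) {
  msg <- tryCatch(
    {
      assertive::assert_is_numeric(revenue)
      "no error"
    },
    error = function(e) conditionMessage(e)
  )
  print(msg)
}
```

If assertive isn't available, stopifnot(is.numeric(revenue)) in base R gives a similar fail-fast check.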
12. Checking data types
Most common data types have an is-dot function that returns TRUE or FALSE and an assert-is function that returns nothing or an error.
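To illustrate, here are a few of the base R is-dot functions applied to made-up values. This is just a sample and not a complete list.

```r
# Each is.*() function answers a yes/no question about the type
chr_check <- is.character("clothing")            # TRUE
lgl_check <- is.logical(c(TRUE, FALSE))          # TRUE
fct_check <- is.factor(factor(c("a", "b")))      # TRUE
num_check <- is.numeric("42")                    # FALSE: "42" is text

print(c(chr_check, lgl_check, fct_check, num_check))
```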
13. Why does data type matter?
We can use the class function on the revenue column to see that it's a character type.
If we want to know what the average revenue is, we get an NA and a warning, since taking the mean of text doesn't make much sense.
This is why it's important to check that our data types are what we expect. Otherwise, we might think we're getting one thing, in this case an average, when we're actually getting something completely different, which is an NA. We'll need to convert this column to a numeric type in order to get the average.
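A quick sketch of the problem, again with made-up revenue values:

```r
# Hypothetical revenue values stored as text
revenue <- c("5,400", "5,600", "4,000")

revenue_class <- class(revenue)  # "character"

# mean() on a character vector returns NA and warns that the
# argument is not numeric or logical; the script keeps running
avg_revenue <- mean(revenue)
print(avg_revenue)
```

Because this is only a warning and not an error, it's easy to miss in a longer script.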
14. Comma problems
Printing revenue shows a comma in each number. We'll need to remove them before converting the strings to numbers.
15. Character to number
This can be done using str_remove from the stringr package. The first argument is the string that we want to remove from, which is the revenue column. The second argument is what we want to remove, the comma.
If we look at revenue_trimmed, all the commas are gone!
To convert revenue_trimmed to a numeric type, we pass it into as-dot-numeric.
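These two steps might look like this, using a made-up revenue vector in place of the real column:

```r
library(stringr)

# Hypothetical revenue values stored as text
revenue <- c("5,400", "5,600", "4,000")

# Remove the comma from each string
revenue_trimmed <- str_remove(revenue, ",")
print(revenue_trimmed)  # "5400" "5600" "4000"

# Convert the cleaned strings to numbers
revenue_numeric <- as.numeric(revenue_trimmed)
print(class(revenue_numeric))  # "numeric"
```

Note that str_remove only removes the first match in each string. For values with more than one comma, such as "1,234,567", str_remove_all would be the safer choice.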
16. Putting it together
We can put it all together into the sales data frame by calling mutate. We create a new column called revenue_usd using str_remove and as-dot-numeric.
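One way this could be written is sketched below. The data frame is a made-up stand-in, and nesting str_remove inside as.numeric is just one possible layout. The same result can be reached in two separate steps inside mutate.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the course's sales data (made-up values)
sales <- data.frame(
  order_id = c(7432, 7433, 7434),
  revenue  = c("5,400", "5,600", "4,000"),
  quantity = c(30, 28, 41),
  stringsAsFactors = FALSE
)

# Strip the comma, then convert to numeric, in a new column
sales <- sales %>%
  mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))

glimpse(sales)
```

Keeping the original revenue column alongside revenue_usd makes it easy to spot-check the conversion.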
17. Same function, different outcomes
Now, taking the mean of revenue_usd gives us the average revenue, instead of the NA that we got earlier.
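Here's a small before-and-after comparison on made-up values, to show the same call to mean behaving differently depending on the data type:

```r
library(stringr)

# Hypothetical revenue values stored as text
revenue <- c("5,400", "5,600", "4,000")
revenue_usd <- as.numeric(str_remove(revenue, ","))

# Before: text input gives NA (the warning is silenced here on purpose)
mean_before <- suppressWarnings(mean(revenue))

# After: numeric input gives the actual average
mean_after <- mean(revenue_usd)

print(mean_before)  # NA
print(mean_after)   # 5000
```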
18. Converting data types
Just like the is-dot and assert-is functions, there are as-dot functions to convert to any data type.
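For instance, here are a few base R as-dot conversions on made-up values:

```r
# Number to text
as_chr <- as.character(123)          # "123"

# Text to logical
as_lgl <- as.logical("TRUE")         # TRUE

# Text to factor
as_fct <- as.factor(c("food", "clothing", "food"))

# Text in year-month-day form to Date
as_dt <- as.Date("2024-01-15")

print(class(as_fct))  # "factor"
print(class(as_dt))   # "Date"
```

When a value can't be converted, as in as.numeric("abc"), R returns NA with a warning and doesn't stop, so it's worth checking for new NAs after converting.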
19. Watch out: factor to numeric
Be careful when converting a factor to a numeric. Factors are a data type that represents a limited set of possible categories. Here, we have a product_type vector, which is a factor. 1000 represents clothing, 2000 represents food, and 3000 represents electronics.
If we call as-dot-numeric on product_type, we get the factor's internal integer codes, such as 1, 2, and 3, which isn't what we're looking for. This is due to the way that factors are encoded in R: each category is stored as an integer that points to its label. Instead, we need to use as-dot-character first, and then as-dot-numeric.
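Here's a short sketch of the pitfall with a made-up product_type factor. The specific values are assumptions chosen to match the 1000, 2000, and 3000 codes.

```r
# Hypothetical product_type factor; levels sort as "1000", "2000", "3000"
product_type <- factor(c("2000", "1000", "3000", "1000"))

# Direct conversion returns the internal level codes
wrong <- as.numeric(product_type)
print(wrong)  # 2 1 3 1

# Going through character first recovers the original values
right <- as.numeric(as.character(product_type))
print(right)  # 2000 1000 3000 1000
```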
20. Let's practice!
Time to use your new knowledge of data types!