1. Checking membership
Nice work on Chapter 1! In Chapter 2, we'll discuss problems that arise in text and categorical data. Let's get started!
2. Categorical data
A variable is categorical when it can only take on values from a predefined set of values. Categorical variables represent distinct groups.
Marriage status, household income category, and t-shirt size are all examples of categorical variables.
As we discussed in Chapter 1, categorical variables are usually stored as factors in R.
3. Factors
Under the hood, factors are stored as numbers, where one number represents each category. Each number has a corresponding label to make it easier for humans to read and understand output.
For example, marriage status has two categories. Category 1 is labeled as unmarried, and category 2 is labeled as married.
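To make the number-plus-label idea concrete, here is a minimal sketch in base R. The vector name marriage_status and the four responses are made up for illustration; only the two categories come from the example above.

```r
# Hypothetical responses; the order of `levels` decides which integer
# code each label gets (unmarried = 1, married = 2).
marriage_status <- factor(
  c("married", "unmarried", "unmarried", "married"),
  levels = c("unmarried", "married")
)

# The integer codes stored under the hood
codes <- as.integer(marriage_status)
print(codes)

# The human-readable labels attached to codes 1 and 2
print(levels(marriage_status))
```

Printing marriage_status itself shows the labels rather than the codes, which is why factors read like text even though they are stored as integers.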
4. Factor levels
Factors have something called levels, which are all the different possible values that a factor can hold. Here, we see that there are four possible values that tshirt_size can have: small, medium, large, and extra-large.
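As a rough sketch of how to inspect levels, the base R functions levels() and nlevels() list and count them. The data below is hypothetical, and the abbreviations S, M, L, and XL are an assumed encoding of the four sizes.

```r
# Hypothetical orders; S/M/L/XL are assumed abbreviations for
# small, medium, large, and extra-large.
tshirt_size <- factor(
  c("M", "S", "XL", "L", "M"),
  levels = c("S", "M", "L", "XL")
)

# All values the factor is allowed to hold
print(levels(tshirt_size))

# How many distinct levels there are
print(nlevels(tshirt_size))
```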
5. Values that don't belong
Since factors have these predefined levels, they cannot have values that fall outside of those levels.
For example, if we asked people what t-shirt size they want, and someone responds with small-slash-medium, we won't be able to process that order since our vendor doesn't make that size.
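One way to see this in practice: when a factor is built with explicit levels, base R turns any value outside those levels into NA. The sketch below uses made-up responses and assumed size abbreviations, and also shows how to list the offending raw values with %in%.

```r
# Hypothetical free-text responses, including one invalid size
sizes <- c("S", "M", "S/M", "XL")

# Values not listed in `levels` become NA in the resulting factor
tshirt_size <- factor(sizes, levels = c("S", "M", "L", "XL"))
print(tshirt_size)

# Which responses could not be matched to a level?
unmatched <- is.na(tshirt_size)
print(unmatched)

# The raw values that fall outside the predefined set
invalid_sizes <- sizes[!sizes %in% levels(tshirt_size)]
print(invalid_sizes)
```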
6. How do we end up with these values?
How do we end up with values that aren't members of the predefined set of categories?
Inconsistencies like this can arise from data entry errors, where values are typed as free text instead of being chosen from a multiple-choice list, as well as from data parsing errors.
7. Filtering joins: a quick review
Recall that filtering joins are a type of join that keeps or removes observations from the first table, but doesn't add any new columns.
The first kind of filtering join is a semi-join, which answers the question, "What observations of X are also in Y?"
8. Filtering joins: a quick review
The other kind is an anti-join, which answers the question, "What observations of X are not in Y?"
We can use these filtering joins to find and remove values of categorical variables that don't belong.
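Before moving to the real example, here is a small sketch of both filtering joins with dplyr. The tables x and y and their contents are invented purely to show the behavior.

```r
library(dplyr)

# Two toy tables (hypothetical data)
x <- data.frame(id = c(1, 2, 3), color = c("red", "blue", "teal"))
y <- data.frame(color = c("red", "blue", "green"))

# Semi-join: rows of x that have a match in y
in_both <- semi_join(x, y, by = "color")
print(in_both)

# Anti-join: rows of x that have no match in y
only_in_x <- anti_join(x, y, by = "color")
print(only_in_x)
```

Both results keep exactly the columns of x; nothing from y is added.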
9. Blood type example
Let's take a look at an example.
We have a data frame called study_data, which contains data from a study about babies' blood types. study_data has the name, birthday, and blood type of each child.
We also have a data frame called blood_types that contains all the possible blood types that a human can have.
10. Blood type example
Notice the problem here? Jennifer has blood type Z-positive, which is not a real blood type. Luckily, we can use the blood_types data frame as our ground truth to fix this.
11. Finding non-members
To find invalid blood types in our study data, we want to find all the blood types in study_data that are NOT in the blood_types data frame. This means we'll need an anti-join.
12. Anti-join
We can use dplyr's anti_join function to get the rows of study_data with a blood type not in the blood_types data frame.
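A minimal sketch of that call follows. The data frames are rebuilt here so the example runs on its own: apart from Jennifer and her Z+ blood type, the names, birthdays, and blood types are made up, and the join column name blood_type is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the study data
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Jennifer", "Paul"),
  birthday = as.Date(c("2019-10-20", "2020-07-08", "2019-12-17", "2019-08-12")),
  blood_type = c("B-", "A-", "Z+", "O+")
)

# The eight valid ABO/Rh blood types, used as ground truth
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")
)

# Rows of study_data whose blood_type has no match in blood_types
invalid_rows <- anti_join(study_data, blood_types, by = "blood_type")
print(invalid_rows)
```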
13. Removing non-members
To remove the invalid blood types from study_data, we want to keep only the rows whose blood type appears in the official blood_types data frame. This means we'll need to use a semi-join.
14. Semi-join
We can use dplyr's semi_join function to get the rows of study_data that have a blood type contained in the official blood_types data frame. This removes the row with the Z-positive blood type.
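The same idea, sketched with semi_join. As before, the data is a hypothetical stand-in: only Jennifer's Z+ value comes from the example, and the column name blood_type is assumed.

```r
library(dplyr)

# Hypothetical stand-in for the study data
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Jennifer", "Paul"),
  birthday = as.Date(c("2019-10-20", "2020-07-08", "2019-12-17", "2019-08-12")),
  blood_type = c("B-", "A-", "Z+", "O+")
)

# The eight valid ABO/Rh blood types, used as ground truth
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")
)

# Keep only rows whose blood_type has a match in blood_types
clean_data <- semi_join(study_data, blood_types, by = "blood_type")
print(clean_data)
```

Because a semi-join only filters, clean_data has the same three columns as study_data and simply drops the unmatched row.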
15. Let's practice!
Time to practice checking membership!