1. Introduction to qualitative data
Hi, my name is Emily Robinson, and welcome to this course, where we'll learn how to effectively wrangle and plot non-numerical, or qualitative, data.
2. Course overview
We'll start by learning how to identify and inspect these variables in a dataset. Then we'll move on to using the forcats package by Hadley Wickham to manipulate the variables by renaming categories, changing their order, and collapsing multiple groups into one. In the third chapter, we'll see how we can make effective visualizations by combining forcats with other tidyverse packages like dplyr, tidyr, stringr, and ggplot2.
3. Final chapter
In our final chapter, we'll recreate this visualization from the FiveThirtyEight blog using all the tools we have learned. We've accessed this data from the fivethirtyeight R package, which provides access to the code and datasets published by FiveThirtyEight.
4. What are qualitative variables?
This course focuses on two types of qualitative data: categorical, or nominal, data and ordinal data. Categorical data fall into unordered groups, while ordinal data have an inherent order but not a constant distance between the groups. Both types of data have a fixed and known set of possible values.
5. Categorical (nominal) data
One example of categorical data is a person's occupation. You might have a survey that has people pick their occupation from a list of 30, such as doctor, teacher, or engineer, with an extra category for other. We could think of ways to order these occupations, such as by median salary or years of education needed, but the occupations themselves don't have an inherent order.
6. Ordinal data
On the other hand, if we asked people about their annual income and offered four choices, "$0-$50,000", "$50,000-$150,000", "$150,000-$500,000", and "more than $500,000", this would be an ordinal variable, because these groups go from smallest to largest. However, there's not a constant distance between each group - if you were asked to construct a mean salary from this data, you couldn't do it. This is what makes it qualitative instead of quantitative data.
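As a minimal sketch of the income example in base R: the responses below are made up, and the names income_levels, responses, and income are only illustrative. An ordered factor keeps the ranking of the groups but refuses arithmetic such as a mean.

```r
# Hypothetical survey answers; the labels are based on the four choices above
income_levels <- c("$0-$50,000", "$50,000-$150,000",
                   "$150,000-$500,000", "more than $500,000")
responses <- c("$50,000-$150,000", "$0-$50,000",
               "more than $500,000", "$50,000-$150,000")

# ordered = TRUE records that the levels run from smallest to largest
income <- factor(responses, levels = income_levels, ordered = TRUE)

# Order comparisons are meaningful for an ordered factor
first_is_higher <- income[1] > income[2]
first_is_higher

# Arithmetic is not: mean() warns and returns NA for a factor
income_mean <- suppressWarnings(mean(income))
income_mean
```

Here the comparison is TRUE because the first answer falls in a higher bracket than the second, while the mean comes back as NA.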
7. Qualitative variables in R
R has two ways to represent qualitative variables: as characters and as factors. There are some differences under the hood, but generally, you'll use factors for categorical and ordinal variables and characters otherwise.
For example, names would best be represented as characters, because there's no limit to the possible number of names! On the other hand, a survey question where you can select which programming languages you know among 40 possible answers can be represented as a factor.
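The contrast can be sketched in base R with made-up data. The variable names are hypothetical, and the languages question is simplified to one answer per person with only four possible values instead of 40.

```r
# Names: an open-ended set of values, so a character vector fits
respondent_names <- c("Ana", "Bo", "Chidi", "Ana")

# Languages: a fixed, known set of values, so a factor fits.
# "Julia" is an allowed answer that nobody happened to pick.
languages <- factor(c("R", "Python", "R", "SQL"),
                    levels = c("Python", "R", "SQL", "Julia"))

class(respondent_names)  # "character"
class(languages)         # "factor"

# A factor remembers every allowed value, even unused ones
levels(languages)
language_counts <- table(languages)
language_counts          # Julia is listed with a count of 0
```

Keeping the unused level is one of the practical differences: a character vector would have no record that "Julia" was ever an option.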
8. Qualitative variables in R
So how do you know whether your variable is currently stored as a factor or character? There are two main methods. First, you can look at the printed output of your tibble. If you store your dataset as a tibble, a modern data frame, each column type is automatically printed out below or next to the column name. Let's print out the college_all_ages dataset, another one from the fivethirtyeight package.
We can see that there aren't any factors - major and major_category are both character columns.
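To illustrate what that printout looks like without loading the real dataset, here is a small stand-in tibble. It assumes the tibble package is installed; the values and the name majors are made up and are not taken from college_all_ages.

```r
library(tibble)

# Stand-in tibble with invented values (not the real college_all_ages data)
majors <- tibble(
  major = c("Major A", "Major B", "Major C"),
  major_category = c("Category X", "Category X", "Category Y"),
  total = c(1200, 850, 430)
)

# Printing a tibble shows a type abbreviation under each column name:
# <chr> for character, <dbl> for double, and <fct> for a factor
majors
```

Both text columns are marked as character here, because tibble() does not convert strings to factors.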
9. Qualitative variables in R
The second way is to use the function is.factor(). It returns TRUE or FALSE depending on whether, you guessed it, the input is a factor. When we check the major_category column, we see, just as above, it's not a factor.
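A short sketch of that check, using a made-up data frame in place of the real dataset (the name majors and its values are hypothetical):

```r
# Stand-in data frame with invented values
majors <- data.frame(
  major_category = c("Category X", "Category X", "Category Y"),
  stringsAsFactors = FALSE
)

# The column is stored as character, so is.factor() is FALSE
check_character <- is.factor(majors$major_category)
check_character

# For comparison, the same values wrapped in factor() give TRUE
category_as_factor <- factor(majors$major_category)
check_factor <- is.factor(category_as_factor)
check_factor
```

The companion function is.character() answers the opposite question for the same column.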
10. Let's practice!
Now that we've been introduced to qualitative variables in R, let's try working with some examples.