
Understanding your qualitative variables

1. Understanding your qualitative variables

In this lesson, we'll introduce our dataset, look at converting and understanding our qualitative variables, and learn some new dplyr functions.

2. Introduction to the dataset

In the previous exercise, we were working with a dataset called multiple_choice_responses. This is a sample of data from the Kaggle 2017 Data Science survey. Kaggle is an online platform for predictive modeling and analytics competitions. This survey was given to current and aspiring data scientists, analysts, data engineers, and others in the data science field. Kaggle received about 16,000 responses on questions ranging from demographic information to the usefulness of different learning platforms to what languages respondents wanted to use in the coming year. We'll be using this dataset throughout the course.

Just like with numerical variables, our first step when looking at categorical variables should be to get a high-level summary. Instead of numerical summaries, like the mean and the standard deviation, we can look at the number of categories and the name of each. But as we saw when examining our dataset, some of the variables are currently characters, not factors. How can we change this?

3. Converting characters to factors

First, we need to identify which columns are characters. We can use is.character() for this. Next, we can use the function as.factor() to change columns from characters to factors. If we want to do this for all character columns, we can take advantage of dplyr's mutate() and across() functions. Within mutate(), we use across() together with where() to specify that we want to apply something across all columns where a condition is met. In this case, we'll apply as.factor() across all columns where is.character() is true.
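As a minimal sketch of this conversion, the code below uses a small made-up data frame called responses as a stand-in for multiple_choice_responses; the column names and values are hypothetical, not taken from the survey. It assumes dplyr 1.0.0 or later, where across() and where() are available.

```r
library(dplyr)

# Hypothetical toy data standing in for multiple_choice_responses
responses <- data.frame(
  language = c("R", "Python", "R", "SQL"),
  job_title = c("Analyst", "Data Scientist", "Analyst", "Engineer"),
  age = c(25, 31, 42, 28),
  stringsAsFactors = FALSE
)

# Check which columns are currently character vectors
sapply(responses, is.character)

# Convert every character column to a factor; other columns are left alone
responses_as_factors <- responses %>%
  mutate(across(where(is.character), as.factor))

# Confirm the new column types
sapply(responses_as_factors, class)
```

Only the columns that pass the is.character() check are converted, so the numeric age column keeps its type.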

4. Summarizing factors

Once our columns are factors, we want to find out more about each one. We can use two functions: nlevels() and levels(). nlevels() will give us the number of levels of a factor, and levels() will give us their names.
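For illustration, here is how the two functions behave on a single factor. The vector language is a hypothetical example, not a column from the survey data.

```r
# Hypothetical factor with four observations and three distinct categories
language <- factor(c("R", "Python", "R", "SQL"))

# Number of levels
n_language_levels <- nlevels(language)
n_language_levels
# 3

# Names of the levels (sorted alphabetically by default)
language_levels <- levels(language)
language_levels
# "Python" "R" "SQL"
```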

5. Summarizing factors

What if we want to scale up and check the number of levels for all factor columns in a dataset? You’ve probably used dplyr's summarize() before to take summary information, like the mean, of a single column. If we want to apply a summary function, meaning one that returns a single number, to all columns that meet a certain condition, we can use dplyr's summarize() with across() and where(). This works just like across() with mutate(); we first check if the column is a factor and, if it is, get the number of levels.
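A short sketch of this pattern, again on a made-up data frame rather than the real survey data (the names responses and level_counts are assumptions for the example):

```r
library(dplyr)

# Hypothetical toy data: two factor columns and one numeric column
responses <- data.frame(
  language = factor(c("R", "Python", "R", "SQL")),
  job_title = factor(c("Analyst", "Data Scientist", "Analyst", "Engineer")),
  age = c(25, 31, 42, 28)
)

# For every factor column, return its number of levels;
# non-factor columns such as age are dropped from the result
level_counts <- responses %>%
  summarize(across(where(is.factor), nlevels))

level_counts
```

The result is a one-row data frame with one column per factor, each holding that factor's level count.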

6. everything()

When we want to apply a function to all variables, we use the selection helper everything(), which selects every column. For example, let's say we want to select all columns. We would pipe multiple_choice_responses to select(everything()). Or if we want to pivot all of the columns in the dataset longer, we would use pivot_longer(everything()).
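Both uses can be sketched on a small hypothetical data frame. The names_to and values_to arguments of pivot_longer() are optional, and the names "question" and "response" are chosen here only for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for multiple_choice_responses
responses <- data.frame(
  language = c("R", "Python", "R", "SQL"),
  job_title = c("Analyst", "Data Scientist", "Analyst", "Engineer"),
  stringsAsFactors = FALSE
)

# Select all columns (returns the data unchanged)
all_columns <- responses %>%
  select(everything())

# Pivot every column longer: one row per original cell,
# with the column name in "question" and the value in "response"
long_responses <- responses %>%
  pivot_longer(everything(), names_to = "question", values_to = "response")

long_responses
```

With four rows and two columns in the input, the long version has eight rows.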

7. Let's practice!

Time to put this into practice.