
Examining common themed variables

1. Examining common themed variables

In this lesson, we’ll use the tidyverse packages dplyr and tidyr to select and reshape groups of variables that share a common theme.

2. Tidying data

The Kaggle data science survey includes groups of related questions, such as how often people faced different work challenges. In the original dataset, each of these questions is stored as a separate variable. Sometimes we need to rearrange our data to make it easier to work with. Hadley Wickham, who created the R packages we’re using in this course, introduced a concept that can help with this called "tidy data." Data is tidy when each row is an observation and each column is a variable. In this case, we can tidy our data by changing its format from wide to long. Instead of having each work challenge in a separate column, we can have one row per response, with one column listing the type of work challenge and another column holding the response.
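To make the wide-versus-long distinction concrete, here is a small hand-built sketch of the same two responses in both layouts. The respondent column, the Frequency column name, and all values are invented for illustration and are not taken from the survey.

```r
library(tibble)

# Wide layout: one column per work challenge (toy data, not the real survey)
wide <- tibble(
  respondent = c(1, 2),
  WorkChallengeFrequencyPolitics = c("Rarely", "Often"),
  WorkChallengeFrequencyDirtyData = c("Sometimes", "Most of the time")
)

# Long layout: one row per respondent-and-challenge pair
long <- tibble(
  respondent = c(1, 1, 2, 2),
  WorkChallenge = c(
    "WorkChallengeFrequencyPolitics", "WorkChallengeFrequencyDirtyData",
    "WorkChallengeFrequencyPolitics", "WorkChallengeFrequencyDirtyData"
  ),
  Frequency = c("Rarely", "Sometimes", "Often", "Most of the time")
)

wide
long
```

Both tables hold the same four responses. The long version simply moves the challenge names out of the column headers and into a column of their own.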

3. Selecting and pivoting data longer

To do this, let’s first select all the relevant columns. Here, we can take advantage of the fact that all the columns about work challenges have the phrase "WorkChallengeFrequency" in their names. We can use a dplyr helper function, contains(), along with select() to select all the columns whose names contain this phrase. Now we’ll use the tidyr function pivot_longer() to change the dataset from wide to long. The first argument we supply is which columns we want to pivot; because we have already selected only the work challenge columns, we can pass everything(). The next two arguments, names_to and values_to, are the names of the new columns we want, with the old column names going into the first and the old values into the second. Let’s take a look at what our dataset looks like now. We see that the entries in the WorkChallenge column are pretty long. We don’t need each of them to include the phrase “WorkChallengeFrequency”, because that information is already conveyed by the column name. So let’s take that out so each entry just names the specific work challenge.
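The select-then-pivot step might look roughly like the sketch below. The survey data frame here is a made-up stand-in for the real data, and the Frequency column name is an assumption; only the "WorkChallengeFrequency" prefix and the WorkChallenge column name come from the lesson.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the survey data; Country and all values are invented
survey <- tibble::tibble(
  Country = c("US", "India", "Brazil"),
  WorkChallengeFrequencyPolitics = c("Rarely", "Often", NA),
  WorkChallengeFrequencyDirtyData = c("Most of the time", "Sometimes", "Often")
)

work_challenges <- survey %>%
  # Keep only the columns whose names contain the shared phrase
  select(contains("WorkChallengeFrequency")) %>%
  # Stack them: old column names go to WorkChallenge, old values to Frequency
  pivot_longer(everything(), names_to = "WorkChallenge", values_to = "Frequency")

work_challenges
```

With three respondents and two challenge columns, the result has six rows and two columns. Note that everything() is only safe here because select() has already dropped unrelated columns such as Country.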

4. Changing strings

To do this, we can use the stringr function str_remove() with mutate(). str_remove() takes the variable we want to change as the first argument and the part of the string we want to remove as the second. Here we apply str_remove() to the WorkChallenge column and remove the string "WorkChallengeFrequency", so we now get entries like "Politics" instead of "WorkChallengeFrequencyPolitics."
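As a minimal sketch of this step, assuming a long-format data frame with a WorkChallenge column (the rows and the Frequency column are invented for illustration):

```r
library(dplyr)
library(stringr)

# Toy long-format data; values are invented
work_challenges <- tibble::tibble(
  WorkChallenge = c("WorkChallengeFrequencyPolitics", "WorkChallengeFrequencyDirtyData"),
  Frequency = c("Rarely", "Often")
)

cleaned <- work_challenges %>%
  # Overwrite the column with the shared prefix stripped from each entry
  mutate(WorkChallenge = str_remove(WorkChallenge, "WorkChallengeFrequency"))

cleaned
```

str_remove() drops only the first match of the pattern in each string, which is enough here because the phrase appears once per entry.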

5. if_else() and summarizing

Finally, we want to get some summary statistics to compare the frequency of different work challenges. One thing we can do is dichotomize the response variable, that is, collapse it into two categories. Before we do that, though, let's make sure to filter out the NAs! Let’s say we are interested in comparing whether people faced the challenge “only rarely” or “sometimes” versus “often” or “most of the time”. We can use the dplyr function if_else() with mutate() to change our response variable to 0 or 1: if the response is either “Most of the time” or “Often”, make it 1; otherwise, make it 0. Then, we can group by the question and use summarize() to take the mean response for each question. Because the variable is now 0 or 1, this mean is the proportion of people who consider it a frequent work challenge.
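Put together, the filter, recode, and summarize steps could be sketched as follows. The data, the Frequency column, and the new names frequent and prop_frequent are assumptions made for this example.

```r
library(dplyr)

# Toy long-format data with one missing response; values are invented
work_challenges <- tibble::tibble(
  WorkChallenge = c("Politics", "Politics", "Politics",
                    "DirtyData", "DirtyData", "DirtyData"),
  Frequency = c("Often", "Rarely", NA,
                "Most of the time", "Often", "Sometimes")
)

challenge_summary <- work_challenges %>%
  # Drop missing responses before recoding
  filter(!is.na(Frequency)) %>%
  # 1 for a frequent challenge, 0 otherwise
  mutate(frequent = if_else(Frequency %in% c("Most of the time", "Often"), 1, 0)) %>%
  group_by(WorkChallenge) %>%
  # The mean of a 0/1 variable is the proportion of 1s
  summarize(prop_frequent = mean(frequent))

challenge_summary
```

Filtering first matters because NA %in% c(...) evaluates to FALSE, so a missing response would otherwise be silently counted as 0 and pull the proportion down.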

6. Let's practice!

Let's practice these new skills!