Deriving variables from column headers

1. Deriving variables from column headers

So far, the values that were stored in column names did not require any formatting to become a usable variable after pivoting, except maybe for some data type conversion.

2. Soviet space dogs

However, in data that you'll find in the wild, this is often not sufficient. Let's have a look at this data sample with launch dates of early Soviet space rockets. These rockets had up to two dogs on board, the dataset has their names in two columns and their fate. If we want to have one row for each dog we could use the pivot_longer() function.

3. Soviet space dogs: a basic pivot operation

Using the arguments we saw in the previous lesson we can set the names of the new columns and remove missing values in one go. But what if we want to extract the numbers hidden in the id column as an integer?

4. Soviet space dogs: removing a prefix

The first thing we could do, is specifying the names_prefix argument. When we set it to "name_", this string will be automatically removed from the start of each column name as we move them into the id column. At this point, we have created strings that are interpretable as integers

5. Soviet space dogs: transforming data types

and all we have to do is apply the data type conversion with the names_transform argument. Since the names of the columns we want to pivot all start with "name_"

6. Soviet space dogs: the starts_with() function

we could point to these columns with the dplyr starts_with() function. This function will then pass all column names that start with this string to the first argument of pivot_longer(). While it's impressive what you can do with a single call to the pivot_longer() function, all we've done really is extracting a single variable from the column names while pivoting.

7. Apple revenue: two variables per column name

What if we make things more challenging? Take a look at this sample of Apple revenue per product segment and quarter for the year 2019. Both the year and quarter variables are stored in the column headers.

8. Apple revenue: visualizing issue and solution

We could visualize the situation with the messy column headers like this. To make all data easy to access we want to get to the format on the right.

9. Apple revenue: Advanced pivoting

Again, we can achieve this goal with a single call to the pivot_longer() function. Since there are two variables in the column headers we pass a vector with two new column names to the names_to argument, "year" and "quarter". We also specify the names_sep argument, and pass it the string that separates years from quarters. The other arguments should look familiar by now. Note that we apply the data type conversion to integers on both year and quarter data.

10. Let's practice!

Now it's your turn, let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.