1. From wide to long data
In the first chapter, we saw how we can separate a messy string column into either columns or rows.
2. Chapter 1 recap
We did so by using the separate() and separate_rows() functions. Their effects can be visualized like this.
We used these functions to get our data into a tidy format, where every variable has its own column, every observation its own row, and each cell just a single value.
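As a quick refresher, here is a minimal sketch of both functions. The `messy` data frame and its column names are made up for illustration and are not the data from the first chapter.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two messy string columns
messy <- tibble(
  id   = c(1, 2),
  pair = c("a-b", "c-d"),
  tags = c("x, y", "z")
)

# separate() splits one string column into several columns
separated <- separate(messy, pair, into = c("first", "second"), sep = "-")

# separate_rows() splits one string column over several rows
long <- separate_rows(messy, tags, sep = ", ")

print(separated)
print(long)
```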
But aside from messy string columns, there are plenty of other ways for data to be messy.
3. Values in column headers
Let's have a look at this example. The data represents the number of nuclear bombs dropped per country in the early atomic era.
There are several problems with this data. The first is that the column headers are years. Year is a variable and should be stored in its own column. Second, the dataset itself doesn't tell us what the values represent. I told you it's the number of bombs, but there is no column header telling you so.
4. Values in column headers
We can visualize the issue like this. Note that each color is a variable and that variable names are shown in grey.
This is where we want to get to: a longer data frame, with appropriate column headers and a single variable per column.
5. The pivot_longer() function
We can achieve this goal with the pivot_longer() function.
We've passed it a single argument: a range of columns to pivot into a longer format. A range is created by specifying the leftmost column followed by a colon and then the rightmost column.
Note that we used backticks around the years. This is because syntactically valid names in R can't start with a number. However, when you wrap such a name in backticks, you can still use it.
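A minimal sketch of the range selection described above. The `bombs_wide` data frame, its country labels, and its values are made up for illustration and are not the dataset shown on the slides.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values): one column per year
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

# Pivot every column from `1945` up to and including `1947`;
# the backticks are needed because the names start with a digit
bombs_long <- pivot_longer(bombs_wide, `1945`:`1947`)

print(bombs_long)
```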
6. The pivot_longer() function
An alternative way to specify which columns to pivot is to pass them all explicitly as a vector. This is the safest option, since you have full control and the result does not depend on the positions of the columns, as it does with the range option.
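To see why the explicit vector is safer, here is a small sketch that shuffles the columns first. The `bombs_wide` data frame and its values are made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values)
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

# Same data, but with the columns in a different order
shuffled <- bombs_wide[, c("1946", "country", "1945", "1947")]

# The explicit vector still pivots all three year columns
by_vector <- pivot_longer(shuffled, c(`1945`, `1946`, `1947`))

# The range now only covers `1945` and `1947`, so `1946` is left unpivoted
by_range <- pivot_longer(shuffled, `1945`:`1947`)

print(nrow(by_vector))
print(nrow(by_range))
```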
7. The pivot_longer() function
If you don't like typing much, you can specify just the columns not to pivot with a minus sign.
Looking at our output, you'll notice that the two new columns have been given the names name and value. These are just defaults, which you can easily override.
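A minimal sketch of the minus-sign selection, again on a made-up `bombs_wide` data frame with hypothetical values.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values)
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

# Pivot everything except the country column
bombs_long <- pivot_longer(bombs_wide, -country)

# The new columns get the default names "name" and "value"
print(names(bombs_long))
```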
8. pivot_longer() arguments
You can do so with the names_to and values_to arguments. Names refers to the original column names, which were years in this case, while values refers to the cell contents, which were the numbers of bombs.
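As a sketch, this is how the two arguments could be used. The data frame and the column names `year` and `n_bombs` are choices made for this illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values)
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

bombs_long <- pivot_longer(
  bombs_wide,
  -country,
  names_to  = "year",    # old column headers end up here
  values_to = "n_bombs"  # old cell values end up here
)

print(bombs_long)
```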
9. pivot_longer() arguments
We can tweak the output a little bit more. If we don't want any missing values in our number of bombs column, we can set values_drop_na to TRUE.
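A small sketch of dropping the missing values, using a made-up data frame where country "B" has no value for two of the years.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values) with two missing cells
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

bombs_long <- pivot_longer(
  bombs_wide,
  -country,
  names_to       = "year",
  values_to      = "n_bombs",
  values_drop_na = TRUE  # rows where n_bombs would be NA are removed
)

print(bombs_long)
```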
At this point, there is just one last detail that we can improve. When the year values were moved from the column headers to their own column, tidyr assumed they were of the character data type.
10. pivot_longer() arguments
We can specify the correct data type to use with the names_transform argument. When we pass it a list where year equals as.integer, you'll notice that the output now gets the data type correct.
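Putting it together, a sketch of the type conversion on a made-up data frame. The column names `year` and `n_bombs` are choices made for this illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data (made-up values)
bombs_wide <- tibble(
  country = c("A", "B"),
  `1945`  = c(3, NA),
  `1946`  = c(2, NA),
  `1947`  = c(0, 1)
)

bombs_long <- pivot_longer(
  bombs_wide,
  -country,
  names_to        = "year",
  values_to       = "n_bombs",
  values_drop_na  = TRUE,
  names_transform = list(year = as.integer)  # convert "1945" to 1945L
)

# The year column is now an integer rather than a character column
print(class(bombs_long$year))
```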
11. Let's practice!
Now it's your turn to try out the pivot_longer() function!