Pivoting data

1. Pivoting data

Now let's talk about pivot tables!

2. Pivot tables

Pivot tables provide another way how to summarize data. They are a statistical tool used to summarize and reorganize selected rows and columns of a dataset. In this video, we will use pivot tables to compare weights of female and male penguins of different species. Let's see what how to do that with the pivot tables!

3. Pivot table using unstack()

In Julia, we can create pivot tables using the unstack function. It takes the DataFrame as its first argument, followed by a column or columns which contains data we want to use as indices or names of the rows, a column which values will be used for the different columns, and a column that will provide values to fill the table. If there are multiple values for the row-column pair, then we need to specify how we want to aggregate them. That's where the combine keyword comes in. Combine-equals-mean will average the values and so on. So how does it look in practice?

4. unstack() to pivot

Let's look at the penguins dataset. For rows, it makes sense to use the species column. Let's investigate if there are differences between sexes when we look at their weight. We then call unstack and pass penguins, the species, sex, and body mass columns. As there is more than one value for every species-sex pair, we must include combine-equals and the function we want to aggregate with. Let's go with the median. The result is a DataFrame like so. Males penguins tend to be heavier than females.

5. Pivoting on multiple columns as rows

But wait, is there a difference between the different islands? To find out, we can pass both species and island columns, in a vector, as the row selectors, leaving the rest the same. The resulting DataFrame has three rows for Adelie penguins, each for a different island.

6. Pivoting on multiple columns elsewhere

Now, you might wonder if it is possible to use multiple columns for the columns or the values. At the moment, Julia and unstack don't allow it.

7. Missing values

What happens if a row-column pair does not have a corresponding value attached to it in our dataset? unstack automatically fills it with the missing value. In the penguins-missing DataFrame, some data was lost, and thus we are missing data for female Gentoo penguins. In the pivot table, the corresponding value reads missing.

8. Replacing missing values

If we want to replace it with something else, we can use the fill keyword and provide a new value to be used. Here, we replaced the missing values with minus one.

9. Pivot tables are DataFrames

We can treat the resulting pivot table as a normal DataFrame. We can save it to a variable, sort it, select only certain columns, etc.

10. Let's practice!

Are you ready to pivot to new highs? Let's practice in the exercises!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.