Nesting data for modeling

1. Nesting data for modeling

Welcome to the final lesson of this course! In this chapter, we've seen different ways of turning nested data into rectangular data, which is our preferred format for analysis. However, sometimes nested data and the rectangular format can go hand in hand and make your analysis more elegant.

2. USA Olympic performance

Let's illustrate this with an example. You're looking at a data sample with the number of participants and the number of medals won by the USA at the Summer and Winter Olympics.

3. USA Olympic performance

When we plot this data, we see a positive correlation: the USA wins more medals when more USA athletes participate.

4. Modeling the pattern

Let's say we want to model this pattern. We can fit a linear model with the lm() function, specifying that we want to explain the variance in n_medals based on n_participants, with the intercept fixed at zero, since we know that you'll win zero medals if you don't participate. Don't worry if you're new to modeling; we're just interested in the model output, which looks like this. It found a coefficient of 0.463, meaning that for every participant, the USA wins almost half a medal on average. Note that this output is not in a tidy format.
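As a minimal sketch of the zero-intercept model described above, the snippet below fits lm() on a small data frame. The name usa_df and all of its values are made up for illustration; they are not the course's Olympic data, so the coefficient will not match 0.463.

```r
# Made-up numbers for illustration only; not the course's Olympic data.
usa_df <- data.frame(
  n_participants = c(100, 200, 300, 400, 500),
  n_medals       = c(45, 95, 140, 190, 230)
)

# "+ 0" in the formula removes the intercept, so the fitted line
# is forced through the origin (zero participants, zero medals).
model <- lm(n_medals ~ n_participants + 0, data = usa_df)

# The model has a single coefficient: the slope for n_participants.
coef(model)
```

Printing model directly shows the same coefficient inside the untidy console output that the lesson refers to.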

5. Untidy model statistics

It gets worse when we call the summary() function on the fitted model. We now get a confusing overview with a lot of text mixed with numbers.

6. The broom package

This problem can be solved using the broom package. Its mission is to turn messy outputs of built-in R functions into tidy tibbles. When we use its glance() function on the model, we get a tibble with the model performance statistics. When we use its tidy() function, we get an overview of the coefficients estimated by the model. It's just one row in this case, and we recognize our earlier result of 0.463.
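Here is a small sketch of both broom functions applied to a single model. It assumes the broom package is installed and reuses the hypothetical usa_df toy data, so the estimates are illustrative only.

```r
library(broom)

# Made-up numbers for illustration only; not the course's Olympic data.
usa_df <- data.frame(
  n_participants = c(100, 200, 300, 400, 500),
  n_medals       = c(45, 95, 140, 190, 230)
)

model <- lm(n_medals ~ n_participants + 0, data = usa_df)

# glance(): one-row tibble of model-level statistics (r.squared, sigma, AIC, ...).
model_stats <- glance(model)

# tidy(): one row per estimated coefficient (term, estimate, std.error, ...).
model_coefs <- tidy(model)

model_stats
model_coefs
```

Because the model has no intercept, model_coefs contains a single row for n_participants.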

7. broom + dplyr + tidyr

Now let's see how we can plug this trick into a pipeline to train multiple models using nested data. tidyr's nest() function nests the rows of a data frame for each value of a grouping variable, country in this case. The result is a list column, data, that holds one tibble per group.
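The nesting step could look roughly like this. The tibble olympics_df, its two countries, and its values are invented for the example; only the column names n_participants, n_medals, and country come from the lesson.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are invented for illustration.
olympics_df <- tibble(
  country        = c("USA", "USA", "USA", "FRA", "FRA", "FRA"),
  n_participants = c(100, 200, 300, 50, 100, 150),
  n_medals       = c(45, 95, 140, 10, 22, 31)
)

# group_by() marks the grouping variable; nest() then collapses the
# remaining columns into one tibble per group, stored in a list column "data".
nested_df <- olympics_df %>%
  group_by(country) %>%
  nest()

nested_df
```

Each element of nested_df$data is itself a tibble holding the n_participants and n_medals rows for one country.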

8. Nested tibble & purrr::map()

We can now use the map() function from the purrr package to apply a function to each tibble individually. We tell map() to iterate over the data column and to apply a function that fits a linear model on each nested tibble. The result is that we now have a new list column named fit with the fitted linear model.
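A sketch of the model-fitting step, again on invented toy data. The helper name fit_medal_model is hypothetical; the lesson only says that a function fitting a linear model is passed to map().

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data; values are invented for illustration.
olympics_df <- tibble(
  country        = c("USA", "USA", "USA", "FRA", "FRA", "FRA"),
  n_participants = c(100, 200, 300, 50, 100, 150),
  n_medals       = c(45, 95, 140, 10, 22, 31)
)

# Hypothetical helper: fits the zero-intercept model on one nested tibble.
fit_medal_model <- function(df) {
  lm(n_medals ~ n_participants + 0, data = df)
}

fitted_df <- olympics_df %>%
  group_by(country) %>%
  nest() %>%
  # map() returns a list, so "fit" becomes a list column of lm objects.
  mutate(fit = map(data, fit_medal_model))

fitted_df
```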

9. Working with nested tibbles

Using purrr's map() function once more, we can apply broom's glance() function to each fitted model. This output is also added as a list column, which we named glanced.

10. Unnesting model results

The magic happens when we now unnest our model results using tidyr's unnest() function on the glanced column. We get a tidy overview of the model metrics.
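Putting the glance and unnest steps together, a pipeline along these lines should produce one row of model metrics per country. The data are invented, and the column names fit and glanced follow the lesson's naming.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical toy data; values are invented for illustration.
olympics_df <- tibble(
  country        = c("USA", "USA", "USA", "FRA", "FRA", "FRA"),
  n_participants = c(100, 200, 300, 50, 100, 150),
  n_medals       = c(45, 95, 140, 10, 22, 31)
)

metrics_df <- olympics_df %>%
  group_by(country) %>%
  nest() %>%
  mutate(
    fit     = map(data, function(df) lm(n_medals ~ n_participants + 0, data = df)),
    glanced = map(fit, glance)   # list column of one-row tibbles
  ) %>%
  # unnest() expands each one-row tibble into regular columns.
  unnest(glanced)

metrics_df
```

After unnesting, statistics such as r.squared sit next to country as ordinary columns, while data and fit remain list columns.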

11. Unnesting model results

We can do the same thing for broom's tidy() function if we want to inspect the estimated coefficients and more. Let's review what we're doing here. We specify a variable to group by, nest the groups, apply different functions on the nested data using the map() function, and then unnest the results.

12. Multiple model pipeline

If we now want to create more than one model, all we have to do is add a grouping variable at the start. Here, we've added the season variable to create models for both the winter and summer Olympics. The output remains tidy and it's easy to compare models.
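The sketch below adds season as a second grouping variable and uses tidy() instead of glance(), so we get one coefficient row per country and season. All values are invented, and the final ungroup() and select() calls are just one way to trim the output, not necessarily what the course does.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical toy data; values are invented for illustration.
olympics_df <- tibble(
  country        = rep(c("USA", "FRA"), each = 6),
  season         = rep(rep(c("Summer", "Winter"), each = 3), times = 2),
  n_participants = c(300, 400, 500, 60, 80, 100, 150, 200, 250, 30, 40, 50),
  n_medals       = c(140, 190, 230, 12, 17, 20, 31, 40, 52, 4, 6, 7)
)

coef_df <- olympics_df %>%
  group_by(country, season) %>%   # the extra grouping variable
  nest() %>%
  mutate(
    fit    = map(data, function(df) lm(n_medals ~ n_participants + 0, data = df)),
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  ungroup() %>%
  # Keep only the columns needed to compare the models.
  select(country, season, term, estimate, std.error)

coef_df
```

The rest of the pipeline is unchanged; only the group_by() call differs from the single-variable version.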

13. Let's practice!

Now it's your turn for the final exercises. Let's practice!