1. Foundations of Tidy Machine Learning
Hi, my name is Dima, and I am excited to welcome you to the Machine Learning in the Tidyverse course.
If you're here then you must already know how easy it is to explore, manipulate and analyze your data with tools from the tidyverse.
The good news is that the tidyverse tools also work exceptionally well for building machine learning models.
2. The Core of Tidy Machine Learning
The reason for this is that the tidyverse tools center around the data frame structure known as a tibble.
What makes a tibble special for machine learning is that it can natively store arbitrarily complex objects using a special type of column known as a list column.
This is particularly helpful for storing models, since the output of a model is typically a complex object rather than a single value.
With tibbles you can store models in these list columns and, as a result, explore and evaluate them with the rest of the suite of tidy tools.
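To make the idea of a list column concrete, here is a minimal sketch. It is not from the course slides: it assumes the built-in mtcars data and two simple lm() fits, chosen only for illustration, and the names fit_wt, fit_hp and models are hypothetical.

```r
library(tibble)

# Two small linear models fitted on the built-in mtcars data
# (an assumption for illustration; the course itself uses gapminder)
fit_wt <- lm(mpg ~ wt, data = mtcars)
fit_hp <- lm(mpg ~ hp, data = mtcars)

# A list column is a regular tibble column whose type is list,
# so each cell can hold an arbitrary R object such as a fitted model
models <- tibble(
  predictor = c("wt", "hp"),
  model     = list(fit_wt, fit_hp)
)

models

class(models$model)       # the column itself is a list
class(models$model[[1]])  # each element is a complete lm object
```

Because each model sits in its own row next to the label that describes it, you can keep models and their metadata together in one data frame.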
3. The Core of Tidy Machine Learning
Along with the tibble, the functions in the tidyr and purrr packages form the foundational tools for working with list columns. You will use these tools as part of a framework called the List Column Workflow.
4. List Column Workflow
At its core, this workflow can be summed up in three basic steps.
The first step is to make a list column.
The second step involves using appropriate tools to work with the list column.
And the third and final step is to simplify the list columns into a format that allows further exploration using the familiar tidyverse tools.
These three steps rely on the map family of functions from purrr and the nest and unnest functions from tidyr.
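The three steps above can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the tibble toy, its columns group and x, and the choice of mean() as the work done in step two are all hypothetical and not taken from the course.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two groups with three values each
toy <- tibble(
  group = rep(c("a", "b"), each = 3),
  x     = c(1, 2, 3, 10, 20, 30)
)

# Step 1: make a list column by nesting the rows of each group
toy_nested <- toy %>%
  group_by(group) %>%
  nest() %>%
  ungroup()

# Step 2: work with the list column using map(),
# here computing the mean of x inside each nested data frame
toy_means <- toy_nested %>%
  mutate(mean_x = map(data, ~ mean(.x$x)))

# Step 3: simplify the new list column back into a regular column
toy_result <- toy_means %>%
  unnest(mean_x)

toy_result
```

In step two, map() returns a list, so mean_x starts out as a list column; unnest() in step three turns it into an ordinary numeric column.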
To learn how to use the list column workflow, you will work with the gapminder dataset.
5. The Gapminder Dataset
Unlike previous courses that have used the gapminder package, this course will use a more granular collection of gapminder data adapted from the dslabs package.
This version contains observations for 77 countries across a time period of 52 years. Each observation has six informational elements associated with it. We will refer to these elements as the features of the observations.
6. List Column Workflow
In this video and the exercises that follow it you will learn how to use the nest and unnest functions to manipulate the gapminder data.
7. Step 1: Make a List Column - Nest Your Data
Here is an excerpt of the gapminder data colored by country.
8. Step 1: Make a List Column - Nest Your Data
The process of nesting compacts the chunk of data for each country into a corresponding entry in the new nested data frame. This is accomplished by the nest function.
9. Nesting By Country
To nest the gapminder data by country, you first use group_by() to group the data by country, then use nest() to create one nested data frame for each country.
This process creates a new list column named data. Each element in this column contains the subsetted data frame for the corresponding country.
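Here is a rough sketch of nesting by country. Since the course's version of the gapminder data is not reproduced here, the code builds a small stand-in called gap_toy; that name, the population column, and every number in it are made-up placeholders, not real gapminder values.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the gapminder data; all values are placeholders
gap_toy <- tibble(
  country    = rep(c("Algeria", "Argentina", "Australia", "Austria"), each = 2),
  year       = rep(c(1960, 1961), times = 4),
  population = c(100, 110, 200, 210, 300, 310, 400, 410)
)

# Group by country, then nest: the remaining columns of each group
# are packed into a data frame stored in the new list column `data`
gap_nested <- gap_toy %>%
  group_by(country) %>%
  nest()

gap_nested
```

The result has one row per country, and the data column holds a two-row tibble with the year and population columns for that country.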
10. Viewing a Nested Tibble
Because the data column in the nested data frame is a list column, you can access its elements directly. This can be very helpful for exploring the data and prototyping your approach.
11. Viewing a Nested Tibble
For example, you can view the fourth list entry, the data for Austria, by specifying the data column and extracting that element with double brackets.
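As a sketch of that extraction, the code below rebuilds a small hypothetical stand-in for the nested data (gap_toy, with placeholder values and an assumed population column) so that it runs on its own. Austria is the fourth entry here only because the toy data was built that way.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the gapminder data; all values are placeholders
gap_toy <- tibble(
  country    = rep(c("Algeria", "Argentina", "Australia", "Austria"), each = 2),
  year       = rep(c(1960, 1961), times = 4),
  population = c(100, 110, 200, 210, 300, 310, 400, 410)
)

gap_nested <- gap_toy %>%
  group_by(country) %>%
  nest()

# `$data` selects the list column; `[[4]]` pulls out its fourth element,
# which is the nested data frame for the fourth country
austria <- gap_nested$data[[4]]

gap_nested$country[[4]]  # confirms which country the fourth entry belongs to
austria
```

Note that single brackets, as in gap_nested$data[4], would return a list of length one rather than the data frame itself.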
12. Step 3: Simplify List Columns - unnest()
For the third step of the list column workflow, you need to simplify list columns.
If the list column contains data frames, like in this example, you can simplify it using the unnest() function.
13. Step 3: Simplify List Columns - unnest()
In this example, you can see how the nested data frames were simplified into a data frame with regular columns.
Here the column to unnest is specified in the call to unnest(). If you pipe the nested data frame into unnest(), it is the first argument you type inside the parentheses.
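A minimal sketch of this simplification step follows. It again uses a hypothetical stand-in for the nested gapminder data (gap_toy, with placeholder values and an assumed population column), and it assumes the pipe style used throughout the tidyverse.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the gapminder data; all values are placeholders
gap_toy <- tibble(
  country    = rep(c("Algeria", "Argentina", "Australia", "Austria"), each = 2),
  year       = rep(c(1960, 1961), times = 4),
  population = c(100, 110, 200, 210, 300, 310, 400, 410)
)

gap_nested <- gap_toy %>%
  group_by(country) %>%
  nest()

# unnest() expands the data frames stored in the `data` list column
# back into regular rows and columns
gap_unnested <- gap_nested %>%
  unnest(data)

gap_unnested
```

After unnesting, you are back to one row per country and year, with the same columns the toy data had before it was nested.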
14. Let's Get Started!
Now it's your turn to practice using these tools.