The tidymodels ecosystem

1. The tidymodels ecosystem

Hi, my name is David Svancer. I am a data scientist and adjunct professor of Business Analytics at George Mason University. This course will introduce you to tidymodels, a powerful R package for machine learning.

2. Collection of machine learning packages

Tidymodels is a collection of R packages designed to support machine learning model development.

3. Collection of machine learning packages

The rsample package supports data resampling, and is used for creating random subsets of a dataset for different activities in the modeling process.

4. Collection of machine learning packages

The recipes package contains functions for transforming data for modeling. This step is often called feature engineering.

5. Collection of machine learning packages

The parsnip package is an interface to the vast modeling libraries available in R. It is used for specifying and fitting models as well as obtaining model predictions.

6. Collection of machine learning packages

The tune and dials packages provide functionality for fine-tuning models in order to achieve optimal prediction accuracy.

7. Collection of machine learning packages

The yardstick package provides metrics for evaluating the quality of model predictions. Tidymodels was designed to easily iterate over model fitting, tuning, and evaluation, all with a unified R syntax!

8. Supervised machine learning

Tidymodels is primarily used for supervised machine learning, where algorithms learn patterns from labeled data. There are two types of supervised machine learning. Regression deals with predicting quantitative outcomes, such as home selling prices. Classification deals with predicting categorical outcomes, such as whether an employee will leave a company. As an example of a classification task, consider an employee dataset in which each row represents an employee and each column is a characteristic of that employee. The left_company column provides the label, or true outcome, for each row and is known as the outcome variable in tidymodels. All other variables are assigned the role of predictor variables.

9. Data resampling

The first step in modeling is to randomly split the original data into training and test datasets. This guards against being misled by overfitting, where a model memorizes the patterns in the data it was trained on and then performs poorly on new data. Commonly, 75% of the data is allocated to training and 25% to test. The training data is used for feature engineering and model fitting, while the test data is held back to estimate the model's performance on previously unseen data.
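To make the 75/25 idea concrete before any tidymodels functions are involved, here is a minimal base R sketch. The toy_data data frame, its columns, and the seed are hypothetical and exist only for illustration; the course itself performs this step with tidymodels.

```r
set.seed(42)  # hypothetical seed, used only so the split is reproducible

# Hypothetical toy data: 100 rows with one predictor (x) and one outcome (y)
toy_data <- data.frame(x = rnorm(100), y = rnorm(100))

# Randomly choose 75% of the row indices for the training set
train_rows <- sample(nrow(toy_data), size = floor(0.75 * nrow(toy_data)))

toy_training <- toy_data[train_rows, ]   # rows used to fit the model
toy_test     <- toy_data[-train_rows, ]  # remaining rows, held back for evaluation

nrow(toy_training)  # 75
nrow(toy_test)      # 25
```

No row ends up in both sets, so the test rows stay unseen until the model is evaluated.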

10. Fuel efficiency data

We will be using the mpg dataset to demonstrate regression modeling with tidymodels. It contains fuel efficiency data for over 200 popular cars. The outcome variable is the hwy column, which represents the average highway miles per gallon of each car.
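A quick way to get familiar with the data is to inspect it directly. The sketch below assumes the mpg dataset is the one that ships with the ggplot2 package, which is attached when tidymodels is loaded.

```r
library(ggplot2)  # assumption: mpg is the dataset bundled with ggplot2

# Number of rows (cars) and columns (characteristics)
dim(mpg)

# Column names; hwy is the outcome variable used in this course
names(mpg)

# Distribution of the outcome variable, highway miles per gallon
summary(mpg$hwy)
```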

11. Data resampling with tidymodels

To begin the modeling process, we load the tidymodels package and create a data split object, mpg_split, with the initial_split() function. A data split object specifies instructions for creating training and test datasets. The initial_split() function takes a dataset as the first argument, the proportion to allocate to training as the second (prop), and a stratification variable (strata). The outcome variable is used for stratification so that its values have a similar distribution in both datasets. This prevents fitting a model to data that is different from the typical data it will be given in the future. By passing mpg_split to the training() function, we create the mpg_training dataset that we'll use to train our model, which contains a random 75% of the data. Passing mpg_split to the testing() function creates mpg_test, which we'll use to evaluate our model's performance.
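The steps just described can be sketched as follows. The object names mpg_split, mpg_training, and mpg_test come from the narration; the seed value is an assumption added only so the random split is reproducible, and mpg is assumed to be the dataset bundled with ggplot2.

```r
library(tidymodels)  # attaches rsample, ggplot2 (which provides mpg), and others

set.seed(314)  # assumed seed; any value works, it only fixes the random split

# Instructions for a 75/25 split, stratified by the outcome variable hwy
mpg_split <- initial_split(mpg, prop = 0.75, strata = hwy)

# Build the two datasets from the split object
mpg_training <- training(mpg_split)
mpg_test     <- testing(mpg_split)

# Check the resulting sizes
nrow(mpg_training)
nrow(mpg_test)
```

Because the split is stratified, the training share may not be exactly 75% of the rows, but it should be close.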

12. Home sales data

In the chapter exercises, you will be working with the home_sales data, which contains information on homes sold in the Seattle, Washington area between 2015 and 2016. The outcome variable is the selling_price column.

13. Let's practice!

Let's practice using tidymodels to create training and test datasets!