Model Building and Evaluation with tidymodels

1. Model Building and Evaluation with tidymodels

Welcome back. To this point, we've used recipe objects in tidymodels. Since a main objective of supervised dimensionality reduction is to improve model performance, let's learn how to use workflows to combine recipe and model objects.

2. Model fitting process

To begin, let's review the general model building process. First, we split out training and testing sets so we can evaluate the trained model on new, never-before-seen data.

3. Model fitting process

Then we preprocess or prepare the train and test data.

4. Model fitting process

Then we fit the model to the prepared training data.

5. Model fitting process

Lastly, we evaluate the trained model on the testing data and, if needed, refit the model to improve its performance.

6. Model fitting with tidymodels

The tidymodels package accommodates this process. It provides functions to split the data. It also has workflows, which enable us to bundle a recipe and a model together in a piped fashion.

7. Model fitting with tidymodels

As we've seen, a recipe is a sequence of preprocessing steps to prepare the data for modeling. We've already used steps like step_zv(), step_nzv() and step_corr(). There are many others.

8. Model fitting with tidymodels

A model can also be added to the workflow. The model object abstracts different model implementations into a single, uniform interface, which makes it easy to swap different kinds of models, such as linear regression, logistic regression, and decision trees, into the workflow. As we'll see, tidymodels provides the functions to evaluate model performance as well.
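To make the idea of swapping models concrete, here is a minimal sketch. The data frame toy_df and its columns x1, x2, and y are hypothetical and exist only for illustration, and the sketch assumes the rpart package is installed for the decision tree engine.

```r
library(tidymodels)

# Hypothetical toy data; the column names are illustrative only
set.seed(42)
n <- 400
toy_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy_df$y <- factor(ifelse(toy_df$x1 - toy_df$x2 + rnorm(n) > 0, "yes", "no"))

toy_recipe <- recipe(y ~ ., data = toy_df) %>%
  step_scale(all_numeric_predictors())

# Two model specs that share the same interface
logit_spec <- logistic_reg() %>%
  set_engine("glm")
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

logit_wf <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(logit_spec)

# Swap the model while keeping the recipe unchanged
tree_wf <- logit_wf %>%
  update_model(tree_spec)

logit_fit <- fit(logit_wf, data = toy_df)
tree_fit <- fit(tree_wf, data = toy_df)

# Both fitted workflows return predictions in the same format
logit_pred <- predict(logit_fit, new_data = toy_df)
tree_pred <- predict(tree_fit, new_data = toy_df)
```

Only the model spec changes between the two workflows; the recipe and the fit() and predict() calls stay the same.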

9. Splitting out train and test sets

Now, let's implement the model building process with tidymodels code. The initial_split() function sets up the parameters for creating the training and testing sets. The prop argument sets the proportion of the data that goes into the training set. We set strata to the target variable, which ensures that the target values appear in roughly the same proportions in the train and test sets. We pipe split to training() and testing() to extract the training and testing sets, respectively.
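The split described here might look like the sketch below. The course's credit data is not reproduced in this transcript, so credit_df is a simulated stand-in with hypothetical predictors (income, monthly_balance), and prop = 0.8 is an assumed value.

```r
library(tidymodels)

# Simulated stand-in for the course's credit data (hypothetical columns)
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5)
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))

# Set up the split: 80% training, stratified on the target
split <- initial_split(credit_df, prop = 0.8, strata = credit_score)

# Extract the two sets
train <- split %>% training()
test <- split %>% testing()

# Class proportions should be similar in both sets
train_props <- prop.table(table(train$credit_score))
test_props <- prop.table(table(test$credit_score))
train_props
test_props
```

Because of stratification, the share of each credit_score level in train and test should be close to its share in the full data.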

10. Creating a recipe and a model

Then we create a recipe object by specifying the model formula and setting data to train. As we've seen, we add different preprocessing steps to the recipe. Here we remove features with fifty percent or more missing values, scale the numeric predictors before removing low-variance features, and prepare the recipe to estimate the parameters using the training set. Then we create the model specification, or model spec. Think of a model spec as an untrained model. In this case, we initialize a logistic regression model and call set_engine("glm") to estimate the logistic regression parameters with the glm() function from R's built-in stats package.
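A sketch of the recipe and model spec described above, again on simulated data. The columns, the mostly missing age feature, and threshold = 0.5 are assumptions made for illustration, not the course's actual code.

```r
library(tidymodels)

# Simulated stand-in for the credit data; age is about 70% missing
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5),
  age = ifelse(runif(n) < 0.7, NA_real_, rnorm(n, mean = 40, sd = 10))
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))
train <- training(initial_split(credit_df, prop = 0.8, strata = credit_score))

# Recipe: drop mostly missing features, scale, then drop low-variance features
credit_recipe <- recipe(credit_score ~ ., data = train) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

# prep() estimates the step parameters from the training set
credit_prepped <- prep(credit_recipe, training = train)
baked_train <- bake(credit_prepped, new_data = NULL)
names(baked_train)

# Model spec: an untrained logistic regression that will use stats::glm()
credit_model <- logistic_reg() %>%
  set_engine("glm")
```

Calling bake() with new_data = NULL returns the preprocessed training set, which is a quick way to check what the recipe did.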

11. Create and fit the workflow

With the recipe and model objects created, we create a workflow object and add the recipe and model spec objects to it using add_recipe() and add_model(). Then we call fit() on the workflow to fit the model using the training data.
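As a sketch of this step, the code below bundles a recipe and a model spec into a workflow and fits it. The data is simulated with hypothetical columns. The recipe is added unprepped here, since fit() on a workflow estimates the recipe steps from the training data before fitting the model.

```r
library(tidymodels)

# Simulated stand-in for the credit data; age is about 70% missing
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5),
  age = ifelse(runif(n) < 0.7, NA_real_, rnorm(n, mean = 40, sd = 10))
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))
train <- training(initial_split(credit_df, prop = 0.8, strata = credit_score))

credit_recipe <- recipe(credit_score ~ ., data = train) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

credit_model <- logistic_reg() %>%
  set_engine("glm")

# Bundle the recipe and the model spec into one workflow
credit_wf <- workflow() %>%
  add_recipe(credit_recipe) %>%
  add_model(credit_model)

# Fit the preprocessing steps and the model on the training data
credit_fit <- credit_wf %>%
  fit(data = train)

train_pred <- predict(credit_fit, new_data = train)
```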

12. Evaluate the model

To evaluate the model performance, we bind the actual and predicted values together into credit_pred_df. We pass predict() the fitted workflow object and the testing set to get the predicted values. We select the target variable, credit_score, from the test set to get the actual values. We pass credit_pred_df to f_meas() with the column names of the actual and predicted values to produce the F1 measure. And that completes the model building process with tidymodels.
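The evaluation step could be sketched as follows. The data is simulated and the predictor names are hypothetical, so the resulting F1 value says nothing about the course's actual model.

```r
library(tidymodels)

# Simulated stand-in for the credit data (hypothetical columns)
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5)
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))
split <- initial_split(credit_df, prop = 0.8, strata = credit_score)
train <- training(split)
test <- testing(split)

credit_fit <- workflow() %>%
  add_recipe(
    recipe(credit_score ~ ., data = train) %>%
      step_scale(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = train)

# Bind predicted classes to the actual values from the test set
credit_pred_df <- predict(credit_fit, new_data = test) %>%
  bind_cols(test %>% select(credit_score))

# F1 measure: truth is the actual column, estimate is the predicted column
f1 <- f_meas(credit_pred_df, truth = credit_score, estimate = .pred_class)
f1
```

By default, f_meas() treats the first factor level as the event of interest, which is "good" in this simulated data.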

13. Explore the recipe with tidy()

Finally, the tidy() function can extract information from both the recipe and the model. To explore the effects of a recipe step, we pass tidy() the prepared recipe object and the step number. The first step, step_filter_missing(), removed the age and outstanding_debt features because they had too many missing values.
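Here is a sketch of exploring a prepared recipe with tidy(). The data is simulated so that age and outstanding_debt are mostly missing, mirroring the result described above. The other columns and threshold = 0.5 are assumptions, and the missing-value filter is the recipes function step_filter_missing().

```r
library(tidymodels)

# Simulated data in which age and outstanding_debt are about 70% missing
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5),
  age = ifelse(runif(n) < 0.7, NA_real_, rnorm(n, mean = 40, sd = 10)),
  outstanding_debt = ifelse(runif(n) < 0.7, NA_real_, rnorm(n, mean = 1000, sd = 200))
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))
train <- training(initial_split(credit_df, prop = 0.8, strata = credit_score))

credit_prepped <- recipe(credit_score ~ ., data = train) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep(training = train)

# Overview: one row per recipe step
step_overview <- tidy(credit_prepped)
step_overview

# Details for step 1: the features removed by step_filter_missing()
removed <- tidy(credit_prepped, number = 1)
removed
```

The terms column of the step-level result lists the features that the step dropped.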

14. Explore the model with tidy()

If we pass the fitted workflow object to tidy(), it returns information about the model coefficients.
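A minimal sketch of this call, using simulated data with hypothetical predictors. The coefficient values it prints are artifacts of the simulation.

```r
library(tidymodels)

# Simulated stand-in for the credit data (hypothetical columns)
set.seed(42)
n <- 400
credit_df <- data.frame(
  income = rnorm(n, mean = 50, sd = 10),
  monthly_balance = rnorm(n, mean = 20, sd = 5)
)
credit_df$credit_score <- factor(ifelse(
  credit_df$income - credit_df$monthly_balance + rnorm(n, sd = 8) > 30,
  "good", "poor"
))
train <- training(initial_split(credit_df, prop = 0.8, strata = credit_score))

credit_fit <- workflow() %>%
  add_recipe(
    recipe(credit_score ~ ., data = train) %>%
      step_scale(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = train)

# One row per model term, with estimate, standard error, and p-value
coefs <- tidy(credit_fit)
coefs
```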

15. Let's practice!

Now it's time to practice.