Machine learning workflows

1. Machine learning workflows

In this section, we will introduce decision tree models and learn how to create workflows that combine models and recipes into a single object.

2. Classification with decision trees

Decision trees differ from logistic regression in that they segment the predictor space into rectangular regions instead of fitting a single linear decision boundary. A popular algorithm for creating these regions is recursive binary splitting. To demonstrate it, let's use the lead scoring dataset, which records each customer's website behavior and whether or not they purchased a product.

3. Classification with decision trees

The algorithm makes a series of horizontal or vertical cut points, known as splits. In this example, the first split is horizontal along the total time on website variable.

4. Classification with decision trees

Next, a vertical split is created in the top portion of the first split along the total website visits predictor.

5. Classification with decision trees

And finally, another vertical split along the total website visits predictor is added in the bottom portion of the first split.

6. Classification with decision trees

This produces 4 distinct rectangular regions. The decision tree will predict the majority class in each region. For some datasets, this approach may produce better predictions when compared to the linear decision boundaries of logistic regression models.
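To make the idea of a split concrete, here is a minimal base R sketch of a single step of recursive binary splitting. It assumes Gini impurity as the splitting criterion and uses a small made-up data frame; the names gini(), best_split(), toy_leads, total_time, total_visits, and purchased are hypothetical and are not taken from the course materials or from any package.

```r
# Gini impurity of a vector of class labels (0 means the node is pure)
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every predictor and every midpoint between its sorted unique values,
# and keep the cut point with the lowest weighted Gini impurity
best_split <- function(data, outcome, predictors) {
  best <- list(predictor = NA, cut = NA, impurity = Inf)
  y <- data[[outcome]]
  for (p in predictors) {
    x <- data[[p]]
    values <- sort(unique(x))
    if (length(values) < 2) next
    cuts <- (head(values, -1) + tail(values, -1)) / 2
    for (cut in cuts) {
      left <- y[x < cut]
      right <- y[x >= cut]
      impurity <- (length(left) * gini(left) +
                     length(right) * gini(right)) / length(y)
      if (impurity < best$impurity) {
        best <- list(predictor = p, cut = cut, impurity = impurity)
      }
    }
  }
  best
}

# Hypothetical toy data: six leads with two numeric predictors
toy_leads <- data.frame(
  total_time = c(100, 150, 200, 900, 950, 1000),
  total_visits = c(3, 8, 2, 4, 9, 5),
  purchased = factor(c("no", "no", "no", "yes", "yes", "yes"))
)

first_split <- best_split(toy_leads, "purchased",
                          c("total_time", "total_visits"))
first_split
```

Applying best_split() again to the rows on each side of the chosen cut point, and repeating until a stopping rule is met, is what makes the procedure recursive. Real engines add stopping and pruning rules that this sketch leaves out.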

7. Tree diagrams

Tree diagrams are another way to visualize the prediction regions of decision trees and are made up of a series of nodes. Interior nodes are the splits of a decision tree and are represented by the dark boxes in the diagram below. Terminal nodes provide the model predictions and are represented by the green and purple boxes. Comparing a tree diagram to the plot of rectangular regions in the lead scoring dataset, we see that interior nodes correspond to the dashed lines in the plot while terminal nodes correspond to the 4 rectangular regions.

8. Model specification

A decision tree model is specified with the decision_tree() function. A common engine is 'rpart', and the mode can be either classification or regression. For the lead scoring data, we need the classification mode.
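The specification described above might look like the following sketch. The object name dt_model is an assumption chosen for illustration, and the code assumes the tidymodels package is installed.

```r
library(tidymodels)

# Specify a decision tree, choose the rpart engine, and set the mode
dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_model
```

Nothing is fit at this point. The object only records which kind of model to train and how.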

9. Feature engineering recipe

From our previous work, we have our leads_recipe object, which removes highly correlated predictors to reduce multicollinearity, normalizes numeric predictors, and creates dummy variables for nominal predictors. We now have two R objects to manage during the modeling process: our decision tree model and our feature engineering recipe. Combining these into a single object would make the process easier to manage.

10. Combining models and recipes

The workflows package provides the ability to combine models and recipes into a single object. To create our workflow, we initialize an empty workflow with the workflow() function, then add our decision tree model with add_model() and finally our recipe with the add_recipe() function. This produces a workflow that bundles our model with our feature engineering steps.
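As a rough sketch of these steps, the code below builds a workflow from a model and a recipe. Because the lead scoring data is not included here, it simulates a stand-in; the data frame leads_training, its columns (total_visits, total_time, lead_source, purchased), the correlation threshold of 0.9, and the name leads_wkfl are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(42)
n <- 200

# Simulated stand-in for the lead scoring training data (hypothetical columns)
leads_training <- tibble(
  total_visits = rpois(n, 5),
  total_time = runif(n, 0, 1500),
  lead_source = factor(sample(c("email", "search", "referral"), n,
                              replace = TRUE)),
  purchased = factor(
    ifelse(total_time + rnorm(n, sd = 200) > 700, "yes", "no"),
    levels = c("yes", "no")
  )
)

# Recipe with the three steps described for leads_recipe
leads_recipe <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Start with an empty workflow, then add the model and the recipe
leads_wkfl <- workflow() %>%
  add_model(dt_model) %>%
  add_recipe(leads_recipe)

leads_wkfl
```

Printing the workflow lists the recipe steps under the preprocessor and the decision tree specification under the model, which is a quick way to confirm that both pieces were attached.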

11. Model fitting with workflows

To train our workflow, we pass it to last_fit() and provide our leads_split object. Like before, performance metrics can be gathered with the collect_metrics() function. Behind the scenes, these few lines of code created training and test datasets, trained and applied our recipe, fit our decision tree to the training data, and calculated performance metrics on the test dataset. Pretty amazing!

12. Collecting predictions

The collect_predictions() function returns detailed test set predictions from the results of last_fit(), ready for use in yardstick metric functions.
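The fitting and collection steps could be sketched end to end as follows. The data is simulated, so the numbers it produces have no connection to the results quoted in this course; leads_df, its columns, the 75% training proportion, and the object names leads_wkfl_fit, leads_metrics, and leads_preds are assumptions.

```r
library(tidymodels)

set.seed(123)
n <- 400

# Simulated stand-in for the lead scoring data (hypothetical columns)
leads_df <- tibble(
  total_visits = rpois(n, 5),
  total_time = runif(n, 0, 1500),
  purchased = factor(
    ifelse(total_time + rnorm(n, sd = 200) > 700, "yes", "no"),
    levels = c("yes", "no")
  )
)

# The split object carries both the training and the test rows
leads_split <- initial_split(leads_df, prop = 0.75, strata = purchased)

leads_wkfl <- workflow() %>%
  add_model(
    decision_tree() %>% set_engine("rpart") %>% set_mode("classification")
  ) %>%
  add_recipe(
    recipe(purchased ~ ., data = training(leads_split)) %>%
      step_normalize(all_numeric_predictors())
  )

# Train on the training rows and evaluate on the test rows in one call
leads_wkfl_fit <- last_fit(leads_wkfl, split = leads_split)

# Default performance metrics, computed on the test set
leads_metrics <- collect_metrics(leads_wkfl_fit)

# One row per test observation, with predicted class and class probabilities
leads_preds <- collect_predictions(leads_wkfl_fit)

leads_metrics
leads_preds
```

The prediction columns follow the usual tidymodels naming: .pred_class holds the predicted class and .pred_yes and .pred_no hold the estimated probabilities for the two outcome levels used in this simulated example.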

13. Exploring custom metrics

We can create a custom metric function that includes the area under the ROC curve, sensitivity, and specificity using the metric_set() function. When we pass our predictions data to our function, we see that our decision tree model had a ROC AUC of 0.775 on the test data.
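A small sketch of such a custom metric function is shown below. To keep it self-contained, it uses a handful of hand-written predictions shaped like collect_predictions() output instead of real model results; the values, the column names, and the names custom_metrics and leads_results are assumptions, so the output will not match the figure quoted above.

```r
library(yardstick)
library(tibble)

# Hypothetical predictions; "yes" is the first factor level, which
# yardstick treats as the event of interest by default
leads_preds <- tibble(
  purchased = factor(
    c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
    levels = c("yes", "no")
  ),
  .pred_class = factor(
    c("yes", "yes", "no", "no", "no", "yes", "no", "yes"),
    levels = c("yes", "no")
  ),
  .pred_yes = c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7)
)

# Bundle ROC AUC, sensitivity, and specificity into one function
custom_metrics <- metric_set(roc_auc, sens, spec)

# Class metrics use estimate; ROC AUC uses the probability column
leads_results <- custom_metrics(
  leads_preds,
  truth = purchased,
  estimate = .pred_class,
  .pred_yes
)

leads_results
```

Because the metric set mixes class metrics with a probability metric, the call supplies both the predicted class column and the probability column for the event level.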

14. Loan default dataset

In the exercises, you will be working with the loans_df dataset, which contains financial data on consumer loans at a bank. The outcome variable is loan_default and indicates whether a customer defaulted on their loan or not.

15. Let's practice building workflows!

Let's practice building workflows!
