Classification models
1. Classification models
In the previous chapter, we predicted continuous outcome variables with regression models. This chapter focuses on the other branch of supervised machine learning: classification.
2. Predicting product purchases
Classification models are used for predicting categorical outcome variables. An example might be predicting whether a customer will purchase a product based on the time they spent on a company website and their total website visits. In the dataset below, each row represents a customer and the outcome variable, purchased, consists of two categories, yes and no. Plotting this data and coloring the points by the outcome variable reveals that customers who do purchase products tend to spend more time on the website.
3. Classification algorithms
Instead of predicting numbers, classification algorithms divide the space of predictor values into non-overlapping regions. Every combination of predictor values that falls within a region receives the same predicted category.
4. Classification algorithms
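For the two-predictor case used later in this chapter, a logistic regression model and its linear decision boundary can be written as follows. This is a sketch under stated assumptions: x_1 and x_2 stand for total_visits and total_time, the beta terms are the fitted coefficients, p is the estimated probability of one of the two outcome categories, and a probability cutoff of 0.5 is assumed for assigning categories.

```latex
\[
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},
\qquad
p = 0.5 \iff \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0
\]
```

The second expression is a straight line in the (x_1, x_2) plane. Points on one side of it have p above 0.5 and points on the other side have p below 0.5.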
Logistic regression is a classification model that separates the categories of the outcome variable with a linear function of the predictor values. This linear function is known as the decision boundary.
5. Lead scoring data
Throughout this chapter, we will train a logistic regression model on the leads_df tibble, which records whether each customer purchased a product along with their website behavior and demographic information.
6. Data resampling
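A minimal sketch of the resampling step in R, assuming the tidymodels package is installed. The leads_df built here is a simulated stand-in with hypothetical values (the course data is not reproduced), and prop = 0.75 is an assumed training proportion.

```r
library(tidymodels)

# Simulated stand-in for leads_df (hypothetical values, not the course data)
set.seed(123)
leads_df <- tibble(
  total_visits = rpois(400, lambda = 5),
  total_time   = runif(400, min = 0, max = 1500),
  purchased    = factor(
    ifelse(total_time + rnorm(400, sd = 300) > 800, "yes", "no"),
    levels = c("yes", "no")
  )
)

# Create the split object, stratifying by the outcome variable
# (prop = 0.75 is an assumed training proportion)
leads_split <- initial_split(leads_df, prop = 0.75, strata = purchased)

# Extract the two datasets from the split object
leads_training <- training(leads_split)
leads_test     <- testing(leads_split)

# The share of "yes" values should be similar in both datasets
mean(leads_training$purchased == "yes")
mean(leads_test$purchased == "yes")
```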
As before, the first step in fitting a model is to create training and test datasets from the original data. For the leads_df data, we create a data split object, leads_split, with the initial_split() function and stratify by our outcome variable, purchased. This ensures that the proportion of yes/no values in the outcome variable is similar in the training and test datasets. Then we pass the data split object to the training() and testing() functions to extract the training and test datasets.
7. Logistic regression model specification
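A minimal sketch of the model specification in R, assuming the tidymodels package is installed. The object name logistic_model is illustrative and does not come from the transcript.

```r
library(tidymodels)

# Specify a logistic regression model with the glm engine
# (logistic_model is an illustrative name)
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Printing the specification shows the chosen engine and mode
logistic_model
```

No data is involved at this point. The specification only records which model type, engine, and mode to use when fit() is called later.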
The logistic_reg() function is the general interface to logistic regression models in parsnip. To specify our logistic regression model, we call the logistic_reg() function, pass it to set_engine() where we select the commonly used 'glm' engine, and finally pass it to set_mode() where we set the mode to 'classification'.
8. Model fitting
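A minimal sketch of model fitting in R. The setup is included so the block runs on its own: leads_df is a simulated stand-in with hypothetical values, prop = 0.75 is an assumed training proportion, and logistic_model is an illustrative name.

```r
library(tidymodels)

# Simulated stand-in for leads_df (hypothetical values, not the course data)
set.seed(123)
leads_df <- tibble(
  total_visits = rpois(400, lambda = 5),
  total_time   = runif(400, min = 0, max = 1500),
  purchased    = factor(
    ifelse(total_time + rnorm(400, sd = 300) > 800, "yes", "no"),
    levels = c("yes", "no")
  )
)
leads_split    <- initial_split(leads_df, prop = 0.75, strata = purchased)
leads_training <- training(leads_split)

# Model specification
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Train the model on the training data only
logistic_fit <- logistic_model %>%
  fit(purchased ~ total_visits + total_time, data = leads_training)

logistic_fit
```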
As in the regression setting, once a model is specified, the fit() function is used for model training. To train our model, we pass it to the fit() function and provide our model formula. Here we are predicting purchased using total_visits and total_time as predictor variables. We also pass the leads_training tibble to the data argument.
9. Predicting outcome categories
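A minimal sketch of predicting outcome categories in R. The setup is repeated so the block runs on its own: leads_df is simulated with hypothetical values, prop = 0.75 is assumed, and class_preds is an illustrative name.

```r
library(tidymodels)

# Simulated stand-in for leads_df (hypothetical values, not the course data)
set.seed(123)
leads_df <- tibble(
  total_visits = rpois(400, lambda = 5),
  total_time   = runif(400, min = 0, max = 1500),
  purchased    = factor(
    ifelse(total_time + rnorm(400, sd = 300) > 800, "yes", "no"),
    levels = c("yes", "no")
  )
)
leads_split    <- initial_split(leads_df, prop = 0.75, strata = purchased)
leads_training <- training(leads_split)
leads_test     <- testing(leads_split)

logistic_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(purchased ~ total_visits + total_time, data = leads_training)

# Predicted categories for the test data, one row per test observation
class_preds <- predict(logistic_fit, new_data = leads_test, type = "class")

head(class_preds)
```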
To obtain model predictions, we pass our trained model, logistic_fit, to the predict() function and provide leads_test to the new_data argument. We also set type = 'class' to obtain predicted outcome categories. The predict() function always returns a tibble, and here the predicted categories are in a column named .pred_class.
10. Estimated probabilities
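A minimal sketch of obtaining estimated probabilities in R. The setup is repeated so the block runs on its own: leads_df is simulated with hypothetical values, prop = 0.75 is assumed, and prob_preds is an illustrative name.

```r
library(tidymodels)

# Simulated stand-in for leads_df (hypothetical values, not the course data)
set.seed(123)
leads_df <- tibble(
  total_visits = rpois(400, lambda = 5),
  total_time   = runif(400, min = 0, max = 1500),
  purchased    = factor(
    ifelse(total_time + rnorm(400, sd = 300) > 800, "yes", "no"),
    levels = c("yes", "no")
  )
)
leads_split    <- initial_split(leads_df, prop = 0.75, strata = purchased)
leads_training <- training(leads_split)
leads_test     <- testing(leads_split)

logistic_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(purchased ~ total_visits + total_time, data = leads_training)

# Estimated probabilities, one column per outcome category
prob_preds <- predict(logistic_fit, new_data = leads_test, type = "prob")

head(prob_preds)
```

The two probability columns sum to one in every row, since yes and no are the only categories.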
When we set the type argument to 'prob' within the predict() function, we get a tibble with the estimated probabilities for each outcome variable category for each row in our test data. We will always get one column per category in our outcome variable, named .pred_ followed by the category name. For our model on the leads_df data, we get the columns .pred_yes and .pred_no.
11. Combining results
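A minimal sketch of combining the results in R. The setup is repeated so the block runs on its own: leads_df is simulated with hypothetical values, prop = 0.75 is assumed, and class_preds, prob_preds, and leads_results are illustrative names.

```r
library(tidymodels)

# Simulated stand-in for leads_df (hypothetical values, not the course data)
set.seed(123)
leads_df <- tibble(
  total_visits = rpois(400, lambda = 5),
  total_time   = runif(400, min = 0, max = 1500),
  purchased    = factor(
    ifelse(total_time + rnorm(400, sd = 300) > 800, "yes", "no"),
    levels = c("yes", "no")
  )
)
leads_split    <- initial_split(leads_df, prop = 0.75, strata = purchased)
leads_training <- training(leads_split)
leads_test     <- testing(leads_split)

logistic_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(purchased ~ total_visits + total_time, data = leads_training)

class_preds <- predict(logistic_fit, new_data = leads_test, type = "class")
prob_preds  <- predict(logistic_fit, new_data = leads_test, type = "prob")

# Keep the true outcome, then attach predicted categories and probabilities
leads_results <- leads_test %>%
  select(purchased) %>%
  bind_cols(class_preds, prob_preds)

head(leads_results)
```

Because predict() returns rows in the same order as new_data, binding columns side by side keeps each prediction aligned with its true outcome.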
To evaluate model performance with yardstick, we will need to combine the outcome variable from the test dataset with the predicted categories and estimated probabilities. As before, this can be done with the bind_cols() function. In the next section we'll explore how to use this results dataset with yardstick metric functions.
12. Telecommunications data
Throughout the exercises in this chapter, you will be fitting models to the telecom_df dataset, which contains information on customers of a telecommunications company. The outcome variable is canceled_service and indicates whether a customer canceled their cellular and internet service.
13. Let's practice!
Let's practice building classification models!