Predict churn with logistic regression

1. Predict churn with logistic regression

Fantastic, you have done a great job going through the model preparation steps. Now we will delve deeper into the specifics of churn prediction with logistic regression.

2. Introduction to logistic regression

Logistic regression is a supervised learning technique that predicts binary response variables. It models the logarithm of the odds, where the odds are the ratio of the probability of the event occurring to the probability of the event not occurring, or p divided by 1 minus p. For example, if the probability of churn is 75%, then the probability of no churn is 25%, hence the odds are 75% divided by 25%, or 3, and the natural logarithm of 3 is roughly 1.1. The reasons behind the math are beyond the scope of this course, but this approach helps to find the decision boundary between the two classes while keeping the log-odds linearly related to the input variables. Here's the formula of the logistic regression equation based on two input variables and a probability p.
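
In its standard form, with generic coefficients b0, b1 and b2 (the exact symbols on the slide may differ), the equation is:

    log(p / (1 - p)) = b0 + b1*x1 + b2*x2

or equivalently, solving for the probability, p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2))).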

3. Modeling steps

As we've seen before, supervised learning has five steps: split the data into training and testing sets. Initialize the model. Fit the model on the training dataset. Then, predict the values on the testing data. And finally, evaluate the model performance by comparing the predicted values with the actual ones in the testing data. Since we have already learned and practiced how to split the data into training and testing sets, we will now move to the second step.
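
As a quick reminder, a minimal sketch of the first step, assuming a feature DataFrame X and a binary churn target y already exist (the variable names and the 0.25 test size are illustrative, not taken from the course data):

    from sklearn.model_selection import train_test_split

    # Split the features and target into training and testing sets
    train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25, random_state=42)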

4. Fitting the model

Let's fit the model. First, we import the logistic regression classifier from the scikit-learn library. Then, we initialize the model instance. Finally, we fit the model on the training data by providing the input features and the target variable to the method called "fit".
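
A minimal sketch of these steps, assuming the training features and target are stored in train_X and train_Y (illustrative names):

    from sklearn.linear_model import LogisticRegression

    # Initialize the model instance
    logreg = LogisticRegression()

    # Fit the model on the training input features and target variable
    logreg.fit(train_X, train_Y)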

5. Model performance metrics

Once the model is fitted, we are ready to assess its performance. In the previous lesson, we looked at high-level accuracy metrics, but there are more. While this is not an exhaustive list, these are good metrics to start with, and they are easy to interpret. The one we used previously is accuracy, which is the percentage of correctly predicted values compared to the actual ones. This includes prediction accuracy for both classes combined - both how many churned and non-churned customers we have correctly labeled - and gives us the overall performance of the model irrespective of the class. The second is called precision, and it measures the share of positive class predictions that were correct. In this case, this is the share of customers that were predicted as churned and did actually churn. The third metric is called recall. It measures the share of actual positive class observations that were correctly captured. In this case, it is the share of churned customers that were correctly classified as such.
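
Expressed in terms of confusion-matrix counts - true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) - these three metrics can be written as:

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)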

6. Measuring model accuracy

Now, let's move to calculating the model accuracy on both the training and testing datasets. Typically, the testing metric should be lower, as these are unseen observations, while the training data was used to train the model. First, we import accuracy_score from the sklearn.metrics module. Then we predict the labels by calling the predict method on the logistic regression instance and passing the input features. Once completed, we call the accuracy score and feed it the actual labels first and the predicted ones afterwards. We store the accuracy scores as separate objects. Finally, we print the rounded accuracy. We can see that the training accuracy is around 81%, while the testing accuracy is roughly 80%. This means we have correctly labeled 80% of the customers in the testing data, both churned and non-churned.
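
A minimal sketch of this calculation, reusing the fitted logreg model and the assumed train_X, test_X, train_Y and test_Y names:

    from sklearn.metrics import accuracy_score

    # Predict labels on both the training and testing features
    pred_train_Y = logreg.predict(train_X)
    pred_test_Y = logreg.predict(test_X)

    # Actual labels first, predicted labels second
    train_accuracy = accuracy_score(train_Y, pred_train_Y)
    test_accuracy = accuracy_score(test_Y, pred_test_Y)

    # Print the rounded accuracy scores
    print('Training accuracy:', round(train_accuracy, 4))
    print('Testing accuracy:', round(test_accuracy, 4))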

7. Measuring precision and recall

Now, let's calculate precision and recall. The steps are identical to calculating accuracy. First, we import the functions. Then, we calculate the precision score for both training and testing data, and round it to 4 decimals. We do the same for the recall score. Finally, we print them out. As we can see, these values vary a bit more between training and testing. Also, they are lower than accuracy, which means the model predicts the minority churned class less accurately than the majority non-churned class.
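
A minimal sketch of these steps, reusing the predicted labels from the previous slide (variable names are assumed for illustration):

    from sklearn.metrics import precision_score, recall_score

    # Precision on training and testing data, rounded to 4 decimals
    train_precision = round(precision_score(train_Y, pred_train_Y), 4)
    test_precision = round(precision_score(test_Y, pred_test_Y), 4)

    # Recall on training and testing data, rounded to 4 decimals
    train_recall = round(recall_score(train_Y, pred_train_Y), 4)
    test_recall = round(recall_score(test_Y, pred_test_Y), 4)

    print('Precision (train, test):', train_precision, test_precision)
    print('Recall (train, test):', train_recall, test_recall)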

8. Regularization

Now, we'll learn about regularization. The main idea is to introduce a penalty for model complexity in the model fitting phase. The penalty addresses over-fitting, which occurs when the model has too many features: in that situation, the model just memorizes the patterns in the training data, but does not predict well on the testing data. Some regularization techniques like L1 also perform feature selection, which reduces the number of inputs in the model, simplifies it, and makes it more generalizable to unseen samples.

9. L1 regularization and feature selection

Let's test regularization now. The LogisticRegression classifier from sklearn already performs regularization by default. It is L2 or ridge regularization, which only manages over-fitting but does not perform feature selection. L1 regularization, also called LASSO, can be requested explicitly. This approach performs feature selection by shrinking some of the beta parameters to zero. We enable it by providing 'l1' to the penalty argument and a C value, which is the inverse of the regularization strength - more on this later. Finally, we pass 'liblinear' as the solver that will be used for L1 regularization. Then, we fit the data as previously. Now, what should the C value be? We will have to optimize it by tuning.
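
A minimal sketch of this call; the C value of 0.1 is only a placeholder, since the tuning comes next:

    from sklearn.linear_model import LogisticRegression

    # L1 (LASSO) regularized logistic regression;
    # C is the inverse of the regularization strength
    logreg_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
    logreg_l1.fit(train_X, train_Y)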

10. Tuning L1 regularization

We will list a number of different C values and build a model for each. Typically we explore C values between 0 and 1, although values greater than 1 are also acceptable. Then, we create a numpy array filled with zeros and add the C candidates to its first column. Afterwards, we iterate through the C values and build a logistic regression model with each. We store the count of non-zero coefficients, accuracy, precision and recall in the remaining columns, and finally print the array to investigate.
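
A minimal sketch of this tuning loop; the candidate C values listed here are illustrative, not the exact ones from the slide:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Candidate C values to explore (illustrative choices between 0 and 1)
    C_values = [1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005]

    # Array of zeros: one row per C value; columns hold C, the count of
    # non-zero coefficients, accuracy, precision and recall
    l1_metrics = np.zeros((len(C_values), 5))
    l1_metrics[:, 0] = C_values

    for index, C in enumerate(C_values):
        logreg = LogisticRegression(penalty='l1', C=C, solver='liblinear')
        logreg.fit(train_X, train_Y)
        pred_test_Y = logreg.predict(test_X)
        l1_metrics[index, 1] = np.count_nonzero(logreg.coef_)
        l1_metrics[index, 2] = accuracy_score(test_Y, pred_test_Y)
        l1_metrics[index, 3] = precision_score(test_Y, pred_test_Y)
        l1_metrics[index, 4] = recall_score(test_Y, pred_test_Y)

    print(l1_metrics)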

11. Choosing optimal C value

We can see that lower C values shrink the number of non-zero coefficients, while also impacting the performance metrics. The decision on which C value to choose depends on the cost of declining precision and/or recall. Typically, we would like to choose a model with reduced complexity that still maintains similar performance metrics.

12. Choosing optimal C value

In this case, a C value of 0.025 meets this criterion - it reduces the number of features to 13, while maintaining accuracy, precision and recall scores close to those of the non-regularized model. The other models with lower C values start to experience a decline in the recall metric.

13. Let's run some logistic regression models!

Great work, this has been a longer one, so let's go run some logistic regression models to test our knowledge!
