
Building a predictive model

1. Building a predictive model

Alright, we are now ready to build models that predict customer churn based on the features you extracted. The aim is to build an analytical model predicting a target measure of interest. This is also referred to as predictive analytics or supervised learning.

2. Predictive modeling

Do you remember Cecelia, the data scientist colored gray in the network of data scientists? We can now use our dataset and supervised learning to build a predictive model that can predict whether she prefers R or Python. Taking another look at our dataset, we add the preference as a feature as you can see here.

3. Predictive modeling

For convenience, and for R to handle the data better, we indicate the preference with 0 and 1, where 1 denotes R and 0 denotes Python, and call the column R, as you can see here. The column R is the label or target of our dataset and the one we build the predictive model for. Before building the model we split the data into a training and a test set, as you see here. The test set consists of only Cecelia and the training set of the other people. In the exercises, you will split the data randomly. For the classification, we use two techniques: logistic regression and random forests. Both have been very popular in industry. Logistic regression is a comprehensible, white box technique, whereas random forests is a powerful, black box technique.
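The split described above can be sketched in R. The toy data frame, its column names, and the values are assumptions for illustration; only the `R` label column and the `training_set`/`test_set` names come from the slides.

```r
# Hypothetical network features for a few data scientists (illustrative values).
data_scientists <- data.frame(
  name     = c("Alice", "Bob", "Cecelia", "Dave"),
  degree   = c(3, 1, 2, 4),
  PageRank = c(0.21, 0.10, 0.15, 0.30),
  R        = c(1, 0, 1, 0)  # label: 1 = prefers R, 0 = prefers Python
)

# Slide example: Cecelia alone forms the test set.
test_set     <- data_scientists[data_scientists$name == "Cecelia", ]
training_set <- data_scientists[data_scientists$name != "Cecelia", ]

# Random split, as in the exercises: roughly 70% training, 30% test.
set.seed(42)
in_train     <- sample(nrow(data_scientists),
                       size = floor(0.7 * nrow(data_scientists)))
training_set <- data_scientists[in_train, ]
test_set     <- data_scientists[-in_train, ]
```

Because the split is random, `set.seed` makes the result reproducible across runs.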

4. Logistic regression

First, logistic regression is a statistical technique that can be applied to a binary dependent variable and is thus a supervised learning technique. It is related to linear regression, but instead of estimating a linear relationship between the dependent and the independent variables, a transformation function, which you see in this plot, is applied to the linear combination of the independent variables. As a result, the output is always between 0 and 1, and can thus be interpreted as a probability (e.g., a churn probability). To build a logistic regression model in `R` we use the `glm` function. glm stands for generalized linear model, which means that it is an extension of linear regression. The first argument is the formula of the model, that is, the relationship between the variables we want to impose. In this case, R is our response variable, so we write `R~` and then the independent variables come on the right-hand side. Here we model R as a function of degree and PageRank. If you want to include all the variables, you simply write `R~.` as shown here. The second argument is the dataset, `training_set` in our case, and the third argument is very important: `family=binomial` tells `R` that we want to do logistic regression.
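The `glm` call described above can be sketched as follows. The toy `training_set` and its values are assumptions for illustration; the formula, `data`, and `family` arguments are as in the slide.

```r
# Minimal illustrative training set with the assumed columns.
training_set <- data.frame(
  degree   = c(1, 2, 3, 4, 5, 6),
  PageRank = c(0.05, 0.10, 0.12, 0.20, 0.25, 0.30),
  R        = c(0, 0, 1, 0, 1, 1)  # 1 = prefers R, 0 = prefers Python
)

# Logistic regression: R as a function of degree and PageRank.
model <- glm(R ~ degree + PageRank, data = training_set, family = "binomial")

# Equivalent shorthand using all remaining columns as predictors:
# model <- glm(R ~ ., data = training_set, family = "binomial")

# type = "response" returns probabilities, always between 0 and 1.
probs <- predict(model, type = "response")
```

Without `family = "binomial"`, `glm` would default to ordinary linear regression, so this argument is what makes the model logistic.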

5. Random forests

Another commonly used classification technique is random forests. It is a very powerful ensemble technique that is robust to outliers and noise. It is based on a number of decision trees and makes a prediction by aggregating the results of all the trees. In `R` you can use the `randomForest` function in the `randomForest` package. As with logistic regression, you indicate the formula of your model as the first argument of the function, and the dataset, in this case `training_set`, as the second argument. The function has many other parameters that can be specified. Random forests come at the cost of being black box models, which means that they are hard to interpret: we cannot tell how the variables affect the prediction. However, we can quantify the importance of the variables and plot it, as you see here, using the function `varImpPlot` with the random forest model as its argument. The result is a plot like this one, where `PageRank` has the highest importance and degree the lowest.
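The random forest workflow above can be sketched as follows, assuming the `randomForest` package is installed. The toy `training_set` is an assumption for illustration; note that the label is converted to a factor so that `randomForest` fits a classification model rather than a regression.

```r
library(randomForest)

# Illustrative training set; factor label => classification forest.
set.seed(1)
training_set <- data.frame(
  degree   = runif(50, 1, 10),
  PageRank = runif(50, 0, 1),
  R        = factor(sample(c(0, 1), 50, replace = TRUE))
)

# Formula first, dataset second; ntree is one of the many tunable parameters.
rf_model <- randomForest(R ~ ., data = training_set, ntree = 100)

# Variable importance plot: which features drive the predictions?
varImpPlot(rf_model)
```

Even though the forest itself is a black box, `varImpPlot` recovers a ranking of the features, which is often enough insight for model diagnostics.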

6. Let's practice!

Now you will apply the classifiers to the churn dataset.