1. Cox PH model with constant covariates
To examine how several customer characteristics jointly affect the risk of churn, I will introduce the Cox proportional hazards model.
2. Model assumptions
Here, too, certain assumptions must hold for the model to deliver reliable results.
The predictors are linearly and additively related to the log hazard, so the model has the form: lambda of t given x equals lambda zero of t times e to the x beta, where lambda is the hazard function and lambda zero is the baseline hazard.
We make no assumption about the shape of the baseline hazard function.
The assumption already contained in the name of the model is the proportional hazards assumption. This says that the predictors are not allowed to interact with time. Hence, the relative hazard function, e to the x beta, must remain constant over time.
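Written in symbols, the assumptions above amount to the following. The notation $\lambda_0(t)$ for the baseline hazard and $x_1$, $x_2$ for the covariate vectors of two customers is introduced here for illustration only.

```latex
% Linear, additive predictors on the log-hazard scale
\log \lambda(t \mid x) = \log \lambda_0(t) + x\beta
\quad\Longleftrightarrow\quad
\lambda(t \mid x) = \lambda_0(t)\, e^{x\beta}

% Proportional hazards: the ratio for two customers does not depend on t
\frac{\lambda(t \mid x_1)}{\lambda(t \mid x_2)}
  = \frac{\lambda_0(t)\, e^{x_1\beta}}{\lambda_0(t)\, e^{x_2\beta}}
  = e^{(x_1 - x_2)\beta}
```

The baseline hazard cancels in the ratio, which is why its shape can be left unspecified.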
Enough theory, let's jump into `R`...
3. Fitting a survival model
First of all, we specify that time is measured in months. Then, for summary statistics and predictions, we need to determine the distributions of the predictor variables. We do this using the `datadist()` function from the `rms` package. So that the other `rms` functions can find the result later, I register it in the global options.
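These preparation steps might look roughly like the sketch below. The data frame `toy` and its column names (`tenure`, `MonthlyCharges`, `Partner`) are hypothetical stand-ins, since the actual dataset and its time variable are not named here.

```r
library(rms)  # rms attaches Hmisc, which supplies the units<- method for plain vectors

# Hypothetical toy data standing in for the churn dataset (not the course data)
set.seed(42)
n <- 50
toy <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),   # assumed name of the time variable
  MonthlyCharges = runif(n, 20, 120),
  Partner        = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Declare the time unit so that rms output and plots are labelled in months
units(toy$tenure) <- "Month"

# Summarise the predictor distributions and register them in the global options;
# the option stores the *name* of the datadist object
dd <- datadist(toy)
options(datadist = "dd")
```

Because the option holds only the object's name, `dd` has to stay available in the workspace for later calls such as `summary()` or `survplot()`.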
Now I am all set to specify the model using the `cph()` function from the `rms` package. Note that this is a slight modification of the `coxph()` function in the `survival` package.
To reduce complexity, we choose only some variables from the dataset that seem plausible for explaining churn. On the left-hand side of the formula we still work with the survival object. On the right-hand side, the predictor variables are added. Note that the additional arguments specified in the call will be needed later in the analysis.
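A minimal sketch of such a fit on simulated data follows. The data frame `toy`, the column names `tenure` and `Churn`, and the simulated effect sizes are all hypothetical. The extra arguments `x = TRUE`, `y = TRUE` and `surv = TRUE` are an assumption about which arguments are meant; they ask `cph()` to keep the design matrix, the response, and the baseline survival estimates, which later plotting and prediction functions typically rely on.

```r
library(rms)
library(survival)  # provides Surv()

# Hypothetical toy data standing in for the churn dataset (not the course data)
set.seed(42)
n <- 500
toy <- data.frame(
  SeniorCitizen  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Partner        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  MonthlyCharges = runif(n, 20, 120)
)
# Simulated churn times; the effect sizes are made up for illustration
rate <- 0.03 * exp(0.2 * (toy$SeniorCitizen == "Yes") -
                   0.3 * (toy$Partner == "Yes") -
                   0.01 * (toy$MonthlyCharges - 70))
time_raw   <- rexp(n, rate)
toy$tenure <- pmin(ceiling(time_raw), 72)  # months observed, censored at 72
toy$Churn  <- as.integer(time_raw <= 72)   # 1 = churned, 0 = censored

units(toy$tenure) <- "Month"
dd <- datadist(toy)
options(datadist = "dd")

# Survival object on the left, predictors on the right
fit <- cph(Surv(tenure, Churn) ~ SeniorCitizen + Partner + MonthlyCharges,
           data = toy, x = TRUE, y = TRUE, surv = TRUE)
print(fit)
```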
4. Summary of survival model
Printing the result gives us some descriptive statistics, some goodness of fit measures, and the coefficients and their significance.
In a Cox proportional hazards model, coefficients are interpreted similarly to those of a logistic regression. From the untransformed coefficient I can only draw conclusions about the direction of the effect. Looking at the coefficient for `SeniorCitizen`, which is positive, I see that the risk of churning is higher for senior citizens compared to non-senior citizens.
5. Interpretation of coefficients
The model's coefficients are stored in the element named "coefficients" of the fitted model. By transforming them using the exponential function, interpretation gets easier:
The hazard of churning increases by a factor of 1.23, or 23 percent, for senior citizens compared to non-senior citizens. The value 1.23 is called the hazard ratio.
For continuous covariates, interpretation changes slightly: a one-unit increase in, for example, monthly charges multiplies the hazard of churning by 0.99, which is a decrease of about 1 percent.
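The transformation from coefficients to hazard ratios can be sketched as follows. The data are simulated and hypothetical (as are the names `toy`, `tenure` and `Churn`), so the resulting numbers will not match the 1.23 and 0.99 quoted above.

```r
library(rms)
library(survival)  # provides Surv()

# Hypothetical toy data standing in for the churn dataset (not the course data)
set.seed(42)
n <- 500
toy <- data.frame(
  SeniorCitizen  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Partner        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  MonthlyCharges = runif(n, 20, 120)
)
rate <- 0.03 * exp(0.2 * (toy$SeniorCitizen == "Yes") -
                   0.3 * (toy$Partner == "Yes") -
                   0.01 * (toy$MonthlyCharges - 70))
time_raw   <- rexp(n, rate)
toy$tenure <- pmin(ceiling(time_raw), 72)
toy$Churn  <- as.integer(time_raw <= 72)

fit <- cph(Surv(tenure, Churn) ~ SeniorCitizen + Partner + MonthlyCharges,
           data = toy)

# Coefficients live on the log-hazard scale; exponentiate to get hazard ratios
hr <- exp(fit$coefficients)

# Percent change in the hazard: per level change for factors,
# per one-unit increase for continuous predictors
pct_change <- (hr - 1) * 100

print(round(cbind(hazard_ratio = hr, pct_change = pct_change), 3))
```

A hazard ratio above 1 corresponds to a positive percent change (higher risk of churn), and a ratio below 1 to a negative one.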
6. Survival probabilities by MonthlyCharges
To visualize the predictor effects, the `rms` package contains a `survplot()` function. It plots the survival probability for different levels of one variable while holding the other predictors constant. It takes the fitted survival model and the variable you are interested in as input, and allows you to label the lines by using the argument `label.curves`. First, you see the predictor effect for monthly charges. The higher the monthly charges, the higher the survival probabilities.
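A call of this kind might look like the sketch below, again on simulated, hypothetical data. The three values chosen for `MonthlyCharges` and the `time.inc = 12` axis spacing are arbitrary choices for illustration; `label.curves` can also take a list of labelling options instead of `TRUE`.

```r
library(rms)
library(survival)  # provides Surv()

# Hypothetical toy data standing in for the churn dataset (not the course data)
set.seed(42)
n <- 500
toy <- data.frame(
  SeniorCitizen  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Partner        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  MonthlyCharges = runif(n, 20, 120)
)
rate <- 0.03 * exp(0.2 * (toy$SeniorCitizen == "Yes") -
                   0.3 * (toy$Partner == "Yes") -
                   0.01 * (toy$MonthlyCharges - 70))
time_raw   <- rexp(n, rate)
toy$tenure <- pmin(ceiling(time_raw), 72)
toy$Churn  <- as.integer(time_raw <= 72)

units(toy$tenure) <- "Month"
dd <- datadist(toy)
options(datadist = "dd")

fit <- cph(Surv(tenure, Churn) ~ SeniorCitizen + Partner + MonthlyCharges,
           data = toy, x = TRUE, y = TRUE, surv = TRUE)

# One survival curve per requested value of MonthlyCharges;
# the remaining predictors are held at their datadist reference values
survplot(fit, MonthlyCharges = c(30, 70, 110),
         label.curves = TRUE, time.inc = 12)
```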
7. Survival probabilities by Partner
Next, the effect of having a partner or not is visualized. You see that customers having a partner hold on to a contract longer than customers not having a partner.
8. Visualization of hazard ratios
A nice way to visualize the hazard ratios of the coefficients is given by the `plot` method for the `summary` of a Cox proportional hazards model with the argument `log` set to `TRUE`.
Here, hazard ratios are given with the respective confidence intervals.
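As a sketch on simulated, hypothetical data, the plot can be produced like this. Note that `summary()` on an `rms` fit needs the `datadist` option to be set, and that by default it reports the effect of a continuous predictor for a change from its lower to its upper quartile rather than per unit.

```r
library(rms)
library(survival)  # provides Surv()

# Hypothetical toy data standing in for the churn dataset (not the course data)
set.seed(42)
n <- 500
toy <- data.frame(
  SeniorCitizen  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Partner        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  MonthlyCharges = runif(n, 20, 120)
)
rate <- 0.03 * exp(0.2 * (toy$SeniorCitizen == "Yes") -
                   0.3 * (toy$Partner == "Yes") -
                   0.01 * (toy$MonthlyCharges - 70))
time_raw   <- rexp(n, rate)
toy$tenure <- pmin(ceiling(time_raw), 72)
toy$Churn  <- as.integer(time_raw <= 72)

dd <- datadist(toy)
options(datadist = "dd")

fit <- cph(Surv(tenure, Churn) ~ SeniorCitizen + Partner + MonthlyCharges,
           data = toy)

# Effect estimates with confidence limits, including hazard ratios
s <- summary(fit)
print(s)

# Hazard ratios with confidence bars, drawn on a log-scaled axis
plot(s, log = TRUE)
```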
9. Let's practice!
Now let's try some examples.