1. Checking model assumptions and making predictions
It is time to validate the model. Then, we will see how to make predictions.
2. Test of PH assumption
We start by validating the proportional hazards assumption, using the `cox.zph()` function from the `survival` package. The output from the `print()` function is shown on the slide.
If the p-value of the test for a variable is below 0.05, we reject the hypothesis that this variable meets the proportional hazards assumption.
According to the test, several predictors, such as `Partner` and `MonthlyCharges`, violate the proportional hazards assumption; that is, their effect changes over time.
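As a rough sketch of this test: the churn data and model from the course are not reproduced here, so the example below assumes the `veteran` dataset that ships with the `survival` package as a stand-in, and the predictors `trt`, `karno`, and `age` are chosen only for illustration.

```r
library(survival)

# Stand-in model: veteran lung cancer data instead of the course's churn data
fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)

# Test the proportional hazards assumption per variable and globally
ph_test <- cox.zph(fit)
print(ph_test)

# Names of the rows (variables and the GLOBAL test) with p-value below 0.05
violators <- rownames(ph_test$table)[ph_test$table[, "p"] < 0.05]
violators
```

Each row of the printed table corresponds to one predictor, plus a final `GLOBAL` row for the model as a whole.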
3. Proportional hazards for Partner
Visualizing the estimate of the time-dependent coefficient beta(t) gives further insight. If the proportional hazards assumption holds, beta(t) is a horizontal line.
Look at the plot for `Partner`, which we get by passing the test object and the name of the variable of interest to the `plot()` function. The increase in the coefficient is marginal.
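Such a plot could be produced roughly as follows. This is a sketch on the `veteran` stand-in data from the `survival` package, with `karno` playing the role of the variable of interest; the dashed reference lines are an addition for readability, not part of the course slide.

```r
library(survival)

# Stand-in model (assumption: veteran data instead of the churn data)
fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)
ph_test <- cox.zph(fit)

# Smoothed estimate of beta(t) for a single variable
plot(ph_test, var = "karno")

# Reference lines: the constant coefficient from the fit, and zero
abline(h = coef(fit)["karno"], lty = 2)
abline(h = 0, lty = 3)
```

If the smoothed curve stays close to the dashed horizontal line, the constant coefficient is a reasonable summary; a curve that crosses zero indicates a sign change over time.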
4. Proportional hazards for MonthlyCharges
In the plot for `MonthlyCharges`, the coefficient tends to change sign as time goes on.
5. General remarks on tests
Generally, the following should be considered:
The test provided by the `cox.zph()` function is rather strict and sensitive to the number of observations: in large samples, even minor deviations from proportional hazards lead to small p-values.
A violation where the coefficient changes sign is certainly worse than one where the coefficient varies between some positive values.
6. What if PH assumption is violated?
If the proportional hazards assumption is violated for a certain variable, a stratified Cox model makes sense. This model allows the shape of the baseline hazard to differ between the levels of that variable. Categorical variables are wrapped in `strata()` in the model formula (`strat()` when fitting with `rms`); continuous variables are binned into classes first. The regression coefficients of the remaining predictors are estimated jointly across all strata.
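A stratified fit might look like the sketch below. It uses `coxph()` and `strata()` from the `survival` package on the `veteran` stand-in data; the class boundaries for `karno` and the column name `karno_class` are arbitrary choices made for illustration.

```r
library(survival)

vet <- veteran  # work on a copy of the stand-in data

# Continuous variable: bin into classes before stratifying (breaks are illustrative)
vet$karno_class <- cut(vet$karno, breaks = c(0, 50, 75, 100))

# Each karno class gets its own baseline hazard;
# trt and age share one coefficient each across all strata
fit_strat <- coxph(Surv(time, status) ~ trt + age + strata(karno_class), data = vet)
print(fit_strat)
```

Note that the stratification variable no longer has a coefficient of its own, so its effect cannot be read off the model summary.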
Another solution is to model time-dependent coefficients by dividing the time under observation into different periods for which we assume the coefficients to be constant.
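One way to sketch the period-based approach is with `survSplit()` from the `survival` package, which cuts each subject's follow-up into episodes. The cut points (90 and 180 days) and the column name `period` are assumptions for illustration, again on the `veteran` stand-in data.

```r
library(survival)

# Split follow-up time into three periods: (0, 90], (90, 180], (180, Inf)
vet_split <- survSplit(Surv(time, status) ~ ., data = veteran,
                       cut = c(90, 180), episode = "period", id = "id")

# One karno coefficient per period; trt keeps a single coefficient
fit_td <- coxph(Surv(tstart, time, status) ~ trt + karno:strata(period),
                data = vet_split)
print(fit_td)
```

Comparing the period-specific coefficients shows how the effect of the variable shifts over the observation time.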
7. Validating the model
To make sure that the model is not overfitted, we validate it using the `validate()` function from the `rms` package, which estimates the R^2. We choose 10-fold cross-validation by setting the argument `method` to `"crossvalidation"` and the argument `B` to 10. We specify `pr = FALSE` because we do not want the results printed after each cross-validation step.
The column `index.corrected` holds the R^2 corrected for overfitting by cross-validation.
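The validation call could look roughly like this. The sketch assumes a model fitted with `cph()` from `rms` on the `veteran` stand-in data; `x = TRUE` and `y = TRUE` store the design matrix and response so the model can be refitted on each fold, and the seed is set only to make the fold assignment reproducible.

```r
library(survival)
library(rms)

# Stand-in model; x = TRUE and y = TRUE are needed for resampling
fit_cph <- cph(Surv(time, status) ~ trt + karno + age, data = veteran,
               x = TRUE, y = TRUE)

set.seed(1)  # fold assignment is random
val <- validate(fit_cph, method = "crossvalidation", B = 10, pr = FALSE)
print(val)

# R^2 corrected for overfitting
r2_corrected <- val["R2", "index.corrected"]
r2_corrected
```

A corrected R^2 that is much lower than the value in the `index.orig` column would point to overfitting.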
8. Probability not to churn at certain timepoint
Predictions in survival analysis are unfortunately not straightforward.
Let's look at data on a new customer stored in `oneNewData`, which is a one-row data frame we specified using the `data.frame()` function.
We use the `survest()` function from the `rms` package to estimate the probability that this customer has not churned by a certain timepoint, specified by the `times` argument. In this case, we chose 3 months. The estimated survival probability `surv` tells us that the customer will not churn within three months with a probability of about 90%.
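A minimal sketch of this prediction step, assuming a `cph()` model on the `veteran` stand-in data: the one-row data frame `new_patient` and its values are hypothetical, and `times = 90` is in days because that is the time unit of this dataset.

```r
library(survival)
library(rms)

fit_cph <- cph(Surv(time, status) ~ trt + karno + age, data = veteran,
               x = TRUE, y = TRUE)

# Hypothetical new observation with one value per predictor
new_patient <- data.frame(trt = 1, karno = 60, age = 60)

# Estimated probability of surviving beyond time 90
est <- survest(fit_cph, newdata = new_patient, times = 90)
est$surv
```

The returned list also contains `lower` and `upper`, the confidence limits for the estimated probability.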
9. Survival curve for new customer
Alternatively, we can estimate a survival curve for the new customer using the `survfit()` function with the argument `newdata = oneNewData` and plot it using the `plot()` function.
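The curve can be sketched as follows. For simplicity this example fits the model with `coxph()` from `survival` rather than `cph()`, uses the `veteran` stand-in data, and the values in `new_patient` are hypothetical.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)

# Hypothetical new observation
new_patient <- data.frame(trt = 1, karno = 60, age = 60)

# Predicted survival curve for this observation, with confidence band
new_curve <- survfit(fit, newdata = new_patient)
plot(new_curve, xlab = "Time", ylab = "Estimated survival probability")
```

The curve starts at 1 and shows, for each timepoint, the estimated probability that the event has not yet occurred.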
10. Predicting expected time until churn
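In order to predict the expected time until churn, we print the `survfit()` object. The output reports the median survival time, that is, the time at which the estimated survival probability drops to 50%. For the new customer, the predicted median survival time is 65.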
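As a sketch of how the median can be read off, again using `coxph()` on the `veteran` stand-in data with a hypothetical `new_patient`: printing the `survfit()` object shows the median, and `quantile()` extracts it as a number.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)
new_patient <- data.frame(trt = 1, karno = 60, age = 60)
new_curve <- survfit(fit, newdata = new_patient)

# Printed summary includes the median survival time and its confidence limits
print(new_curve)

# Extract the median directly: first time at which survival is at most 0.5
median_time <- quantile(new_curve, probs = 0.5, conf.int = FALSE)
median_time
```

If the predicted curve never drops to 0.5 within the observed time range, the median is reported as `NA`.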
11. Learnings
Here is what you learned about survival analysis.
12. It is up to you now!
Now go ahead and practice!