1. Survival curve analysis by Kaplan-Meier
In this video we'll cover survival curve analysis.
2. Survival object I
The first step is to create a new column that holds a survival object. This will be the dependent variable in our analysis.
First, we select the two relevant variables, `tenure` and `churn`. Then we use them to create the survival object with the `Surv()` function from the `survival` package and store it in the third column. We display only the first 10 rows of the resulting object with the `head()` function.
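As a minimal sketch of this step: the data frame `customers` and its values below are hypothetical stand-ins for the course data, and coding `churn` as 1 for churned and 0 for still under contract is an assumption.

```r
library(survival)  # provides Surv()

# Hypothetical toy data: tenure in months, churn coded 1 = churned, 0 = censored
customers <- data.frame(
  tenure = c(1, 34, 2, 45, 8, 72),
  churn  = c(0, 0, 1, 0, 1, 0)
)

# Build the survival object and store it as a third column
customers$surv_obj <- Surv(time = customers$tenure, event = customers$churn)

# Show at most the first 10 rows; censored tenures are printed with a trailing "+"
head(customers, 10)
```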
3. Survival function
The measure of interest in a survival analysis is the survival function. This function gives the probability that a customer will not churn in the period leading up to the time point t.
4. Cumulative hazard function
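The counterpart to the survival function is the cumulative hazard function. It describes the cumulative risk of churn that has built up until time t. Note that it is not itself a probability: it equals the negative logarithm of the survival function, whereas the probability that the customer will have churned by time t is one minus the survival function.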
5. Hazard rate
The hazard rate, also called force of mortality or instantaneous event rate, describes the risk that an event will occur in a small interval around time t, given that the event has not yet happened.
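The three quantities can be written down compactly. The notation here is an assumption for illustration: T is the time until a customer churns, treated as a continuous random variable.

```latex
\begin{align*}
S(t) &= P(T > t) \\
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \\
H(t) &= \int_0^t h(u)\,du = -\log S(t)
\end{align*}
```

From the last line it follows that S(t) = exp(-H(t)), so knowing any one of the three functions determines the other two.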
6. Survival function
Since the true form of the survival function is rarely known, a part of survival analysis is concerned with its estimation.
7. Kaplan-Meier analysis
Without censoring, estimating the survival function would be easy: we could just calculate the percentage of customers who haven't churned yet at each time point. But with censoring, it's more complex.
The Kaplan-Meier estimator takes into account the number of customers who churned and the so-called "number at risk", that is, the customers who are still under contract and might churn in the future.
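To make the idea concrete, here is a hand calculation of the Kaplan-Meier (product-limit) estimate in base R. The sample is a hypothetical toy example, not the course data, and the variable names are chosen for illustration only.

```r
# Hypothetical toy sample: tenure in months, churn 1 = churned, 0 = censored
tenure <- c(2, 3, 3, 5, 8, 8, 12, 12)
churn  <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Distinct time points at which at least one customer churned
event_times <- sort(unique(tenure[churn == 1]))

# d: number of customers who churned at each event time
d <- sapply(event_times, function(t) sum(tenure == t & churn == 1))
# n: number at risk, i.e. customers still under contract just before that time
n <- sapply(event_times, function(t) sum(tenure >= t))

# Product-limit estimate: running product of the fractions surviving each event time
km <- cumprod(1 - d / n)

data.frame(time = event_times, n_risk = n, n_churn = d, survival = km)
```

Censored customers never count as churned, but they stay in the number at risk until their last observed month. That is how the estimator uses their partial information.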
We use the `survfit()` function from the `survival` package to estimate the survival function. Note that the survival object is used as the dependent variable here. Since we are not considering any covariates, we put a 1 to the right of the tilde. By default, `type = "kaplan-meier"` is used. The values of the survival function at different time points are stored in the `surv` element.
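A short sketch of this call, again on a hypothetical toy sample rather than the course data (the data frame `toy` is an assumption for illustration):

```r
library(survival)

# Hypothetical toy sample: tenure in months, churn 1 = churned, 0 = censored
toy <- data.frame(
  tenure = c(2, 3, 3, 5, 8, 8, 12, 12),
  churn  = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Survival object on the left of the tilde, 1 on the right (no covariates)
fit <- survfit(Surv(tenure, churn) ~ 1, data = toy)

# Estimated survival function at each distinct observed time
data.frame(time = fit$time, surv = fit$surv)
```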
8. Printing the survfit object
Applying the `print()` function to the fitted survfit model tells us that our dataset holds around five thousand three hundred customers, of which about one thousand eight hundred churned in the time under observation. The median survival time is 70, that is, about 50 percent of the customers do not churn before they reach a tenure duration of 70 months. The median survival time is the time where a horizontal line at 0.5 intersects the survival curve.
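The same summary can be reproduced on a small scale. The sketch below uses a hypothetical toy sample, so its numbers differ from the course output; reading the median from `summary(fit)$table` is one possible way to get it as a number.

```r
library(survival)

# Hypothetical toy sample: tenure in months, churn 1 = churned, 0 = censored
toy <- data.frame(
  tenure = c(2, 3, 3, 5, 8, 8, 12, 12),
  churn  = c(1, 1, 0, 1, 0, 1, 0, 0)
)
fit <- survfit(Surv(tenure, churn) ~ 1, data = toy)

# Number of customers, number of churn events, median survival time and its confidence limits
print(fit)

# The median as a number: the first time the estimated curve is at or below 0.5
median_tenure <- unname(summary(fit)$table["median"])
median_tenure
```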
9. Plotting survival with confidence intervals
Additionally, plotting the survfit object gives us a nice overview of the survival function and its confidence intervals.
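A minimal plotting sketch, assuming a hypothetical toy sample in place of the course data; the axis labels are illustrative choices.

```r
library(survival)

# Hypothetical toy sample: tenure in months, churn 1 = churned, 0 = censored
toy <- data.frame(
  tenure = c(2, 3, 3, 5, 8, 8, 12, 12),
  churn  = c(1, 1, 0, 1, 0, 1, 0, 0)
)
fit <- survfit(Surv(tenure, churn) ~ 1, data = toy)

# Draw the estimated survival curve together with its confidence limits
plot(fit, conf.int = TRUE,
     xlab = "Tenure (months)", ylab = "Survival probability")

# The plotted limits are also stored in the fitted object
head(data.frame(time = fit$time, lower = fit$lower, upper = fit$upper))
```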
10. Kaplan-Meier with categorical covariate
What you have seen so far is the estimation of the survival function independent of any other covariates.
The `survfit()` function, however, allows us to easily model the survival function depending on different covariates. On the right-hand side of the formula we just put `Partner`. Then, survival curves are estimated separately for customers with and without a partner.
Let's look at the median survival time by using the `print()` function. The median survival time of customers without a partner is 45. For customers with partners it's `NA`, that is, it is higher than 72 and lies outside the observational period. This happens when fewer than half of the customers in a group churn within the time under observation, so the survival curve never drops to 0.5. It is not a problem.
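The following sketch shows the same pattern on hypothetical toy data. The values are made up so that one group's curve never reaches 0.5; only the column name `Partner` is taken from the course.

```r
library(survival)

# Hypothetical toy data: tenure in months, churn 1 = churned, 0 = censored
toy <- data.frame(
  tenure  = c(2, 3, 5, 8, 12,  4, 10, 30, 60, 72, 72),
  churn   = c(1, 1, 1, 0, 0,   1, 0, 0, 0, 0, 0),
  Partner = c(rep("No", 5), rep("Yes", 6))
)

# One survival curve per level of Partner
fit_partner <- survfit(Surv(tenure, churn) ~ Partner, data = toy)
print(fit_partner)

# Median survival time per group; NA means the curve never drops to 0.5
medians <- summary(fit_partner)$table[, "median"]
medians
```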
11. Plotting Kaplan-Meier with covariate
In the survfit plot we see that customers with a partner hold on to a contract longer. We added a legend so that it is easier to distinguish the two groups.
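One way to draw such a plot with a legend is sketched below. The data are a hypothetical toy sample, and the line types and legend position are illustrative choices rather than the settings used in the course.

```r
library(survival)

# Hypothetical toy data: tenure in months, churn 1 = churned, 0 = censored
toy <- data.frame(
  tenure  = c(2, 3, 5, 8, 12,  4, 10, 30, 60, 72, 72),
  churn   = c(1, 1, 1, 0, 0,   1, 0, 0, 0, 0, 0),
  Partner = c(rep("No", 5), rep("Yes", 6))
)
fit_partner <- survfit(Surv(tenure, churn) ~ Partner, data = toy)

# One curve per group, told apart by line type
group_lty <- seq_along(fit_partner$strata)
plot(fit_partner, lty = group_lty,
     xlab = "Tenure (months)", ylab = "Survival probability")

# Legend entries come from the strata names, in the same order as the curves
legend("bottomleft", legend = names(fit_partner$strata), lty = group_lty)
```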
12. Let's practice!
Now it's your turn!