Survival analysis: introduction
1. Survival analysis: introduction
In this chapter we are still interested in the probability of a customer churning. We saw that this is a case for logistic regression.
In customer relationship management, however, we are often concerned with censored data. That is, the customer journeys end at the current point in time simply because we cannot see what happens in the future. This is a kind of missing data that logistic regression cannot handle, and this is where survival analysis comes into play.
3. Advantages survival model
Survival analysis allows us to model the time to an event, also called failure time or survival time. This avoids the loss of information that comes from aggregating each customer's history into a single yes/no outcome. In the end, survival analysis gives us deeper insight into customer relations, since it models when an event will take place and not just whether it will take place. If we predict that a certain customer is likely to end her contract within the next three months, special actions can be taken to keep her from churning.
4. Censored data
Censoring is a special case of missing data. The kind of censoring we are concerned with most often is random type I right-censoring. This means that a subject's event can only be observed if it occurs before a certain point in time. In the case of churn prediction, churn must happen before the current point in time to be observed. If it hasn't happened yet, we cannot know whether a person is going to churn in the future. We do know, however, that the person has not churned during the time under consideration. This is more information than completely missing data. Since customers enter our database at different points in time, censoring times can vary between subjects.
5. Data for survival analysis
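To make the censoring mechanism concrete, here is a minimal sketch in base R of how the time under observation and the status could be derived from raw dates. The data frame `customers`, its columns, the dates, and the `cutoff` value are all hypothetical and only serve as an illustration. Time is measured in days here for simplicity, whereas the telecommunication example in this chapter uses months.

```r
# Hypothetical customer records; all names and dates are invented for illustration.
customers <- data.frame(
  id         = 1:4,
  start_date = as.Date(c("2023-01-15", "2023-03-01", "2023-06-10", "2023-09-20")),
  churn_date = as.Date(c("2023-07-15", NA, "2023-12-10", NA))
)

# Assumed end of the observation window ("the current point in time").
cutoff <- as.Date("2024-01-01")

# Status: 1 if churn was observed by the cutoff, 0 if the record is right-censored.
customers$churn <- as.integer(!is.na(customers$churn_date) &
                                customers$churn_date <= cutoff)

# Time under observation ends at the churn date if churn was observed,
# otherwise at the cutoff.
end_date <- customers$churn_date
end_date[customers$churn == 0] <- cutoff
customers$tenure_days <- as.numeric(end_date - customers$start_date)

customers[, c("id", "tenure_days", "churn")]
```

Customers 2 and 4 have not churned by the cutoff, so their records are censored. Because they joined on different dates, their censoring times differ even though the cutoff is the same.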
There are basically two pieces of information necessary for survival analysis: the time under observation and the status at the end of this time. From the status we can tell whether an observation was censored or not. Additionally, for survival analysis using covariates, we need further information about the subjects in the dataset. In the following example we are looking at telecommunication data. The time under observation here is the tenure time, that is, the amount of time a person was a customer. This is measured in months and stored in the variable `tenure`. Whether or not a person churned can be seen in the variable `churn`. The slide shows part of the output from the structure function, `str()`.
6. Tenure time
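The `tenure` and `churn` variables described above can be inspected as in the following sketch. The data frame name `dataSurv`, the covariate `monthly_charges`, and all values are hypothetical stand-ins, since the actual dataset is not reproduced in this transcript. `Surv()` from the `survival` package is one common way to combine time and status into a single object; the transcript itself does not confirm that it is used at this point.

```r
library(survival)  # provides Surv()

# Hypothetical stand-in for the telecommunication data; values are invented.
dataSurv <- data.frame(
  tenure          = c(1, 34, 2, 45, 8, 22),                 # months as a customer
  churn           = c(1, 0, 1, 0, 1, 0),                    # 1 = churned, 0 = censored
  monthly_charges = c(29.9, 56.9, 53.9, 42.3, 99.7, 89.1)   # illustrative covariate
)

# Show the structure of the data: variable names, types, and first values.
str(dataSurv)

# Combine time under observation and status into one survival object.
survObj <- Surv(time = dataSurv$tenure, event = dataSurv$churn)
print(survObj)  # censored observations are marked with a "+"
```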
We plot two histograms of the tenure time, one for people who churned and one for people who didn't, using the function `ggplot()` from the `ggplot2` package. First, we mutate the variable `churn` such that its values are labeled "yes" and "no". We tell R to draw histograms by using `geom_histogram()` and make the color of the histograms depend on the factor `churn` by passing the variable `churn` to the `fill` argument. By adding `facet_grid()`, we make sure to draw one histogram for people who churned and one histogram for people who didn't churn. The argument `legend.position = "none"` in the function `theme()` suppresses the legend. Not surprisingly, tenure times are shorter for the group of people who churned.
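The plotting steps described above might look roughly like the following sketch. The data frame `dataSurv` and its values are hypothetical stand-ins for the telecommunication data, and the `binwidth` and the column-wise facet layout are assumptions, so this is not necessarily the exact code shown on the slide.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the telecommunication data; values are invented.
dataSurv <- data.frame(
  tenure = c(1, 2, 2, 5, 8, 12, 20, 34, 45, 60, 66, 72),  # months as a customer
  churn  = c(1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)          # 1 = churned, 0 = censored
)

tenurePlot <- dataSurv %>%
  # Relabel churn so the facets read "no" and "yes" instead of 0 and 1.
  mutate(churn = factor(churn, levels = c(0, 1), labels = c("no", "yes"))) %>%
  ggplot(aes(x = tenure, fill = churn)) +
  geom_histogram(binwidth = 6) +     # assumed bin width of six months
  facet_grid(. ~ churn) +            # one panel per churn status
  theme(legend.position = "none")    # facet labels make the legend redundant

print(tenurePlot)
```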
9. Let's practice!
Now it's your turn.