1. Kaplan-Meier estimate
In the previous lesson, we talked about the survival function. In this lesson, we will take a look at how to estimate the survival function using the Kaplan-Meier estimate.
2. Survival function
The survival function is so popular because it has such a straightforward interpretation. In the survival context, the survival function gives the probability of surviving beyond a time point t. If you are familiar with the cumulative distribution function: the survival function is just 1 minus the cumulative distribution function. We denote the survival function here by capital S and the cumulative distribution function by capital F.
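Written out, the relationship looks like this. The symbol T for the (random) survival time is introduced here only for illustration; it is not named in the lesson.

```latex
S(t) = P(T > t) = 1 - F(t), \qquad F(t) = P(T \le t)
```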
The survival function is a function over time and for any point in time you can say how probable it is to survive longer than that point in time.
To estimate the survival function, we will use Kaplan-Meier estimation. The Kaplan-Meier estimate is described by this formula. The red curve is the Kaplan-Meier estimate for the example data we used before. You might already know estimators for the cumulative distribution function and wonder: Why don't we just use one of those and take one minus that estimator to obtain an estimator for the survival function? This does, in fact, work if there is no censoring, or if the estimator is able to deal with censoring, which is usually not the case.
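The formula itself is only visible on the slide. In its standard form, and presumably the form shown there, the Kaplan-Meier estimate can be written as below, where n_i is the number of individuals under observation at time point t_i and d_i is the number of individuals dying at t_i.

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \frac{n_i - d_i}{n_i}
```

Censored individuals never show up in d_i; they only reduce n_i at later time points, which is why the curve does not drop at censoring times.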
3. Survival function estimation
To get an intuition about what the Kaplan-Meier estimate does, let's compare the curve with the data. We see that the survival probability is 1 until the first event occurs at time point 4. There it drops quite a bit because two events happen (two people die). At time point 5 there is another smaller drop. Note that there is no drop at time points where we have censoring, for example, time point 2. Note also that we use these marks to visualize the time points of censoring.
Computing the Kaplan-Meier curve is actually not very hard. Let's do that now.
4. Survival function estimation: Kaplan-Meier estimate
We use the formula, which tells us to take the product of the term (n_i - d_i)/n_i over all time points up to the current one. We only need to look at time points where something happens. n_i denotes the number of individuals under observation at time point t_i and d_i denotes the number of individuals dying at t_i. At time point 2, we have 5 individuals under observation; one of them is censored at time point 2, but none die. So our survival probability is (5-0)/5 = 1. Note that we do not need to take a product here because all earlier time points have a value of 1, and 1 times x is the same as x. At time point 3 nobody dies, so the survival probability stays 1, but note how the number of individuals under observation has gone down from 5 to 4 due to the censoring of one individual. At time point 4 two people die, so the survival probability for this time point is (4-2)/4 = 2/4 = 1/2 = 0.5. At time point 5 we take the previous value 1/2 times the new factor, which is 2 (for the 2 individuals still under observation) minus 1 person dying, divided by 2: 1/2 times (2-1)/2 = 1/4 = 0.25. At time point 6 nothing changes because there are no deaths.
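The same hand computation can be sketched in a few lines of base R. The time and event vectors below are an assumption: they are reconstructed from the walkthrough (one censoring at 2, two deaths at 4, one death at 5, one individual still under observation at 6), and the variable names are chosen for illustration.

```r
# Example data reconstructed from the walkthrough (an assumption):
# event = 1 means death, event = 0 means censored.
time  <- c(5, 6, 2, 4, 4)
event <- c(1, 0, 0, 1, 1)

# Distinct time points at which something happens (a death or a censoring).
t_i <- sort(unique(time))

# n_i: individuals still under observation at t_i.
n_i <- sapply(t_i, function(t) sum(time >= t))

# d_i: individuals dying at t_i.
d_i <- sapply(t_i, function(t) sum(time == t & event == 1))

# Kaplan-Meier estimate: running product of (n_i - d_i) / n_i.
surv <- cumprod((n_i - d_i) / n_i)

km_by_hand <- data.frame(t_i, n_i, d_i, surv)
print(km_by_hand)
```

The surv column should read 1, 0.5, 0.25, 0.25 for time points 2, 4, 5 and 6, matching the values computed by hand.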
5. Survival function estimation: Kaplan-Meier estimate
We can compute the same thing in R using the survfit() function. Remember to combine the time and event variables using the Surv() function. The 1 after the tilde shows that we want to compute one survival function for all observations. To reproduce the plot that we have been looking at, use the ggsurvplot() function. The first argument is the Kaplan-Meier estimate. The conf.int argument specifies whether you want a confidence interval, which we don't. risk.table = "nrisk_cumevents" requests a table below the main plot that shows the number of individuals at risk and, in brackets, the cumulative number of individuals with an event. We need no legend because we have only one curve for all patients, so we set legend = "none".
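A minimal sketch of the calls described here might look as follows. The data frame dat and its values are an assumption reconstructed from the hand computation, not taken from the slide. The plot is only drawn if the survminer package, which provides ggsurvplot(), happens to be installed.

```r
library(survival)

# Example data reconstructed from the lesson's walkthrough (an assumption):
# event = 1 means death, event = 0 means censored.
dat <- data.frame(
  time  = c(5, 6, 2, 4, 4),
  event = c(1, 0, 0, 1, 1)
)

# One Kaplan-Meier curve for all observations: "~ 1" means no grouping.
km <- survfit(Surv(time, event) ~ 1, data = dat)
print(summary(km))

# Plot without confidence interval or legend, with a risk table below.
if (requireNamespace("survminer", quietly = TRUE)) {
  p <- survminer::ggsurvplot(
    km,
    data       = dat,
    conf.int   = FALSE,
    risk.table = "nrisk_cumevents",
    legend     = "none"
  )
  print(p)
}
```

Note that summary(km) lists only the time points with deaths, while km$time and km$surv also include the censoring times.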
6. Let's practice!
Time to put this into practice.