
Why use survival analysis?

1. Why use survival analysis?

So why do we use survival analysis to handle time-to-event data? What's different about this type of data?

2. Average battery life example

Hopefully, you remember the truck battery data we saw in the last lesson. Say we want to calculate the average battery lifetime by averaging the duration column. What problems could this approach have?

3. Average battery life example

Even though the duration column shows how long each battery has lasted so far, some batteries, such as 1, 3, and 4, have not died yet.

4. Censorship in battery life

We know their durations so far but not their actual lifetimes. This missing-data issue is called censorship. A simple average of the duration column would incorrectly treat the observed durations of batteries 1, 3, 4, and any other batteries that haven't died as their actual lifetimes.
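To make the problem concrete, here is a minimal sketch with made-up numbers (the durations and flags below are hypothetical, not the truck battery data from the lesson). A censored battery's true lifetime is at least its observed duration, so averaging the duration column can only understate the true mean lifetime of these batteries.

```python
# Hypothetical records: (observed duration in months, has the battery died?)
batteries = [
    (12, False),  # still running: true lifetime is at least 12 months
    (20, True),
    (8, False),
    (15, False),
    (30, True),
]

# Naive approach: treat every observed duration as a full lifetime
durations = [d for d, _ in batteries]
naive_mean = sum(durations) / len(durations)

# Dropping the censored batteries instead discards information,
# and it is generally biased as well
dead = [d for d, died in batteries if died]
dead_only_mean = sum(dead) / len(dead)

n_censored = sum(1 for _, died in batteries if not died)

print(f"Naive mean of all durations: {naive_mean}")
print(f"Mean of dead batteries only: {dead_only_mean}")
print(f"Censored batteries: {n_censored} of {len(batteries)}")
```

Neither number is a trustworthy estimate of average battery life, which is the gap survival analysis is meant to address.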

5. The censorship problem

Censorship happens when the survival time, or time until the event, is only partially known. Two common causes of this problem in time-to-event data stand out. One, the event hasn't occurred by the end of our observation period or by the time of the analysis. Two, we don't know if or when the event occurs because the individuals are no longer in the study.

6. Types of censorship

In time-to-event data, the censorship problem can show up in different ways. If the event occurred and the survival duration is known, the data is not censored. If the event has not occurred, or the true duration extends beyond our observation, the data is right-censored. If the event has happened but we only know that the true duration is shorter than what we observed, the data is left-censored. Lastly, data is interval-censored when the event is known to have occurred, but individuals come in and out of observation, so the exact event times and survival durations are unknown. In this course, we will focus on right-censorship because it is the most common type.

7. Why is censorship bad?

Censorship causes problems for standard methods. For simple statistics, the unknown lifetimes act as missing data that could skew our results one way or another. For regression, recall that linear regression fits a line by minimizing the sum of squared errors. If an observation is censored, we don't know its true duration, so we can't compute its error term, and the fitted regression line will be unreliable.

8. The survival function

Survival analysis is not a magical tool to fill in unknown durations. It uses a probability distribution to take censored data into account. It tries to model a probability function called the survival function, which gives the probability that an individual survives longer than some specific time t.
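As an illustration of how a survival function can be estimated from censored data, here is a minimal sketch of the Kaplan-Meier estimator, one common nonparametric approach. The lesson has not named an estimation method at this point, so treat the choice of Kaplan-Meier as an assumption; the function name `kaplan_meier` and the data are hypothetical as well.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each time t at which at least one event occurred."""
    # Distinct times at which an event (not a censoring) was observed
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Individuals still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events that happened exactly at time t
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        # Multiply by the estimated probability of surviving past time t
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: duration in months, and whether the battery died (True)
# or was still alive when last observed (False, right-censored)
durations = [5, 8, 12, 12, 20, 30]
observed = [True, False, True, False, True, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"Estimated probability of surviving past month {t}: {s:.4f}")
```

Censored batteries are never counted as events, but they do count toward the at-risk group for as long as they were observed. That is how the partial information they carry enters the estimate.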

9. Survival analysis versus censorship

Even though survival analysis is a critical tool for handling censored data, censorship is not required to use survival analysis. Ideally, only a small share of the data points is censored, so that we have as much information as possible about survival patterns.

10. Checking data for censorship

There are three key steps to check for censorship issues and whether survival analysis would be appropriate. First, is there a column that indicates censorship? If not, is there a way to derive it from other columns? Second, what percentage of the data is censored? If more than 50% of the data is censored, even survival analysis will have limited effectiveness. Third, is the censorship non-informative and random? Whether an observation is censored should have nothing to do with its survival. For example, in clinical trials, sometimes patients withdraw from studies when their health deteriorates so much that they prefer to rest at home. In this case, censored patients have a lower rate of survival, so the censorship is informative and this condition is violated.
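The first two checks can be sketched in a few lines. The example assumes a hypothetical table with an install date and an optional failure date, plus an assumed end of the observation window (`STUDY_END`); none of these names or values come from the lesson's dataset. The third check usually depends on knowing why records were censored, so it is not covered here.

```python
from datetime import date

# Assumed end of the observation window (hypothetical)
STUDY_END = date(2023, 12, 31)

# Hypothetical records: a missing failure date means the battery was
# still alive when observation ended
records = [
    {"id": 1, "installed": date(2022, 1, 10), "failed": None},
    {"id": 2, "installed": date(2021, 6, 1), "failed": date(2023, 2, 15)},
    {"id": 3, "installed": date(2023, 3, 5), "failed": None},
    {"id": 4, "installed": date(2022, 9, 20), "failed": None},
    {"id": 5, "installed": date(2021, 11, 11), "failed": date(2023, 8, 30)},
]

for r in records:
    # Step 1: derive the event indicator and the observed duration
    r["observed"] = r["failed"] is not None
    end = r["failed"] if r["observed"] else STUDY_END
    r["duration_days"] = (end - r["installed"]).days

# Step 2: share of right-censored records
censored_share = sum(not r["observed"] for r in records) / len(records)
print(f"Censored: {censored_share:.0%}")
if censored_share > 0.5:
    print("More than half of the records are censored; results may be of limited use.")
```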

11. Let's practice!

We've learned what censorship is and why we use survival analysis for time-to-event data. Now let's try it on real-world problems!