1. Your first survival curve!
We learned the ins and outs of time-to-event data. Now, let's take a deep dive into how survival analysis works and start drawing survival curves.
2. The survival function
We use capital T to denote the time at which the event of interest occurs, and lowercase t to denote any point in time during our observation. The survival function is a function of time t that gives the probability that the event happens after t, in other words, the probability of an individual surviving past t. This probability is called the survival probability.
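In symbols, the definition above can be written as follows. The name S for the survival function is a conventional choice assumed here, since the narration does not name it.

```latex
S(t) = P(T > t)
```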
3. The survival curve
To visualize the survival function, we use the survival curve. On the X-axis is time, usually from the beginning of the observation to the end. On the Y-axis is survival probability.
4. The survival curve
Using time-to-event data, we calculate what proportion of the population has not experienced the event at each point in time. Connecting these survival probabilities, we have a line that tells us the probability of survival past any given time.
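As a minimal sketch of this calculation, the snippet below computes the proportion of individuals who have not yet experienced the event at each point in time. The event times are made up for illustration, and the sketch assumes every event is observed (no censoring), which is simpler than what real time-to-event data usually looks like.

```python
import numpy as np

# Hypothetical event times in arbitrary units; assumes no censored observations.
event_times = np.array([2, 3, 3, 5, 6, 8, 8, 9, 11, 12])

def survival_probability(times, t):
    """Proportion of individuals whose event happens after time t."""
    return float(np.mean(times > t))

# Evaluate the proportion at each whole time unit to trace out the curve.
timeline = np.arange(0, 13)
curve = [survival_probability(event_times, t) for t in timeline]

for t, s in zip(timeline, curve):
    print(f"t = {t:2d}  survival probability = {s:.1f}")
```

Connecting the printed values gives the survival curve: it starts at 1 and never increases as time goes on.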
5. The survival curve
If the dataset is large, this survival curve approaches the true survival function for the population.
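To illustrate this, here is a small simulation sketch. It assumes a known "true" survival function (an exponential distribution with mean 1, chosen purely for illustration) and measures how far the curve computed from a sample is from it, for a small and a large sample.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed true survival function for the simulation: exponential with mean 1.
grid = np.linspace(0, 5, 101)
true_survival = np.exp(-grid)

def empirical_survival(times, grid):
    """Fraction of sampled event times that fall after each grid point."""
    return (times[:, None] > grid[None, :]).mean(axis=0)

def max_gap(n):
    """Largest distance on the grid between the sample curve and the true curve."""
    times = rng.exponential(scale=1.0, size=n)
    return float(np.max(np.abs(empirical_survival(times, grid) - true_survival)))

gap_small = max_gap(50)
gap_large = max_gap(50_000)
print(f"largest gap with 50 individuals:     {gap_small:.3f}")
print(f"largest gap with 50,000 individuals: {gap_large:.3f}")
```

With the larger sample, the gap between the estimated curve and the assumed true curve should be much smaller.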
6. Interpreting a survival curve
The survival curve is very information-rich. Any point on this curve tells us the probability of an individual surviving longer than a given time.
7. Interpreting a survival curve
For example, at t = 5, there is a 50% probability that an individual survives longer than 5 units of time. The vertical distance between two points on the curve shows how much the survival probability changes between those two times. A flatter stretch of the curve means that not many individuals experience the event during that time interval, and a steeper stretch means that many do.
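These readings can be reproduced numerically. The sketch below uses made-up event times, chosen so that the survival probability at t = 5 is 50%, and again assumes no censoring.

```python
import numpy as np

# Hypothetical event times, all observed (no censoring).
event_times = np.array([1, 4, 4, 5, 5, 9, 12, 15, 18, 20])

def survival_probability(t):
    """Proportion of individuals whose event happens after time t."""
    return float(np.mean(event_times > t))

# Reading a single point: probability of surviving longer than 5 units of time.
s_at_5 = survival_probability(5)

# Vertical distance between two points: change in survival probability.
drop_3_to_5 = survival_probability(3) - survival_probability(5)  # steep stretch
drop_5_to_8 = survival_probability(5) - survival_probability(8)  # flat stretch

print(f"survival probability at t = 5: {s_at_5:.1f}")
print(f"drop between t = 3 and t = 5:  {drop_3_to_5:.1f}")
print(f"drop between t = 5 and t = 8:  {drop_5_to_8:.1f}")
```

In this toy data, four of the ten events fall between t = 3 and t = 5, so the curve is steep there, and none fall between t = 5 and t = 8, so the curve is flat.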
8. Non-parametric versus parametric models
Ways to estimate the survival function can be categorized as non-parametric or parametric. Non-parametric means that we make no assumptions about the shape of the underlying distribution. Parametric means that we impose a shape onto our data and use a fixed set of parameters to describe it.
9. Non-parametric versus parametric models
Non-parametric modeling describes the data empirically, so it is very flexible. But the estimated survival function won't be smooth, because we don't have observations for every single point in time. Parametric modeling solves the smoothness problem, but it only works well if the assumed distribution is a good description of the data.
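A rough sketch of the contrast, using made-up durations with no censoring: the non-parametric estimate is simply the observed proportion surviving past each time, while the parametric estimate assumes a shape and fits its parameters. The exponential shape used here is an assumption picked for illustration, not something the narration prescribes.

```python
import numpy as np

# Hypothetical durations, all observed (no censoring).
event_times = np.array([1.0, 2.0, 2.5, 4.0, 5.5, 7.0, 9.0, 12.0])
timeline = np.array([0.0, 3.0, 6.0, 10.0])

# Non-parametric: observed proportion surviving past each time on the timeline.
empirical = np.array([(event_times > t).mean() for t in timeline])

# Parametric: assume an exponential shape, S(t) = exp(-rate * t).
# Without censoring, the maximum-likelihood rate is 1 / (mean duration).
rate = 1.0 / event_times.mean()
parametric = np.exp(-rate * timeline)

for t, e, p in zip(timeline, empirical, parametric):
    print(f"t = {t:4.1f}  non-parametric = {e:.3f}  parametric = {p:.3f}")
```

The non-parametric values only change where events were observed, whereas the parametric curve is defined smoothly for every t but is only as good as the assumed shape.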
Non-parametric and parametric survival curves may look different, but their interpretations are identical.
10. Drawing a survival curve
So how do we draw a survival curve in Python?
The lifelines package is a survival analysis library that can fit survival functions and plot survival curves. After importing the lifelines package and matplotlib, we use the dot-fit method to fit a survival function based on the durations and censorship flags in our data, and the dot-plot_survival_function method to plot the survival function.
11. Survival curve example
Say we are measuring time until repayment for mortgages. The id column identifies a mortgage; the duration column represents the number of years the mortgage has been outstanding; the paid_off column represents whether the mortgage is fully paid off at the time of observation.
12. Survival curve example
To construct a survival curve, first, we import lifelines and the pyplot module from matplotlib for plotting.
Next, we instantiate a KaplanMeierFitter object and fit our data to a survival function. We pass the duration column to the durations parameter, and the paid_off column to the event_observed parameter to indicate censorship. The Kaplan-Meier estimator is a non-parametric way to estimate the survival function. We will learn more about it in the next chapter.
Lastly, calling plot_survival_function on the fitted KaplanMeierFitter object, we can plot the survival curve. plt-dot-show displays the figure.
13. Let's practice!
Now it's your turn to draw survival curves with real-world data!