Applying survival analysis to groups

1. Applying survival analysis to groups

A practical case of survival analysis is comparing survival functions between groups of data.

2. The mortgage problem

Let's start with an example. We are analyzing the survival of mortgages, but with additional information: the property type of each mortgage is listed as either "house" or "apartment". We want to know whether time to payoff differs between house and apartment mortgages. How should we incorporate survival analysis?

3. Comparing groups' survival distributions

Comparing survival functions between groups is common for time-to-event studies. The groups might have different qualitative attributes, such as different types of mortgages or different brands of tires. Often the attribute is the experiment group assignment in randomized controlled trials. We might also be interested in comparing groups with different values for the same attribute, such as high- or low-income households.

4. Types of survival group comparisons

There are various depths of comparison we could make. At the most basic level, we could compare two groups' survival probabilities at specific times, or the overall proportions of subjects that survived. This gives us a general idea of how similar the survival functions are and in which direction they differ. Depending on the use case, we might choose different tools for the comparison.
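The most basic comparison, survival probabilities at a specific time, can be sketched without any library. This is a minimal illustration, assuming made-up durations and censoring flags; the helper names kaplan_meier and survival_at are hypothetical and not part of the course code.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each observed event time."""
    events = Counter(t for t, e in zip(durations, observed) if e)
    survival, curve = 1.0, []
    for t in sorted(events):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - events[t] / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve, time):
    """Step-function lookup: value at the last event time <= time."""
    prob = 1.0
    for t, s in curve:
        if t > time:
            break
        prob = s
    return prob


# Hypothetical data: years until payoff; observed=0 means the mortgage
# was still open when observation ended (censored).
house_durations = [5, 8, 12, 15, 20, 25]
house_observed = [1, 1, 0, 1, 1, 0]
apartment_durations = [3, 4, 6, 9, 10, 14]
apartment_observed = [1, 1, 1, 0, 1, 1]

house_s10 = survival_at(kaplan_meier(house_durations, house_observed), 10)
apartment_s10 = survival_at(
    kaplan_meier(apartment_durations, apartment_observed), 10
)
print(f"S(10) houses: {house_s10:.3f}, apartments: {apartment_s10:.3f}")
```

With this toy data, about 67% of house mortgages and 25% of apartment mortgages are still unpaid at year 10, which gives the directional difference but says nothing yet about whether it is statistically meaningful.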

5. Types of survival group comparisons

For experiment groups, we often want to know if the underlying survival distributions are different. This comparison requires formal testing procedures.

6. Types of survival group comparisons

To quantify how much an attribute affects survival, regression-based methods are needed. We will focus on the first use case, the basic comparison, for now.

7. Visualizing group differences

Plotting groups' survival curves side by side is a great way to compare survival distributions. It is simple to execute and interpret, and flexible across different kinds of time-to-event data. In general, it's a useful tool for showing differences in survival distributions visually.

8. Identifying the groups

To start fitting the Kaplan-Meier estimator to our data, we will use Boolean masking to create references to the groups. This keeps our model-fitting code clear. Let's revisit the mortgage example from the start of the video. We create a mask called "house" that is True if the property type is a house, and False otherwise. Similarly, we create a mask called "apartment". Because we only have 2 groups, the apartment mask is simply the negation of the house mask, so we don't strictly need it.
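The masking step might look like the following sketch. The DataFrame name mortgage_df and the column names property_type, duration, and paid_off are assumptions for illustration; the transcript does not name them.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions, not from the course.
mortgage_df = pd.DataFrame({
    "property_type": ["house", "apartment", "house", "apartment", "house"],
    "duration": [12, 7, 20, 5, 15],  # years until payoff or end of observation
    "paid_off": [1, 1, 0, 1, 1],     # 1 = paid off, 0 = censored
})

# Boolean masks: one True/False value per row.
house = mortgage_df["property_type"] == "house"
apartment = mortgage_df["property_type"] == "apartment"

# With exactly two groups, the apartment mask is the negation of the house mask.
print((apartment == ~house).all())

# The masks select each group's rows for model fitting.
house_durations = mortgage_df.loc[house, "duration"]
apartment_durations = mortgage_df.loc[apartment, "duration"]
```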

9. Fitting and plotting survival curves

After we import the KaplanMeierFitter class and pyplot from matplotlib, we create an axes object called ax using plt.subplot() and instantiate one KaplanMeierFitter object called mortgage_kmf. Next, we fit mortgage_kmf to the house data only and pass the label "Houses" for plotting. When we call plot_survival_function, we can specify that it should draw on ax, and the label is applied automatically.

10. Fitting and plotting survival curves

Now we do the same thing for the apartment data and plot it on the same axes. We reused the same KaplanMeierFitter instance to refit the data, which overwrites the house survival function. We could also create separate instances of the KaplanMeierFitter class, one per group. The benefit of the multiple-instance approach is that each fitted model remains available for later reference.

11. Visualizing side-by-side

plt.show() displays the figure with the 2 labeled survival curves we drew on ax.

12. Interpreting groups' survival curves

How do we use this plot to assess whether the two survival curves are different? First, from the slopes of the curves, we see that apartment mortgages are paid off faster than house mortgages on average. More precisely, at each time, a higher proportion of apartment mortgages than house mortgages has been paid off. The confidence intervals may overlap, however. Where they overlap, the difference we observe could be due to chance.

13. Let's practice!

Now let's practice!