Other distributions and model selection

1. Other distributions and model selection

Survival analysis offers a variety of parametric models. In this video, we will learn how to compare their performance effectively and at scale.

2. Which model fits the data the best?

It's not always clear what distribution underlies a time-to-event dataset. Here we fit four different parametric models to the same data. Using the Kaplan-Meier survival curve as a baseline, can you tell which model fits the data best?

3. Choosing parametric models

Non-parametric modeling is distribution-free, so it describes the observed data closely. However, it is often difficult to extrapolate or draw insights from. Parametric modeling is very effective when the chosen distribution is a good fit for the data, but a poorly fitting model can lead to severely biased conclusions. We therefore usually need to assess the goodness-of-fit of a parametric model before using it.

4. Common parametric survival models

Many parametric models are commonly used for survival analysis, including the Weibull model, which we have already learned about, the exponential model, the gamma model, the log normal model, and the log logistic model. We will not learn the specifics of each model in this course, but it's important to have a general framework for comparing them.

5. The Akaike Information Criterion (AIC)

The Akaike Information Criterion, or AIC, is an estimator of prediction error and therefore of the relative quality of a model. It estimates the amount of information lost by a given model and penalizes models with many parameters. Given a set of models fitted to the same data, we can compare their AIC values. The model with the lowest AIC value is the preferred model.
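
To make the penalty concrete, the AIC is commonly written as 2k minus twice the maximized log-likelihood, where k is the number of fitted parameters. The sketch below computes it by hand for an exponential model with right-censored data. The function name and the toy numbers are made up for illustration and are not from the course.

```python
import math

def exponential_aic(durations, observed):
    """AIC of an exponential survival model fitted by maximum likelihood.

    durations: follow-up time for each subject
    observed:  1 if the event was observed, 0 if the subject was censored
    """
    n_events = sum(observed)
    total_time = sum(durations)
    # Maximum likelihood estimate of the constant hazard rate
    rate = n_events / total_time
    # Log-likelihood under right-censoring: events contribute log(rate),
    # and every subject contributes -rate * duration
    log_likelihood = n_events * math.log(rate) - rate * total_time
    k = 1  # the exponential model has a single parameter
    return 2 * k - 2 * log_likelihood

# Toy data (hypothetical): 8 subjects, 2 of them censored
durations = [5, 8, 12, 3, 9, 15, 7, 20]
observed = [1, 1, 0, 1, 1, 0, 1, 1]

aic = exponential_aic(durations, observed)
print(f"Exponential model AIC: {aic:.2f}")
```

A model with more parameters has to raise the log-likelihood enough to offset its larger 2k term before its AIC comes out lower.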

6. Using the AIC for model selection

To use the AIC metric, first, we fit the parametric models. Next, we print the AIC values by accessing the AIC_ property of each fitted model. Lastly, we pick the model with the lowest AIC value. As an example, here we fit 3 parametric models separately and print each model's AIC_ property. The log normal model has the lowest AIC value, meaning it's a better fit for the data than the Weibull model and the exponential model.

7. find_best_parametric_model()

Conveniently, there is a built-in lifelines function called find_best_parametric_model that iterates through all available models in the library. We can specify AIC as the scoring method, and the function will return the model with the lowest AIC along with the AIC value itself. To use the function, we pass the durations column to the event_times parameter and the censoring column to the event_observed parameter. We store the output, which is a tuple of the model object and its AIC score, as best_model_ and best_aic_. In this example, the Weibull model wins out.

8. The QQ plot

Another method to gauge goodness-of-fit is the QQ plot, which compares two distributions by plotting their quantiles against each other. If the theoretical distribution is close to the empirical distribution from the data, the points in the plot will lie close to the line y=x. Otherwise, they will deviate from it. In this example, the left plot shows a decent fit, but the right plot does not.
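
To see what "plotting quantiles against each other" means, here is a simplified sketch that builds the point pairs for an exponential fit by hand. It assumes fully observed (uncensored) data, and the helper name is hypothetical; handling censored data takes more care than this.

```python
import math
import random

def exponential_qq_points(sample):
    """Return (theoretical, empirical) quantile pairs for an exponential fit."""
    ordered = sorted(sample)
    n = len(ordered)
    mean = sum(ordered) / n  # maximum likelihood estimate of the scale
    pairs = []
    for i, empirical_q in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position strictly between 0 and 1
        theoretical_q = -mean * math.log(1 - p)  # exponential quantile function
        pairs.append((theoretical_q, empirical_q))
    return pairs

# Simulated uncensored durations (hypothetical) from an exponential distribution
random.seed(0)
sample = [random.expovariate(0.1) for _ in range(500)]
pairs = exponential_qq_points(sample)

# Average vertical distance from the line y = x; smaller suggests a closer fit
mean_gap = sum(abs(t - e) for t, e in pairs) / len(pairs)
print(f"Mean distance from y=x: {mean_gap:.3f}")
```

Plotting the pairs as a scatter plot with the line y=x drawn on top gives the QQ plot described above.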

9. Using QQ plots for model selection

To use the QQ plot, first, we fit the parametric models. Next, we plot each model using the qq_plot function. Lastly, we analyze which plot is closest to y=x. As an example, here we iterate through 4 parametric model instances and fit each one. After fitting, we call the qq_plot function to draw the model's QQ plot.

10. Using QQ plots for model selection

The upper right plot, the log normal model, shows the most promising fit. The log logistic model, while similar, deviates from the line y=x more at its tail. The other two plots show more extreme outliers that deviate from the straight line. The QQ plot is a more subjective way to assess model fit and sometimes the differences are subtle.

11. Let's practice!

Now let's practice choosing models!
