1. Weibull model with covariates
Often, we want to evaluate how survival functions are affected by additional data we have about the population.
2. Comparing survival functions
We learned that we could compare survival functions using the Kaplan-Meier estimator and the log-rank test. But could we assess how one or more continuous variables affect a survival function?
3. Survival regression
We use a technique called survival regression to model survival functions with covariates and quantify their effects on the survival function.
As in other regression models, we regress a response variable, in this case durations, on the covariates.
Covariates may be a quantitative measurement, such as a person's age and weight. They may also be qualitative information such as country.
4. The Accelerated Failure Time (AFT) model
One survival regression framework is the accelerated failure time model.
Suppose we have two groups, A and B, with different survival functions that are related by some rate lambda. One curve can be interpreted as a sped-up or slowed-down version of the other, and lambda is the accelerated failure rate. The AFT model uses this logic to vary the survival function based on subjects' covariates.
When a covariate changes from a to b, the average survival time changes by a factor of lambda. An example is that dogs age 7 times faster than humans and their average lifetime is 7 times shorter.
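To make the relationship concrete, here is a minimal sketch in plain Python. It assumes the AFT relation is written as S_A(t) = S_B(t / lambda), a convention under which group A's average survival time is lambda times group B's, so the dog example corresponds to lambda = 1/7. The Weibull parameters are hypothetical and chosen only for illustration.

```python
import math


def weibull_survival(t, scale, shape):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_mean(scale, shape):
    """Mean of a Weibull distribution with the given scale and shape."""
    return scale * math.gamma(1.0 + 1.0 / shape)


# Hypothetical parameters for group B (say, humans); not taken from real data.
scale_b, shape = 80.0, 5.0
lam = 1.0 / 7.0  # group A's average lifetime is lam times group B's


def survival_b(t):
    return weibull_survival(t, scale_b, shape)


def survival_a(t):
    # Assumed AFT relation: S_A(t) = S_B(t / lam)
    return survival_b(t / lam)


# With a Weibull baseline, S_B(t / lam) is again a Weibull with scale lam * scale_b.
scale_a = lam * scale_b
mean_ratio = weibull_mean(scale_a, shape) / weibull_mean(scale_b, shape)

print(survival_a(10.0), survival_b(70.0))  # the two values agree
print(mean_ratio)                          # approximately 1/7
```

Group A at time 10 sits at the same point on its curve as group B at time 70, which is what moving along the curve 7 times faster means.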
5. Data for survival regression
The data for survival regression has a duration column, and often some covariate columns and a censorship column.
Categorical variables should be one-hot encoded to be 0 or 1.
And if no censorship column is provided, the model assumes that there are no censored subjects.
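A data preparation step along these lines might look like the sketch below. The rows and the credit_score values are made up for illustration, and it assumes the lifelines convention that 1 in the censorship column means the event was observed and 0 means the subject was censored.

```python
import pandas as pd

# Hypothetical raw data; the values are invented for illustration only.
raw_df = pd.DataFrame({
    "duration": [12, 30, 45, 60, 24],
    "paid_off": [1, 0, 1, 1, 0],  # 1 = event observed, 0 = censored
    "credit_score": [640, 710, 580, 690, 720],
    "property_type": ["house", "apartment", "house", "house", "apartment"],
})

# One-hot encode the categorical column. drop_first=True keeps a single
# 0/1 column for a two-level category ("apartment" becomes the baseline).
encoded = pd.get_dummies(
    raw_df, columns=["property_type"], drop_first=True, dtype=int
)

# Rename the dummy column to the shorter name used in this lesson.
mortgage_df = encoded.rename(columns={"property_type_house": "house"})
print(mortgage_df)
```

Dropping one level avoids having two columns that always sum to 1, which would be redundant alongside the model's intercept.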
6. Combining Weibull with AFT: the Weibull AFT model
We also need a model for the survival distribution. The Weibull model is commonly used and its AFT regression implementation is coded in the lifelines package as the WeibullAFTFitter class.
In mortgage_df, all columns except the duration and paid_off columns are covariates. Notice that we replace property_type with a dummy variable house.
To run a Weibull regression, we first import the WeibullAFTFitter class and instantiate it. Then, we call the .fit() method. The first argument is the DataFrame itself, then we fill in the duration_col parameter and the event_col parameter. Both take the column names as strings rather than the columns themselves. Note that this is different from the other lifelines fitters that we learned.
7. Interpreting model output
After fitting, we print the summary property.
Unlike in a linear regression, the coefficients are not read directly: e raised to the power of the coefficient indicates the factor by which the average survival time changes with a one-unit increase in the covariate. This is why the second column, exp(coef), is sometimes more useful.
From the summary, we see that credit score has a large negative coefficient. Each 1-point increase in credit score changes a subject's average time to payoff by a factor of e to the power of -0.16, which is about 0.85, or approximately a 15% decrease.
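The arithmetic behind that interpretation can be checked in a few lines. This sketch uses the coefficient of -0.16 quoted above; the variable names are only illustrative.

```python
import math

coef = -0.16  # credit score coefficient quoted in the text

# Multiplicative change in average survival time per one-unit increase.
factor = math.exp(coef)

# The same change expressed as a percentage.
percent_change = (factor - 1.0) * 100.0

print(round(factor, 2))          # 0.85
print(round(percent_change, 1))  # -14.8
```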
The p-value indicates statistical significance. Credit score, interest, and the intercepts themselves are statistically significant.
8. WeibullAFTFitter with custom formula
Alternatively, we may want to specify which covariates to regress on, or include interaction terms. We can use the formula parameter to specify the right-hand side of the model. This is analogous to a linear model with a coefficient for each term, including the interaction term.
9. Interpreting model output
From the rows containing interest and the interaction term interest times house, we see that each 1-point increase in interest multiplies a house's average payoff time by e to the power of 0.17, or about 118%, and an apartment's average payoff time by e to the power of 0.11, or about 111%.
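With an interaction term, the effect of interest is the sum of two coefficients when house equals 1. The sketch below assumes an interest coefficient of 0.11 and an interaction coefficient of 0.06; these are inferred from the exponents quoted above, not read from the actual summary table.

```python
import math

# Assumed coefficients, inferred from the exponents quoted in the text.
coef_interest = 0.11
coef_interest_x_house = 0.06

# Apartment (house = 0): only the main effect applies.
apartment_factor = math.exp(coef_interest)

# House (house = 1): main effect plus interaction.
house_factor = math.exp(coef_interest + coef_interest_x_house)

print(f"apartment: {apartment_factor:.3f}")
print(f"house:     {house_factor:.3f}")
```

Both factors are greater than 1, so under these assumed coefficients a higher interest value goes with a longer average payoff time for either property type, with a stronger effect for houses.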
10. Let's practice!
Let's practice!