Logistic regression algorithm
Let's dig into the internals and implement a logistic regression algorithm ourselves. Since statsmodels' logit() function is very complex, you'll stick to implementing simple logistic regression, with a single explanatory variable, for a single dataset.
Rather than using the sum of squares as the metric, we want to use likelihood. Since log-likelihood is more computationally stable, we'll use that instead. There is one more change: we want to maximize the log-likelihood, but minimize() defaults to finding minimum values, so it is easier to calculate the negative log-likelihood and minimize that.
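To see why the log scale is more computationally stable, consider what happens when many per-observation likelihoods are multiplied together. A minimal sketch (the probability values here are made up for illustration):

import numpy as np

# 1,000 hypothetical per-observation likelihoods of 0.1 each
probs = np.full(1000, 0.1)

# The raw likelihood (their product) underflows to zero ...
print(np.prod(probs))        # 0.0
# ... while the log-likelihood (sum of logs) stays representable
print(np.log(probs).sum())   # approximately -2302.6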
The log-likelihood value for each observation is

$$ \log(y_{pred}) \cdot y_{actual} + \log(1 - y_{pred}) \cdot (1 - y_{actual}) $$
The metric to calculate is the negative sum of these log-likelihood contributions.
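For example, if $y_{pred} = 0.9$ for an observation where $y_{actual} = 1$, its contribution is $\log(0.9) \cdot 1 + \log(0.1) \cdot 0 = \log(0.9) \approx -0.105$; if that observation instead had $y_{actual} = 0$, the contribution would be $\log(0.1) \approx -2.303$. Confident predictions are rewarded when right and penalized heavily when wrong.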
The explanatory values (the time_since_last_purchase column of churn) are available as x_actual.
The response values (the has_churned column of churn) are available as y_actual.
logistic is imported from scipy.stats, and logit() and minimize() are also loaded.
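For reference, here is a minimal sketch of that preloaded environment, assuming logit() comes from statsmodels.formula.api and minimize() from scipy.optimize. The file name churn.csv is an assumption for illustration; in the exercise these objects are already loaded for you.

import numpy as np
import pandas as pd
from scipy.stats import logistic
from scipy.optimize import minimize
from statsmodels.formula.api import logit

# Hypothetical file path; the exercise provides churn preloaded
churn = pd.read_csv("churn.csv")

# Explanatory and response values, as described above
x_actual = churn["time_since_last_purchase"]
y_actual = churn["has_churned"]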
Have a go at this exercise by completing the sample code; filled in, the function looks like this:
# Complete the function
def calc_neg_log_likelihood(coeffs):
    # Unpack coeffs into the intercept and slope
    intercept, slope = coeffs
    # Calculate predicted y-values: logistic CDF of the linear predictor
    y_pred = logistic.cdf(intercept + slope * x_actual)
    # Calculate the log-likelihood of each observation
    log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    # Calculate negative sum of log_likelihood
    neg_sum_ll = -np.sum(log_likelihood)
    # Return negative sum of log_likelihood
    return neg_sum_ll

# Test the function with intercept 10 and slope 1
print(calc_neg_log_likelihood([10, 1]))
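From here, you can pass the function to minimize() to find the coefficients that maximize the likelihood, and compare them against statsmodels. A sketch under the assumptions above (the initial guess x0=[0, 0] is arbitrary):

# Minimize the negative log-likelihood, starting from intercept 0, slope 0
result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
print(result.x)  # fitted [intercept, slope]

# These should closely match the coefficients fitted by statsmodels
print(logit("has_churned ~ time_since_last_purchase", data=churn).fit().params)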