1. Churn prevention in online marketing
Welcome to this chapter! Our next topic is churn prevention in online marketing. We will be using logistic regression for this. I hope you enjoy it!
2. Churn prevention
Imagine you run an online shop. Since gaining new customers is far more expensive than keeping existing ones, you would like to distinguish loyal customers from those who buy only once. The latter could then receive an incentive, such as a coupon.
The tool we use to predict the probability of a customer returning to an online shop is called binary logistic regression.
Let's take a look at the details!
3. Binary logistic regression
The measure of interest is the probability of a customer churning (formula 1). Unfortunately, it is not easy to model this probability directly.
If I use a linear model as in the last chapter, I can end up with nonsensical predictions, such as probabilities less than zero or greater than one.
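To make this concrete, here is a minimal sketch on simulated toy data (not the course dataset; the variable names `x`, `y`, and `toy` are made up for illustration). It fits a linear model and a binary logistic regression to the same binary outcome and compares the range of their predictions.

```r
# Simulated toy data (an assumption for illustration, not the course data):
# one numeric predictor and a binary outcome.
set.seed(42)
n <- 500
x <- rnorm(n, mean = 0, sd = 2)
p_true <- plogis(-1 + 1.5 * x)            # true probability of y = 1
y <- rbinom(n, size = 1, prob = p_true)
toy <- data.frame(x = x, y = y)

# Linear model on a 0/1 outcome: fitted values are not confined to [0, 1].
lin_fit <- lm(y ~ x, data = toy)
lin_pred <- predict(lin_fit)

# Binary logistic regression: predictions on the response scale are probabilities.
logit_fit <- glm(y ~ x, family = binomial, data = toy)
logit_pred <- predict(logit_fit, type = "response")

range(lin_pred)    # typically extends below 0 and/or above 1
range(logit_pred)  # stays between 0 and 1
```

The only difference between the two fits is the model family, yet only the logistic model keeps every prediction inside the valid range for a probability.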
What I can model are the so-called log odds, which you can see in the second equation. Removing the log by applying the exponential function gives us the odds, that is, the probability of churning divided by the probability of not churning. This is the third equation. I will get back to these for the interpretation of the model coefficients later.
In the fourth equation, you can see the final model. It gives us the probability of the target variable being equal to one, or the probability of a customer churning.
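The slide equations are not reproduced in this transcript. The block below is a sketch of the standard textbook formulation that matches the description; the symbols ($\pi$ for the churn probability, $\beta_k$ for the coefficients, $x_k$ for the $K$ explanatory variables) are assumptions and may differ from the notation on the slide.

```latex
\begin{align}
  \pi &= P(Y = 1 \mid x_1, \dots, x_K) \tag{1} \\
  \log\frac{\pi}{1 - \pi} &= \beta_0 + \sum_{k=1}^{K} \beta_k x_k \tag{2} \\
  \frac{\pi}{1 - \pi} &= \exp\!\Big(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\Big) \tag{3} \\
  \pi &= \frac{\exp\!\big(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\big)}
              {1 + \exp\!\big(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\big)} \tag{4}
\end{align}
```

Equation (3) follows from (2) by exponentiating both sides, and (4) follows from (3) by solving for $\pi$.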
That's enough about the statistical background. Let's tackle the real-life problem and take a look at the data!
4. Data discovery I
The dataset we will be using includes around 45,000 observations of 21 variables. They describe the previous purchases and give information about the customer, for example, title or newsletter subscription status. You can see an extract of the dataset on the slide.
In the next step, we will investigate returnCustomer, our variable of interest, using a bar chart.
5. Data discovery II
I pass the dataset to the `ggplot()` function from the `ggplot2` package and use the `aes()` function to specify `returnCustomer` as the variable of interest. Then, I add the layer `geom_histogram()` with `stat = "count"` to tell R to draw a bar chart.
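The call described here might look like the following sketch. The data frame name `churn_data` and its ten made-up rows are assumptions that stand in for the course dataset; only the column name `returnCustomer` comes from the text.

```r
library(ggplot2)

# Hypothetical stand-in for the course dataset: only the variable of interest,
# filled with made-up values.
churn_data <- data.frame(
  returnCustomer = factor(c(0, 0, 0, 0, 1, 0, 1, 0, 0, 1))
)

# Map returnCustomer to the x axis and count the rows per category.
# geom_histogram() with stat = "count" may emit a warning about ignored
# binning parameters; geom_bar() draws the same chart without it.
p <- ggplot(churn_data, aes(x = returnCustomer)) +
  geom_histogram(stat = "count")

print(p)
```

Because `stat = "count"` replaces the default binning with a simple tally per category, the result is a bar chart rather than a histogram.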
The plot shows that around 9,000 customers returned to the online shop and approximately 38,000 didn't. With returners making up only about a fifth of the customers (roughly 24 returners per 100 non-returners), the dependent variable is clearly imbalanced in this case.
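As a quick check of the class balance, the following sketch uses the approximate counts read off the bar chart, not exact values from the dataset. It computes both the share of returning customers and the ratio of returners to non-returners.

```r
# Approximate counts as read off the bar chart (not exact dataset values).
counts <- c(returned = 9000, not_returned = 38000)

# Share of all customers who returned.
share_returning <- counts[["returned"]] / sum(counts)

# Returners relative to non-returners (the odds of returning).
ratio_returning <- counts[["returned"]] / counts[["not_returned"]]

round(share_returning, 2)  # 0.19
round(ratio_returning, 2)  # 0.24
```

On the real data, the same shares could be obtained with `prop.table(table(churn_data$returnCustomer))`, where `churn_data` is a hypothetical name for the data frame.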
6. Let's start analyzing!
Now, I think it is time for an exercise!