1. Predicting customer transactions

Fantastic! Now we are ready to predict next month's transactions.

2. Modeling approach

We will use linear regression to predict next month's transactions. The steps we'll follow are the same as the ones we used for logistic regression, or any other supervised learning model.

3. Modeling steps

Let's go through the supervised learning modeling steps again: The first step is splitting the data into training and testing sets. The second step is initializing the model. Then, we fit the model on the training dataset. Afterwards, we predict the values on the testing data. And finally, we evaluate the model performance by comparing the predicted and the actual values in the testing data. Since we've already learned how to split the data into training and testing sets, we will start from the second step, but a short sketch of the split is shown below as a refresher.
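Here is a minimal, self-contained sketch of the split step with scikit-learn. The synthetic data and the variable names are illustrative assumptions, not the course dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features and a continuous target
# (an assumption for illustration, not the course dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.5, -0.2, 0.3, 0.1, 0.05]) + rng.normal(scale=0.5, size=500)

# Step 1: split the data into training and testing sets
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=0)
```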

4. Regression performance metrics

To measure regression performance, we will use different metrics than for classification, given the outcome is a continuous variable. The lower these metrics, the better the performance of the model. The key metrics for measuring regression performance are: Root mean squared error, or RMSE for short. This is the square root of the average squared difference between the predicted and the actual values. We calculate it by building the model, then subtracting the predicted values from the actual values, squaring the differences, calculating their average, and taking the square root to get a normalized error measurement. Squaring the differences is important to make sure the positive and the negative errors don't cancel each other out when summing them up. Another metric is the mean absolute error. It is similar to the root mean squared error, only here we take the average of the absolute differences between predicted and actual values. This way we don't need to square the differences, and the metric is less vulnerable to outliers. A final commonly used metric is the mean absolute percentage error, which measures the percentage difference between the predicted and the actual values. It's an intuitive metric, as it is normalized to be within 0 and 100%, but it requires the actual target values to be higher than zero to avoid a division-by-zero error. That requirement is not met in our example, so we won't use it here.
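For reference, with y_i the actual value, ŷ_i the predicted value, and n the number of observations, the three metrics described above can be written as:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
```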

5. Additional regression and supervised learning metrics

There are some additional regression metrics we will explore. R-squared is a statistical measure of the proportion of the variance in the target variable that is explained by the model. It can only be calculated for regression models, not classification. The higher the percentage, the better our model explains the variance. Another important metric is the p-value, which measures the statistical evidence against the so-called null hypothesis, which states that the model or coefficient estimates were observed due to chance. The lower the p-value, the better. Typically, we look for p-values below 5 or 10% to conclude that the model or coefficients are statistically significant.
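In formula terms, R-squared compares the model's squared errors to the total variance around the mean ȳ of the actual values:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```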

6. Fitting the model

Great, now we can build the model. First, we import the linear regression module. Then, we initialize its instance. After that, we fit the model on the training data. Finally, we predict the values using the trained model on both the training and the testing datasets.
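A minimal sketch of these four steps, continuing the earlier split example (train_X, train_Y, and test_X are assumed to exist from that sketch):

```python
from sklearn.linear_model import LinearRegression

# Initialize the model instance
linreg = LinearRegression()

# Fit the model on the training data
linreg.fit(train_X, train_Y)

# Predict on both the training and the testing datasets
train_pred_Y = linreg.predict(train_X)
test_pred_Y = linreg.predict(test_X)
```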

7. Measuring model performance

Measuring regression performance is as straightforward as it is for classification. First, we import the measurement functions. We can see that the second one is called mean squared error, so we will have to calculate its square root ourselves. Next, we calculate the metrics by feeding in the actual values and then the predicted values. We do this for both training and testing, and then print the results. As you can see, the errors are slightly higher for the test data, which is expected: the training data was used in model building, so the model matches its patterns better than those of the unseen testing data. The mean absolute error is smaller, as it is less sensitive to outliers where the error is higher, which the mean squared error methodology emphasizes by squaring them. The mean absolute error tells us that, comparing the actual transactions in November to the predicted ones, our model is off by roughly half a transaction. Not too bad for a simple linear model with just 5 features!
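Continuing the sketch, this is one way to compute the two error metrics with scikit-learn; note that we take the square root of the mean squared error ourselves:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# RMSE: square root of the mean squared error, computed manually
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))

# MAE: average of the absolute differences
mae_train = mean_absolute_error(train_Y, train_pred_Y)
mae_test = mean_absolute_error(test_Y, test_pred_Y)

print('RMSE train: {:.3f}; RMSE test: {:.3f}'.format(rmse_train, rmse_test))
print('MAE train: {:.3f}; MAE test: {:.3f}'.format(mae_train, mae_test))
```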

8. Interpreting coefficients

The final step is to interpret the coefficients. While we can extract the coefficients the same way we did with logistic regression, here we want to introduce the concept of statistical significance. We want to make sure the coefficients are not random, so that they can be used for model interpretation. The standard statistical significance threshold is 95%. The statsmodels library provides functionality to build statistical models like sklearn does, and more, as well as to print an in-depth model performance summary with multiple performance metrics. Most of them are more interesting to a statistician, but some are important to the analyst as well.

9. Build regression model with statsmodels

Building a linear regression model with statsmodels is similar to scikit-learn. First, we import the library's api. Then, it's important to transform the target variable to a numpy array. Afterwards, we initialize the model by calling the ordinary least squares function, abbreviated OLS. And then, we fit the model on the training data. One difference from scikit-learn here: we pass the training data in the initialization step, not in the model fitting step. Finally, we print the model summary.
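A sketch of this workflow, reusing train_X and train_Y from the earlier split example (one assumption worth noting: unlike scikit-learn, statsmodels OLS does not include an intercept unless you add a constant column yourself):

```python
import numpy as np
import statsmodels.api as sm

# Transform the target variable to a numpy array
train_Y_arr = np.array(train_Y)

# Initialize the model: the training data is passed here, not at fit time.
# Note: sm.OLS does not add an intercept by default; sm.add_constant(train_X)
# would add one if needed.
olsreg = sm.OLS(train_Y_arr, train_X)

# Fit the model and print the in-depth summary table
olsreg = olsreg.fit()
print(olsreg.summary())
```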

10. Regression summary table

Now, this is just a part of the actual summary; you will see the full table in the exercises following this lesson. There are two main things we will analyze here: R-squared and p-values.

11. Interpreting R-squared

The R-squared metric is in the top right. This is the percentage of explained variance. It means that the model explained roughly 48.8% of the variation. It's not uncommon to see both higher and lower values, but it's a good metric for additionally assessing model performance. Typically, a low R-squared value means the model poorly fits the variation in the target variable.

12. Interpreting coefficient p-values

Next, we check the statistical significance of the coefficients. We can see all five features listed in the table. The first value is the actual coefficient, which can be interpreted as the change in the output variable given a one unit increase in the feature. For example, the coefficient for frequency is roughly 0.13. This means that a customer whose frequency is higher by 1 unit (or 1 invoice) in the pre-November period will have 0.13 more invoices in November, on average. Some of these coefficients are not statistically significant, though. If we decide that our statistical significance level is 95%, then we will only interpret coefficients with a p-value lower than or equal to 1 minus the significance level, or 5%. There are only two coefficients with a p-value at or below 5%: frequency and quantity_total. If we reduce the significance level to 90%, then we can interpret the recency coefficient as statistically significant as well.
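As a follow-up to the earlier statsmodels sketch, the fitted results object exposes the coefficient p-values directly, so the 5% rule described above can be applied programmatically (olsreg is the fitted model from that sketch):

```python
# Keep only the coefficients whose p-value is at or below the 5% threshold,
# i.e. statistically significant at the 95% level
significant = olsreg.pvalues[olsreg.pvalues <= 0.05]
print(significant)
```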

13. Let's build some regression models!

Great progress! You've learned the second type of supervised learning technique: regression modeling. Now, let's go practice!