
Data preparation for purchase prediction

1. Data preparation for purchase prediction

Great job! Now, we will prepare data to predict next month's transactions using linear regression.

2. Regression - predicting continuous variable

We will use the second type of supervised learning, called regression. Unlike classification, which we used to predict customer churn, in regression the target variable is either a continuous or a count value. The simplest model is linear regression, which fits a linear equation whose coefficients describe how much a one-unit change in a feature changes the outcome variable. There are more complex models, like Poisson or negative binomial regression, that work well with count data, but in this course we will focus on feature engineering and the basic principles.
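To make the coefficient interpretation concrete, here is a minimal sketch with made-up data, where the outcome rises by exactly 2 units per 1-unit increase in the feature, so the fitted coefficient recovers that effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: outcome = 2 * feature + 5, so a 1-unit change
# in the feature changes the outcome by 2 units
X = np.arange(10).reshape(-1, 1)  # single feature, shape (10, 1)
y = 2 * X.ravel() + 5

model = LinearRegression().fit(X, y)
print(model.coef_[0])    # ~2.0: effect of a 1-unit change in the feature
print(model.intercept_)  # ~5.0
```

With real, noisy data the coefficient would of course only approximate the underlying effect.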

3. Recency, frequency, monetary (RFM) features

To build the input features, we will start with the so-called RFM features, which stand for recency, frequency and monetary value. The concepts behind RFM underlie many feature engineering methods. Here we will calculate them at the customer level. Let's get a little bit into what each of them means. Recency is the time - typically in days - since the last customer transaction. Frequency is the number of purchases in the observed period. Typically, that's a full year, but it depends on the business type and product lifecycle. Finally, monetary value is the total amount or revenue the customer spent in the observed period.

4. Explore the sales distribution by month

We will predict the number of purchases for the future month. To do that, we need to decide how we will define that future month. We can print out the number of observations we have in each invoice month, and we can see that there are enough observations in all of them. We will use the last month in our dataset - November 2011 - as the period for the target variable.
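A quick way to check the monthly distribution is to count observations per invoice month. The sketch below uses a toy DataFrame; the column name `InvoiceMonth` follows the lesson, but the values are made up for illustration:

```python
import pandas as pd

# Toy transactions; InvoiceMonth is the assumed month column from the lesson
online = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', '536367', '536368'],
    'InvoiceMonth': ['2011-09', '2011-10', '2011-11', '2011-11'],
})

# Count observations in each invoice month to confirm every period has data
monthly_counts = online.groupby('InvoiceMonth').size()
print(monthly_counts)
```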

5. Separate feature data

Let's build the features now. First, we exclude the target month data from our dataset so we don't accidentally leak these values into our independent variables. We store the filtered dataset as online underscore X. Then, to calculate recency, we need to create an artificial snapshot date. We set it to the first day of the target timeframe, as if we were pulling the data up to November 1st to predict November sales. In real life, we might pull the latest current data and use that day's date, but when working with historical values, this step is required to get an accurate recency number. Then, we group by CustomerID and calculate multiple features using the aggregate method called agg. First, we calculate recency as the difference in days between the previously defined snapshot date and the customer's latest invoice date; here, we use a lambda to write a simple one-line function. Then, we calculate frequency by counting the unique number of invoices. Third, we sum the revenue spent by the customer to get the monetary value. Finally, we add some extra features built on the Quantity variable: the average and the total quantity purchased by the customer. We reset the index to make sure the CustomerID is stored as a column, not as an index by default, so we can use it later. Then, we assign new names to the columns to make them more readable.
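The steps above can be sketched as follows. The column names (`CustomerID`, `InvoiceNo`, `InvoiceDate`, `InvoiceMonth`, `Quantity`, `TotalSum`) follow the lesson's dataset, but the data here is a small made-up sample:

```python
import pandas as pd

# Toy version of the online transactions data (values invented for illustration)
online = pd.DataFrame({
    'CustomerID': [1, 1, 2, 1],
    'InvoiceNo': ['A1', 'A2', 'B1', 'C1'],
    'InvoiceDate': pd.to_datetime(
        ['2011-09-10', '2011-10-20', '2011-10-25', '2011-11-05']),
    'InvoiceMonth': ['2011-09', '2011-10', '2011-10', '2011-11'],
    'Quantity': [10, 4, 6, 2],
    'TotalSum': [25.0, 10.0, 18.0, 5.0],
})

# 1. Exclude the target month (November 2011) from the feature data
online_X = online[online['InvoiceMonth'] != '2011-11']

# 2. Artificial snapshot date: first day of the target timeframe
NOW = pd.Timestamp('2011-11-01')

# 3. Aggregate RFM and quantity features per customer
features = online_X.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (NOW - x.max()).days,  # recency in days
    'InvoiceNo': pd.Series.nunique,                 # frequency
    'TotalSum': 'sum',                              # monetary value
    'Quantity': ['mean', 'sum'],                    # extra quantity features
}).reset_index()

# 4. Assign readable column names
features.columns = ['CustomerID', 'recency', 'frequency',
                    'monetary', 'quantity_avg', 'quantity_total']
print(features)
```

Note that mixing single and multiple aggregations per column produces a MultiIndex on the columns, which is why we flatten it by assigning the new names directly.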

6. Review features

Let's check the top rows of the new features dataset. As you can see, we now have a row for each customer with five newly calculated features.

7. Calculate target variable

Great. Now we will calculate the target variable, along with some optional features you can try out later by yourself. We take the original, unfiltered online dataset and build a pivot table with customers as rows (passed to the index parameter), InvoiceMonth as columns, and the invoice number as the values. We pass the pandas Series nunique function to count the unique invoices for each customer per month, and make sure missing values are filled with zeros. Let's print it out to see what we get. The result is a pivot table with the monthly number of unique invoices per customer. We will only use the last month as the target variable, but you can also try using the other monthly customer purchase data as input features to test whether they improve model performance. These are so-called lagged features: they are the same metric as the target variable, but recorded prior to the event we are trying to predict. Now, we need to select the target variable.
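The pivot table step can be sketched like this, again on a toy sample with the lesson's assumed column names:

```python
import pandas as pd

# Toy transactions (values invented for illustration)
online = pd.DataFrame({
    'CustomerID': [1, 1, 2, 1],
    'InvoiceNo': ['A1', 'A2', 'B1', 'C1'],
    'InvoiceMonth': ['2011-09', '2011-10', '2011-10', '2011-11'],
})

# Monthly number of unique invoices per customer; missing months become 0
cust_month_tx = pd.pivot_table(
    data=online,
    index='CustomerID',          # customers as rows
    columns='InvoiceMonth',      # one column per month
    values='InvoiceNo',
    aggfunc=pd.Series.nunique,   # count unique invoices
    fill_value=0,
)
print(cust_month_tx)
```

The last column (the target month) becomes the regression target, while the earlier columns are the lagged features mentioned above.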

8. Finalize data preparation and split to train/test

First, we store the column names for the customer id and the target variable as separate lists. Then, we extract the target column from the previously built pivot table. Finally, we extract the feature column names from the features dataset, exclude the customer id, and extract the features into a dataset called X. One thing to mention: we are not using the other monthly purchase information from the pivot table. We will build the model on the features dataset only, but we encourage you to test whether including these lagged monthly features on top of the RFM ones would improve model performance.
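A minimal sketch of this selection step, assuming `features` and `cust_month_tx` were built as in the earlier slides (toy values here):

```python
import pandas as pd

# Assumed inputs: per-customer features and the monthly invoice pivot table
features = pd.DataFrame({
    'CustomerID': [1, 2],
    'recency': [12, 7],
    'frequency': [2, 1],
    'monetary': [35.0, 18.0],
})
cust_month_tx = pd.DataFrame(
    {'2011-10': [1, 1], '2011-11': [1, 0]},
    index=pd.Index([1, 2], name='CustomerID'),
)

# Store identifier and target column names as separate lists
custid = ['CustomerID']
target = ['2011-11']

# Target: number of November 2011 transactions per customer
Y = cust_month_tx[target]

# Features: every column in the features dataset except the customer id
cols = [col for col in features.columns if col not in custid]
X = features[cols]
print(X.shape, Y.shape)
```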

9. Split data to training and testing

Finally, we split the data into training and testing sets, assign 25% to the testing dataset, and print the dataset dimensions to confirm the allocation worked as expected.
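This split can be sketched with scikit-learn's `train_test_split`; the feature matrix and target below are placeholders for the `X` and `Y` built above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (20 customers, 2 features) and target
X = np.arange(40).reshape(20, 2)
Y = np.arange(20)

# Hold out 25% of the rows for testing; random_state makes the split repeatable
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=0.25, random_state=99,
)
print(train_X.shape, test_X.shape)  # (15, 2) (5, 2)
```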

10. Let's work on data preparation exercises!

Great work everyone! Now let's go practice data preparation and feature engineering!