1. Aggregation
In this video, we'll introduce the concept of aggregation, which can be used to simplify market basket analysis problems that involve many items.
2. Exploring the data
We'll work with a new dataset of novelty gifts from an online store, which we'll also use throughout the chapter.
Let's load the data and preview its contents.
The DataFrame contains two columns: the invoice number and the description of an item purchased.
Notice that the first five items are included in the same invoice.
3. Exploring the data
How many transactions are there in the dataset? Applying the unique method to the invoice number column and counting the results, we can see that there are 9709.
And how many unique items appear in the transactions? 3461.
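As a minimal sketch of those two counts, here is the same idea on a tiny made-up DataFrame. The column names InvoiceNo and Description are assumptions, as are the rows, so the numbers below are not the ones from the real dataset.

```python
import pandas as pd

# Toy stand-in for the gift data; column names and rows are hypothetical.
gifts = pd.DataFrame({
    "InvoiceNo": [1001, 1001, 1001, 1002, 1003, 1003],
    "Description": [
        "RED GIFT BAG", "TRINKET BOX", "PARTY BUNTING",
        "RED GIFT BAG", "TRINKET BOX", "JUMBO BAG",
    ],
})

# unique() returns the distinct values; len() counts them.
n_transactions = len(gifts["InvoiceNo"].unique())
n_items = len(gifts["Description"].unique())

print(n_transactions, n_items)
```

The nunique method of a pandas Series would give the same counts in a single call.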
In the previous chapter, we worked with the GoodReads dataset, which contained many user libraries that we used to construct transaction-like objects. We did not, however, consider the full set of items (that is, books) in those libraries; instead, we restricted ourselves to comparisons between small sets of books.
We're now faced with a much more substantial challenge: we have many transactions and many items and we need to find useful rules.
4. Pruning and aggregation
When there are too many items to evaluate all possible association rules, we can use pruning or aggregation to simplify the problem.
Pruning, which we'll return to later in this chapter, removes items and rules with low support or poor performance along some other measure.
Aggregation, which we'll discuss in this video, groups items together into categories or aggregates. This reduces the market basket analysis problem to the identification of rules between categories of items.
In our case, this might mean selecting the subset of items that are bags and boxes and aggregating them into bag and box categories.
5. Aggregating the data
So far, we've looked at the data in its original form, a DataFrame with one item per row. Before using this data, though, we would apply the standard pre-processing pipeline to convert it into one-hot encoded format.
We'll load it in one-hot encoded format here and then preview the data again. Notice that each column corresponds to an item and each row corresponds to a transaction. A column entry is True if the item appears in a transaction and False otherwise.
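The pre-processing pipeline itself isn't shown here, so the following is only one possible way to get from one-item-per-row data to one-hot encoded format, using pandas alone. The column names and rows are assumptions for illustration.

```python
import pandas as pd

# Toy one-item-per-row data; column names and rows are hypothetical.
gifts = pd.DataFrame({
    "InvoiceNo": [1001, 1001, 1001, 1002, 1003, 1003],
    "Description": [
        "RED GIFT BAG", "TRINKET BOX", "PARTY BUNTING",
        "RED GIFT BAG", "TRINKET BOX", "JUMBO BAG",
    ],
})

# crosstab counts how often each item appears in each invoice;
# comparing with 0 turns the counts into True/False entries.
onehot = pd.crosstab(gifts["InvoiceNo"], gifts["Description"]) > 0

print(onehot)
```

Each row of onehot is now a transaction and each column is an item, which is the layout described above.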
6. Aggregating the data
In practice, you will often encounter datasets that have already been aggregated into categories, such as clothing, food, and technology. However, for our purposes, we will assume you are working with one-hot encoded data in CSV format.
Given the circumstances, you will need to aggregate the data yourself. Let's start by using list comprehensions to collect the column names related to bags and boxes. Notice that we have converted each column name to lowercase letters and checked whether the relevant string is present.
Next, we select the columns in the DataFrame that correspond to the lists of bag and box column headers.
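Here is a small sketch of those two steps. The one-hot DataFrame and its item names are made up; the variable names are also just illustrative choices.

```python
import pandas as pd

# Toy one-hot encoded data: one column per item, one row per transaction.
onehot = pd.DataFrame({
    "RED GIFT BAG": [True, True, False],
    "JUMBO BAG": [False, False, True],
    "TRINKET BOX": [True, False, True],
    "PARTY BUNTING": [True, False, False],
})

# Lowercase each column name and keep it if the relevant string is present.
bag_headers = [col for col in onehot.columns if "bag" in col.lower()]
box_headers = [col for col in onehot.columns if "box" in col.lower()]

# Select the columns that belong to each category.
bags = onehot[bag_headers]
boxes = onehot[box_headers]

print(bag_headers, box_headers)
```

Keep in mind that a plain substring check will also match item names that merely contain the string, so it's worth glancing over the resulting lists.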
7. Aggregating the data
We then sum across the columns of each selection and check whether the total is at least one, which tells us whether any item in the transaction is a bag or a box.
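A minimal sketch of that check, using a made-up selection of bag columns:

```python
import pandas as pd

# Toy selection of bag columns from one-hot encoded data (hypothetical items).
bags = pd.DataFrame({
    "RED GIFT BAG": [True, True, False, False],
    "JUMBO BAG": [False, True, False, True],
})

# axis=1 sums across the columns, giving the number of bag items per transaction.
bag_counts = bags.sum(axis=1)

# A transaction contains a bag if that count is at least one.
has_bag = bag_counts >= 1

print(has_bag.tolist())
```

The same two lines would be repeated for the box columns.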
8. Aggregating the data
Finally, we use numpy's vstack function, which stacks arrays vertically, to merge bags and boxes into a single array. We then pass that array, transposed so that each row is a transaction, along with a list of column names, to a pandas DataFrame.
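As a rough sketch of this last step, assume bags and boxes are one-dimensional boolean arrays with one entry per transaction. The values and the column names "bag" and "box" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-transaction indicators for each category.
has_bag = np.array([True, True, False, True])
has_box = np.array([True, False, False, True])

# vstack places each array in its own row: shape (2, 4).
stacked = np.vstack([has_bag, has_box])

# Transpose so that rows are transactions and columns are categories.
aggregated = pd.DataFrame(stacked.T, columns=["bag", "box"])

print(aggregated)
```

Without the transpose, the array would have one row per category rather than one row per transaction, and the two column names would not fit.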
9. Market basket analysis with aggregates
The standard aggregation process works as follows. First, items are mapped to categories, as we've done. Next, we compute metrics, allowing us to identify useful rules.
We can, for instance, compute the support metric by applying the mean method to the DataFrame.
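To see why the mean works, here is a small sketch on a made-up aggregated DataFrame. Since True counts as 1 and False as 0, the mean of a column is the share of transactions that contain that category. The joint support line is an extra illustration, not something computed in the video.

```python
import pandas as pd

# Toy aggregated data: one column per category, one row per transaction.
aggregated = pd.DataFrame({
    "bag": [True, True, False, True],
    "box": [True, False, False, True],
})

# Support of each category: fraction of transactions containing it.
support = aggregated.mean()

# Support of bag and box together: fraction of transactions with both.
joint_support = (aggregated["bag"] & aggregated["box"]).mean()

print(support)
print(joint_support)
```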
10. Let's practice!
We now know how to perform aggregation. Once that is done, everything else we've learned previously can be applied, so let's do that in some exercises.