1. The Apriori algorithm
In this video, we'll discuss the Apriori algorithm, which will help us reduce the complexity of large datasets by eliminating low-support itemsets.
2. Counting itemsets
We'll start by revisiting the central problem of market basket analysis: there are many items and, as we know from Chapter 1, many items imply many association rules.
As we'll see soon, the Apriori algorithm prunes itemsets, rather than association rules. There are fewer itemsets than rules, but the number is still enormous.
In particular, the number of itemsets we can generate by choosing k items from a set of n is given by n factorial divided by the product of (n minus k) factorial and k factorial. This is the number of "combinations" and is read as "n choose k."
With the 3461 items in our novelty gifts dataset, there is 1 itemset, the null set, that contains no items, 3461 itemsets with 1 item, almost 6 million with 2, almost 7 billion with 3, and almost 6 trillion with 4.
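To make these counts concrete, here is a quick sketch that reproduces them with Python's built-in math.comb; the only input is the item count of 3461 quoted above.

```python
# Sketch: verify the itemset counts quoted above using the built-in
# combinations function (Python 3.8+).
from math import comb

n = 3461  # number of distinct items in the novelty gifts dataset
for k in range(5):
    print(k, comb(n, k))
# k = 0 -> 1 (the null set), k = 1 -> 3461, k = 2 -> ~6 million,
# k = 3 -> ~7 billion, k = 4 -> ~6 trillion
```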
3. Counting itemsets
What if we allow any number of items? We can get this by summing n choose k over all values of k from 0 to n. The total number of itemsets of any size we can generate from n items is 2 to the nth.
For the gift dataset, this is 2 to the 3461st. This number is much larger than 10 to the 82nd, which is the estimated number of atoms in the observable universe.
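A short sketch of both claims, checking the identity that the combinations sum to 2 to the n, and comparing 2 to the 3461st against 10 to the 82nd on the log scale to avoid printing an astronomically long integer.

```python
# Sketch: the sum of n choose k over k equals 2**n, and 2**3461 dwarfs 10**82.
from math import comb, log10

n = 10
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n  # holds for any n

print(3461 * log10(2))  # ~1041.9, so 2**3461 is on the order of 10**1042
```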
4. Reducing the number of itemsets
If the number of itemsets is greater than the number of atoms in the universe, it will, of course, be impossible to evaluate them all. We can't, in fact, even enumerate them all.
But how can we remove an itemset without even evaluating it first? We could use a crude rule, such as excluding all itemsets with more than k items, but as we've seen, this k would probably have to be three, which would eliminate a lot of interesting rules.
The Apriori algorithm offers an alternative approach that does not require the enumeration of all itemsets and uses a sensible criterion for pruning large numbers of itemsets.
5. The Apriori principle
The Apriori algorithm is structured around the idea that we should retain itemsets that are frequent -- that is, itemsets whose support exceeds some minimum level.
The Apriori principle states that subsets of frequent itemsets must also be frequent. Equivalently, any itemset that contains an infrequent itemset must itself be infrequent. The Apriori algorithm uses this principle to retain frequent itemsets and prune those that cannot be frequent.
What does this mean in practice? Let's assume we're using the aggregated data and find that candles aren't frequent -- that is, they fall below the minimum support. That means the itemset containing candles and signs also isn't frequent. And if candles and signs aren't frequent together, then the itemset with candles, signs, and boxes isn't frequent either. And so on.
Computing support just once for candles allowed us to eliminate half of all possible itemsets -- every itemset that contains candles -- without even enumerating them.
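Here is an illustrative sketch of the principle on a tiny, made-up one-hot table; the item names match the aggregated categories mentioned above, but the transactions and threshold are invented for illustration.

```python
# Sketch: once a single item falls below the support threshold, every
# itemset containing it can be pruned without computing its support.
import pandas as pd

onehot_toy = pd.DataFrame({
    "candles": [0, 0, 0, 1],
    "signs":   [1, 1, 0, 1],
    "boxes":   [1, 0, 1, 1],
})
min_support = 0.5

support_candles = onehot_toy["candles"].mean()  # 0.25
print(support_candles < min_support)            # True: candles are infrequent

# By the Apriori principle, {candles, signs}, {candles, boxes}, and
# {candles, signs, boxes} cannot be frequent either, so they are never evaluated.
# In general, itemsets containing a given item make up 2**(n-1) of the 2**n
# possibilities -- half of them.
```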
6. Apriori implementation
Let's apply the Apriori algorithm to the one-hot encoded data from the novelty gifts store. We'll import apriori from mlxtend's frequent_patterns submodule and print the header of the one-hot encoded DataFrame.
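A minimal sketch of that step, assuming the encoded transactions live in a DataFrame named onehot loaded from a placeholder file; neither name comes from the video.

```python
# Sketch: import the apriori function and preview the one-hot encoded data.
# "onehot_data.csv" is a placeholder path for the encoded transactions.
import pandas as pd
from mlxtend.frequent_patterns import apriori

onehot = pd.read_csv("onehot_data.csv")
print(onehot.head())  # rows are transactions, columns are items, values are 0/1 or True/False
```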
7. Apriori implementation
We can then apply the Apriori algorithm, which takes a minimum support value, along with an optional argument for the maximum itemset length. Setting use_colnames to True will return the items in each itemset by name, rather than by column number.
Printing the length of the output, we can see that we've gone from 6 trillion itemsets to just 3652.
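A hedged sketch of the call, continuing from the snippet above; the minimum support value of 0.006 is a placeholder, since the exact threshold used in the video is not restated here.

```python
# Sketch: run Apriori with a minimum support threshold and a cap on itemset size.
frequent_itemsets = apriori(onehot,
                            min_support=0.006,  # placeholder threshold
                            max_len=4,          # optional cap on itemset length
                            use_colnames=True)  # label itemsets with item names

print(len(frequent_itemsets))  # far fewer itemsets than the trillions we started from
```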
8. Apriori implementation
Let's also preview the output, which is a DataFrame with a column for the support and another column for the itemsets themselves.
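Previewing the result from the sketch above:

```python
# The output DataFrame has a 'support' column and an 'itemsets' column of frozensets.
print(frequent_itemsets.head())
```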
9. Let's practice!
It's now time to practice applying the Apriori algorithm in some exercises.