1. The simplest metric
Market basket analysis is centered around the identification and analysis of rules. As we saw earlier in the chapter, there are many rules, even if we limit ourselves to those about the association between two items.
But what if we limit ourselves to useful rules? We'll do that in this video. To get those rules, we'll make use of a metric called support and a process called pruning.
2. Metrics and pruning
A metric is a measure of the performance of a rule. For example, under some metric, the rule "if humor then poetry" might map to the number 0-point-81. The same metric might yield 0-point-23 for "if fiction then travel."
Pruning makes use of a metric to discard rules. For instance, we could keep only those rules with a metric value greater than 0-point-50. In the example we gave, we'd retain "if humor then poetry" and discard "if fiction then travel."
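As a quick sketch of what pruning looks like in code, here is a toy example that uses only the two hypothetical metric values mentioned above; the dictionary of rules is purely illustrative.

```python
# Toy rule-to-metric mapping, using the hypothetical values from the example.
rules = {
    ("humor", "poetry"): 0.81,    # "if humor then poetry"
    ("fiction", "travel"): 0.23,  # "if fiction then travel"
}

# Pruning: keep only rules whose metric value exceeds 0.50.
pruned = {rule: value for rule, value in rules.items() if value > 0.50}
print(pruned)  # {('humor', 'poetry'): 0.81}
```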
3. The simplest metric
The simplest metric is something called support, which measures the frequency with which itemsets appear in transactions.
Support can also be applied to single items. For instance, in the small grocery store dataset, milk is one of the nine items that appear in transactions. We can compute milk's support as the number of transactions that contain milk, divided by the total number of transactions.
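Written as a formula, that calculation is simply:

$$\text{support}(\text{milk}) = \frac{\text{number of transactions containing milk}}{\text{total number of transactions}}$$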
4. Support for language
As a concrete example, let's check the support for language in the first 10 transactions of the bookstore dataset.
Language appears in transactions 1 and 3. Thus, the support value is 2 out of 10, or 0-point-2.
5. Support for {Humor} $\rightarrow$ {Language}
What if we instead want to check the support for an association rule, such as "if humor then language"? We would compute the share of transactions that contained both humor and language.
In that case, we can see there is only one such transaction, so the support is 1 over 10, or 0-point-1. Notice that we would get the same value if we instead computed support for "if language then humor."
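In formula form, with the counts from this example:

$$\text{support}(\{\text{Humor}\} \rightarrow \{\text{Language}\}) = \frac{\text{transactions containing both humor and language}}{\text{total transactions}} = \frac{1}{10} = 0.1$$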
6. Preparing the data
Now that we've defined support, let's see how we can compute it in a more systematic way for all items. We'll focus on the first ten transactions in the bookstore dataset, which we'll assume have been imported as a pandas DataFrame and then converted to a list of lists called "transactions."
We'll next import TransactionEncoder from the preprocessing submodule of mlxtend.
After that, we'll instantiate a transaction encoder and use its fit method to identify the unique items in the dataset.
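A minimal sketch of these preparation steps is shown below. The transaction contents are placeholders, since the actual bookstore transactions aren't listed here; only the structure matters.

```python
from mlxtend.preprocessing import TransactionEncoder

# Placeholder list of lists standing in for the first ten bookstore transactions.
transactions = [
    ["biography", "language"],
    ["fiction", "poetry", "language"],
    ["humor", "poetry"],
    # ... remaining transactions ...
]

# Instantiate the encoder and use fit to identify the unique items in the dataset.
encoder = TransactionEncoder()
encoder.fit(transactions)
```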
7. Preparing the data
We next use the transform method to construct an array of one-hot encoded transactions called onehot. Each column in onehot corresponds to one of the nine items in our dataset. If the item is present in a transaction, this is encoded as TRUE. Otherwise, it is FALSE.
Finally, we'll use this array to construct a DataFrame. We'll use the item names as column headers and will recover them using the columns underscore attribute of encoder.
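Continuing the sketch above (it assumes the encoder and transactions defined there), the transform and DataFrame steps might look like this:

```python
import pandas as pd

# One-hot encode the transactions: one column per item, TRUE/FALSE per transaction.
onehot = encoder.transform(transactions)

# Build a DataFrame, recovering the item names from the encoder's columns_ attribute.
onehot = pd.DataFrame(onehot, columns=encoder.columns_)
print(onehot.head())
```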
8. Computing support for single items
We can now calculate the support metric by computing the mean over each column.
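Because each column holds TRUE/FALSE values, the mean of a column is exactly the share of transactions that contain that item, so one line gives us the support for every single item:

```python
# Support for each single item: the mean of each one-hot column.
support = onehot.mean()
print(support)
```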
9. Computing support for multiple items
What if we want to compute the support for a rule, such as "if fiction then poetry"? We can create a new column in the DataFrame that is TRUE only when both the fiction and poetry columns are TRUE, using NumPy's logical underscore and function with the two columns as arguments.
We can again print the column means to get the support values.
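A sketch of those two steps follows. It assumes the one-hot DataFrame from above has columns named "fiction" and "poetry"; the name of the new column is just an illustrative choice.

```python
import numpy as np

# New column that is TRUE only when both fiction and poetry appear in a transaction.
onehot["fiction+poetry"] = np.logical_and(onehot["fiction"], onehot["poetry"])

# Column means now also include the support for the combined itemset.
print(onehot.mean())
```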
10. Let's practice!
It's now time to compute support in some exercises!