Confidence and lift

1. Confidence and lift

In this chapter, we'll discuss a variety of metrics that can be used to prune weak rules in your market basket analysis. We'll start in this lesson by learning how to analyze the relationship between items using two new metrics, confidence and lift.

2. When support is misleading

In the previous chapter, we learned about support. While support is a useful metric, it can be misleading if examined in isolation. For example, take the list of transactions shown in the table on the left, where bread and milk are frequently purchased together. A support-based rule might tell us "if milk then bread." But intuitively we know that this isn't informative for the purpose of cross-selling. The association arises because milk and bread are each popular on their own, so they often end up in the same basket.

3. The confidence metric

Fortunately, we can improve over support through the use of additional metrics. One such metric is called "confidence," which is defined as the support of items X and Y divided by the support of item X. Confidence tells us the probability that we'll purchase Y, given that we have purchased X.

4. Interpreting the confidence metric

Let's try calculating the confidence metric for the rule "if milk then coffee" using the first five transactions only. We first compute the support of milk and coffee, which is 0-point-2, since only one transaction contains both. We next compute the support of milk, which is 1-point-0.

5. Interpreting the confidence metric

This gives us a confidence value of 0-point-2, which is the same as the share of all transactions that contain coffee. Because every transaction contains milk, conditioning on milk does not change the probability of purchasing coffee. This means that purchasing milk tells us nothing about purchasing coffee. But what if only four transactions contained milk and one of those also had coffee? The confidence would rise to 0-point-25. And if only one transaction contained milk, and it also contained coffee, the confidence would rise further to 1-point-0.
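To make the arithmetic concrete, here is a minimal sketch of the three cases just described. The baskets are hypothetical stand-ins, not the table from the slide; only the counts of milk and coffee are chosen to match the narration. The helper names support and confidence are ours.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(x, y, transactions):
    """support(X and Y) / support(X) for the rule 'if X then Y'."""
    joint = support(set(x) | set(y), transactions)
    return joint / support(x, transactions)


# Hypothetical baskets: all five contain milk, one also contains coffee.
all_milk = [
    {"milk", "bread"},
    {"milk", "coffee"},
    {"milk", "bread"},
    {"milk", "biscuit"},
    {"milk", "bread"},
]

# Same sample, but the first basket no longer contains milk (four with milk).
four_milk = [{"bread"}] + all_milk[1:]

# Only one basket contains milk, and that basket also contains coffee.
one_milk = [{"milk", "coffee"}, {"bread"}, {"bread"}, {"biscuit"}, {"bread"}]

conf_all = confidence({"milk"}, {"coffee"}, all_milk)
conf_four = confidence({"milk"}, {"coffee"}, four_milk)
conf_one = confidence({"milk"}, {"coffee"}, one_milk)

print(conf_all)   # 0.2
print(conf_four)  # 0.25
print(conf_one)   # 1.0
```

Notice that the joint support of milk and coffee is 0-point-2 in all three samples. Only the support of milk changes, and that is what moves the confidence.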

6. The lift metric

The lift metric provides us with another way to improve over support. Lift is calculated as the support of items X and Y together, divided by the product of the support of X and the support of Y. The numerator gives us the proportion of transactions that contain both X and Y. The denominator tells us what that proportion would be if X and Y were randomly and independently assigned to transactions. A lift value of greater than one tells us that two items occur in transactions together more often than we would expect based on their individual support values. This suggests that the association reflects more than the items' individual popularity. A lift of exactly one is what we would expect under independence, which makes one a natural and convenient threshold for filtering purposes.
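As a rough illustration of that definition, the sketch below computes lift for two rules on a small set of hypothetical baskets. The data and the helper names are assumptions made for this example only.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def lift(x, y, transactions):
    """support(X and Y) / (support(X) * support(Y))."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


# Hypothetical baskets: milk and bread are both popular on their own,
# while coffee and biscuit are rarer but always appear together.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "coffee", "biscuit"},
    {"milk", "bread"},
    {"milk", "coffee", "biscuit"},
    {"milk", "bread"},
]

lift_milk_bread = lift({"milk"}, {"bread"}, transactions)
lift_coffee_biscuit = lift({"coffee"}, {"biscuit"}, transactions)

print(lift_milk_bread)      # 1.0: no more co-occurrence than independence predicts
print(lift_coffee_biscuit)  # about 2.5: together more often than expected
```

Milk and bread have the higher joint support here, yet their lift is only one. Coffee and biscuit have lower support but a lift well above one, which is the kind of rule a support-only filter would miss.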

7. Preparing the data

Finally, we'll conclude by introducing the goodbooks-10k dataset, which we'll use throughout this chapter. We'll load the data as a pandas DataFrame, convert it to a list, and then one-hot encode it, just as we did in chapter 1.
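The exact file and column names aren't given here, so the following is only a sketch of the shape of those steps on a tiny in-memory stand-in. The column name highly_rated, the comma-separated layout, and the title "OtherBook" are all hypothetical, and the encoding is done with plain pandas rather than whichever encoder was used in chapter 1.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data: one row per reader, holding a
# comma-separated string of the titles that reader rated highly.
raw = pd.DataFrame({
    "highly_rated": [
        "Hunger,Gatsby",
        "Hunger",
        "Gatsby,OtherBook",
        "Hunger,OtherBook",
    ]
})

# Convert the column to a list of lists, one inner list per reader's library.
libraries = raw["highly_rated"].str.split(",").tolist()

# One-hot encode: one boolean column per title, True if the reader rated it highly.
titles = sorted({title for library in libraries for title in library})
onehot = pd.DataFrame(
    [[title in library for title in titles] for library in libraries],
    columns=titles,
)

print(onehot)
```

Each row of onehot is one reader's library and each column is one book, which is the layout the support, confidence, and lift calculations expect.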

8. Computing confidence and lift

While the dataset consists of ratings for 10000 books, we'll focus on just The Hunger Games and The Great Gatsby. For the exercises we'll conduct, a TRUE value indicates that the reader has rated the book highly.

9. Computing confidence and lift

Let's evaluate the rule "if Hunger Games then Great Gatsby." We'll start by applying NumPy's logical_and function to the two book columns, which returns a TRUE value for each reader who has rated both books highly. Taking the mean of that result over all readers' libraries gives us the support for Hunger AND Gatsby. We'll then compute the support for each book on its own. We can now compute confidence and lift. If a reader rates Hunger Games highly, that lowers the probability that they'll also rate Great Gatsby highly from 0-point-3 to 0-point-16. We can also see that lift is less than 1, which indicates the same.
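Here is a minimal sketch of that calculation. The books DataFrame below is a made-up ten-reader example, not the goodbooks-10k data, so the numbers differ from the ones quoted above; the column names Hunger and Gatsby are also assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot data for ten readers (True = rated the book highly).
books = pd.DataFrame({
    "Hunger": [True, True, True, True, True, False, False, False, False, False],
    "Gatsby": [True, False, False, False, False, True, True, False, False, False],
})

# Joint support: share of readers who rated both books highly.
support_hg = np.logical_and(books["Hunger"], books["Gatsby"]).mean()

# Individual supports: share of readers who rated each book highly.
support_h = books["Hunger"].mean()
support_g = books["Gatsby"].mean()

# Rule: "if Hunger Games then Great Gatsby".
confidence = support_hg / support_h
lift = support_hg / (support_h * support_g)

print("support(Gatsby):", support_g)
print("confidence:", confidence)
print("lift:", lift)
```

In this toy sample the confidence is lower than the support of Great Gatsby and the lift is below one, so the two measures agree, as they do in the lesson.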

10. Let's practice!

It's now time to practice computing confidence and lift in some exercises.
