
Identifying association rules

1. Identifying association rules

In this video, we'll discuss the fundamental problem of market basket analysis -- namely, taking an enormous set of potential association rules and selecting only those which are useful for a specific business application.

2. Loading and preparing data

Throughout this chapter, we'll make use of two datasets. In the videos, we'll work with bookstore data, and in the exercises, we'll apply what we've learned to grocery store data. Let's start by loading and preparing the bookstore data. We'll first use read_csv to load the book libraries as a pandas DataFrame. We'll then select the transaction column, where a "transaction" refers to a user library. Next, we convert the comma-separated transaction strings into lists and then convert the column into a list of lists, just as we did in the introductory video.
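Here's a minimal sketch of those steps; the file name 'bookstore.csv' and the column name 'Transaction' are assumptions for illustration, so adjust them to match your own data.

```python
import pandas as pd

# Load the book libraries; the file and column names here are
# illustrative assumptions, not the course's actual file.
books = pd.read_csv('bookstore.csv')

# Each transaction is one user library stored as a comma-separated string.
# Split each string into a list of genres.
transactions = books['Transaction'].apply(lambda t: t.split(','))

# Convert the resulting Series into a plain list of lists.
transactions = list(transactions)
```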

3. Exploring the data

Now that we've loaded the data, let's print the first five transactions to get a feel for how it's structured. Notice that the length of each list of transaction items isn't fixed. Some contain four book genres and others just one. Additionally, we can now see that the small bookstore has grown to include many more genres.
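In code, that inspection might look like the following sketch, reusing the transactions list built above.

```python
# Print the first five transactions to see how the data is structured.
for transaction in transactions[:5]:
    print(transaction)

# The lists vary in length: some libraries contain several genres, others one.
print([len(t) for t in transactions[:5]])
```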

4. Association rules

Recall that an association rule contains an antecedent and a consequent. A simple rule with one antecedent and one consequent might be "if health then cooking." We can also have more complicated rules, which have multiple antecedents, such as "if humor and travel then language." Or multi-consequent rules, such as "if biography then history and language."
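One simple way to represent such rules in code is as (antecedent, consequent) pairs of frozensets; the examples below mirror the rules just described and are purely illustrative.

```python
# Rules as (antecedent, consequent) tuples of frozensets.
rules = [
    (frozenset({'health'}), frozenset({'cooking'})),                 # simple rule
    (frozenset({'humor', 'travel'}), frozenset({'language'})),       # multi-antecedent
    (frozenset({'biography'}), frozenset({'history', 'language'})),  # multi-consequent
]
```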

5. Difficulty of selecting rules

Finding good rules can be challenging. For most datasets, the number of possible rules is enormous. Since most rules are not useful, we must find a way to discard rules that are unlikely to be helpful for the task at hand. We could start, for instance, by looking exclusively at simple rules with one antecedent and one consequent. As we will see, this is still challenging, even when we only have 9 genres.

6. Generating the rules

Recall that there were nine genres in the dataset: fiction, poetry, history, biography, cooking, health, travel, language, and humor.

7. Generating the rules

Let's iterate through all one-antecedent, one-consequent rules. We can do this by starting with those that have fiction as the antecedent and pairing fiction with all possible consequents. Next, we switch to poetry as the antecedent and pair it with all possible consequents. We repeat this for all remaining possible antecedents. Since we only consider unique items in a transaction, we will not include rules where the antecedent and consequent are the same. This yields 72 rules, even though we only had 9 items and ignored multi-antecedent and multi-consequent rules.
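A minimal sketch of that enumeration with plain loops:

```python
genres = ['fiction', 'poetry', 'history', 'biography', 'cooking',
          'health', 'travel', 'language', 'humor']

# Pair every genre with every other genre, skipping rules where the
# antecedent and consequent are the same item.
rules = [(antecedent, consequent)
         for antecedent in genres
         for consequent in genres
         if antecedent != consequent]

print(len(rules))  # 9 * 8 = 72
```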

8. Generating rules with itertools

Fortunately, we do not need to repeat this process manually for new itemsets. We can use the permutations function from itertools to generate this list by iterating over all ordered pairs of items. We first import permutations and then extract the unique set of items in two steps. First, we flatten the list of lists into a list of items. Next, we use the set function to identify unique items and then pass the output to a list. After that, we use permutations to generate all ordered pairs of items. We then convert this object to a list and pass it to the print function.
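A sketch of that process, assuming transactions is the list of lists built earlier:

```python
from itertools import permutations

# Flatten the list of lists into a single list of items.
flattened = [item for transaction in transactions for item in transaction]

# Keep only the unique items, then convert the set back to a list.
items = list(set(flattened))

# permutations generates every ordered pair of distinct items -- that is,
# every one-antecedent, one-consequent rule.
rules = list(permutations(items, 2))
print(rules)
```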

9. Counting the rules

We can also print the length of the list, which gives us the total number of rules. If 72 rules sounds like a lot for 9 items, look at how quickly the number of rules increases with the number of items. Just 100 items yield 9,900 such rules.
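Since each rule pairs one of n items with any of the n - 1 others, the count is n * (n - 1):

```python
print(len(rules))   # 9 items -> 9 * 8 = 72 rules

# The rule count grows quadratically with the number of items.
n = 100
print(n * (n - 1))  # 100 items -> 9900 rules
```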

10. Looking ahead

In later chapters, we'll make use of a package called mlxtend. This will allow us to preprocess the data, generate itemsets and rules, and filter according to metrics. This will greatly simplify the process of identifying a narrow set of useful rules.
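As a preview, a typical mlxtend workflow looks roughly like the sketch below; the support thresholds are placeholder assumptions for illustration, not recommendations.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encode the list of lists into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Mine frequent itemsets, then generate and filter rules by a metric.
# The thresholds below are placeholders, not tuned values.
frequent_itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='support', min_threshold=0.01)
print(rules.head())
```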

11. Let's practice!

We now know what an association rule is and that there will be too many to examine carefully in large datasets. Let's practice applying this to a separate dataset in some exercises.