Mining association rules

1. Mining association rules

In the last lesson, you explored the movies dataset and converted your movie consumption data to a set of transactions. Now that you have that, it's time to create association rules that will inform your movie recommendations. Let's see what rules can be inferred from the set of transactions!

2. Frequent itemsets with apriori

Let's start with frequent itemsets. Remember that a frequent itemset is a set of items that often appear together in a transactional dataset. In R, we extract the itemsets by using the apriori function and setting the target argument to "frequent" (or, written out in full, "frequent itemsets"). In the Groceries example, we also require the itemsets to contain at least two items. When inspecting the first frequent itemsets, you notice that "vegetables" and "milk" are items that are often purchased in combination. Are you surprised by that combination, or by any other combination in the list?
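As a minimal sketch of the call described above, the following uses the Groceries data that ships with the arules package. The support threshold of 0.01 and the choice to show five itemsets are assumptions made for illustration; the values used on the slide are not given in this transcript.

```r
library(arules)

# Groceries is a transactions object bundled with arules
data("Groceries")

# Extract frequent itemsets instead of rules
itemsets <- apriori(
  Groceries,
  parameter = list(
    support = 0.01,               # assumed threshold, not from the lesson
    minlen  = 2,                  # keep only itemsets with at least two items
    target  = "frequent itemsets"
  )
)

# Look at the itemsets with the highest support first
inspect(head(sort(itemsets, by = "support"), 5))
```

Sorting by support before calling head puts the most frequently co-purchased combinations at the top of the output.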

3. Rules with the apriori

Similarly, to extract association rules, you can use the same syntax, but this time set the target parameter to "rules". Recall that when you inspect rules, you should use "head" or "tail", preferably after ordering them with the "sort" function. In this case, the rule with the highest confidence is: "IF Rice and Sugar THEN milk". You may still be wondering: how can we set adequate parameter values in the apriori function, such as minimum support and confidence?
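A sketch of the same call with the target switched to rules might look like this. The minimum support and confidence values are assumptions chosen for illustration, so the rules you see may differ from the ones on the slide.

```r
library(arules)
data("Groceries")

# Same syntax as for itemsets, but the target is now "rules"
rules <- apriori(
  Groceries,
  parameter = list(
    support    = 0.001,  # assumed minimum support
    confidence = 0.5,    # assumed minimum confidence
    minlen     = 2,      # rules need a non-empty left-hand side
    target     = "rules"
  )
)

# Sort first, then inspect only the top of the list
inspect(head(sort(rules, by = "confidence"), 5))
```

Replacing head with tail shows the rules at the other end of the sorted list, that is, those with the lowest confidence among the ones that passed the threshold.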

4. Choosing parameters in arules

One way to choose the parameters of the "apriori" function is the following. Let's compute the number of rules extracted as a function of the parameters that need to be set in apriori. In a sense, you are doing a grid search: you count the rules produced by each combination of parameters. In the case of movie recommendations, we want a different set of rules for various confidence levels, which lets us choose the set of rules according to the strength of the connection between movies and, in turn, recommend movies better. In this code snippet, we loop over a range of confidence values and use the "length" function to get the number of rules at each one. Once we have saved the number of rules, we plot it against the confidence level. The "qplot" function from the "ggplot2" package lets you do this quickly. Note that we keep the minimum support fixed here, but we could just as well have looped over that parameter, or over both confidence and support. Note, however, that in some cases this step can be time-consuming.
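The loop described above can be sketched roughly as follows. Because the movie transactions are not available in this transcript, the sketch uses Groceries as a stand-in; the fixed support of 0.005 and the confidence grid from 0.5 to 0.95 are assumptions, as are the variable names.

```r
library(arules)
library(ggplot2)
data("Groceries")

# Grid of confidence levels to try (assumed range and step)
confidence_levels <- seq(from = 0.5, to = 0.95, by = 0.05)
n_rules <- integer(length(confidence_levels))

for (i in seq_along(confidence_levels)) {
  rules_i <- apriori(
    Groceries,
    parameter = list(
      support    = 0.005,                 # kept fixed (assumed value)
      confidence = confidence_levels[i],  # the parameter being varied
      minlen     = 2,
      target     = "rules"
    ),
    control = list(verbose = FALSE)       # silence the progress output
  )
  n_rules[i] <- length(rules_i)           # number of rules at this level
}

# Quick plot of rule count against confidence
p <- qplot(confidence_levels, n_rules,
           geom = c("point", "line"),
           xlab = "Confidence level",
           ylab = "Number of rules")
print(p)
```

The count can only stay equal or shrink as the confidence threshold rises, so the curve shows how quickly the rule set thins out. Recent ggplot2 releases may print a deprecation warning for qplot; building the same plot with ggplot, geom_point and geom_line is an equivalent alternative. Looping over support as well would turn this into a two-dimensional grid, at the cost of one apriori run per cell.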

5. Subsetting rules

Given the large number of rules that are extracted, you may want to pick out some specific rules. Subsetting extracted rules is very common and gives you the flexibility to find the information you are looking for. In the Groceries dataset, we are interested in finding all rules that contain "cheese" or "milk" and have a confidence higher than 0.95. Note that we use "items" as the keyword together with the "%in%" operator. If you would like both "cheese" and "milk" to appear in the rule, use the "%ain%" operator instead. Finally, you can specify the items that should appear on the right-hand side of the rule, on the left-hand side, or on both. This shows how flexibly you can mine association rules. In the case of movies, we may want to select rules for which the confidence or the lift exceeds a certain threshold: we are only interested in movie rules that we are fairly confident hold true.
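The subsetting options mentioned here can be sketched as below. The operators need exact item labels, so the sketch uses "hard cheese" and "whole milk" from Groceries in place of the shorthand "cheese" and "milk"; those labels, the apriori thresholds and the lift cut-off of 2 are all assumptions made for illustration.

```r
library(arules)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.5,
                   minlen = 2, target = "rules"),
  control = list(verbose = FALSE)
)

# Rules containing EITHER item anywhere, with confidence above 0.95
either <- subset(rules,
                 subset = items %in% c("hard cheese", "whole milk") &
                   confidence > 0.95)

# Rules containing BOTH items anywhere, with confidence above 0.95
both <- subset(rules,
               subset = items %ain% c("hard cheese", "whole milk") &
                 confidence > 0.95)

# Rules with a given item on the right-hand side and lift above 2
rhs_milk <- subset(rules, subset = rhs %in% "whole milk" & lift > 2)

inspect(head(sort(either, by = "confidence"), 5))
```

Using lhs in place of rhs restricts the left-hand side instead, and conditions on quality measures such as confidence and lift can be combined with the item conditions using the usual logical operators.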

6. Let's mine the movie dataset!

Enough talking, let's now practice our skills with the Movie dataset. We will try to understand watchers' preferences.


Market Basket Analysis in R
