Transactional data

1. Transactional data

Welcome to this new chapter where we will dive into metrics and techniques used in Market Basket Analysis. Let's start with the concept of transactional data.

2. What is a transaction?

In a commercial environment, a transaction is generally defined as the activity of buying or selling something, such as buying bread from the grocery shop. Market Basket Analysis deals with data at the transactional level: transactional data lists all items bought by a customer in a single purchase. For instance, we buy one loaf of bread and three pieces of cheese from the store. This is considered one transaction, with a Transaction ID here set to 1. This ID uniquely identifies the purchase made by a customer. If the same customer comes back to the store a few hours later to buy more products, that set of purchased items is assigned a different Transaction ID, as it is considered a new purchase.

3. The transactional class in R

Let's make R understand that we are working with transactions. We can do this by using the transactional class, which R calls "transactions". As we will potentially be dealing with millions of transactions, we need a fast and efficient way of working with them. These transactions will later be passed to an association rule mining algorithm. So how can we transform data into the transactional class? You can coerce lists, matrices and dataframes to transactional objects. As always in data analysis, structuring and preparing your data is very important. Make sure you identify the fields or columns that identify the product and the transaction.
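As a small illustration of coercing a matrix, here is a minimal sketch. It assumes the arules package, which provides the "transactions" class; the matrix `m`, its two transactions and its three items are made up for the example.

```r
# Assumes the arules package is installed; the data below is illustrative only.
library(arules)

# Logical incidence matrix: one row per transaction, one column per item.
# TRUE means the item is part of the transaction.
m <- matrix(
  c(TRUE, TRUE,  FALSE,
    TRUE, FALSE, TRUE),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("Bread", "Butter", "Cheese"))
)

# Coerce the matrix to the transactions class and look at the result
trx <- as(m, "transactions")
inspect(trx)
```

Lists and dataframes can be coerced in a similar way; the list route is the one followed in the rest of this lesson.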

4. Back to the grocery store (1)

Back to the set of transactions we saw earlier: 7 transactions with a total set of 4 items (Bread, Butter, Cheese and Wine). This dataframe contains one row per product purchased. This is a typical dataframe for transactional data: a column for the transaction ID and one with the product name; sometimes you may even have additional information such as a product ID, a timestamp or a physical location related to the transaction.
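A dataframe of this shape could be built as sketched below. The individual rows are assumed for illustration (the transcript does not list them); they are chosen only to match the description of 7 transactions over 4 items. The column name `Product` is also an assumption.

```r
# Illustrative data: the exact rows are assumed, not taken from the slides.
my_transactions <- data.frame(
  TID = rep(1:7, times = c(4, 3, 2, 3, 2, 2, 2)),
  Product = c("Bread", "Butter", "Cheese", "Wine",   # transaction 1
              "Bread", "Butter", "Wine",             # transaction 2
              "Bread", "Butter",                     # transaction 3
              "Butter", "Cheese", "Wine",            # transaction 4
              "Butter", "Cheese",                    # transaction 5
              "Cheese", "Wine",                      # transaction 6
              "Butter", "Wine"),                     # transaction 7
  stringsAsFactors = FALSE
)

# One row per product purchased
head(my_transactions)

# Number of distinct transactions and distinct items
length(unique(my_transactions$TID))
length(unique(my_transactions$Product))
```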

5. Back to the grocery store (2)

To transform the dataframe into the R transactional class, the easiest way is to first group items by transaction ID. We can use the split function. First we need to convert the transaction ID to a factor in order to use it as a grouping attribute. The split function has two main arguments: first the items or products, and second the grouping factor, here TID. The output is a list of length 7, with each product allocated to its respective transaction. For instance, the first element includes all four items because these are the products bought in transaction ID 1, while the third one includes only Bread and Butter.
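The grouping step can be sketched in base R as follows. The rows of `my_transactions` are assumed for illustration, and the names `Product` and `data_list` are hypothetical.

```r
# Illustrative data: the exact rows are assumed, not taken from the slides.
my_transactions <- data.frame(
  TID = rep(1:7, times = c(4, 3, 2, 3, 2, 2, 2)),
  Product = c("Bread", "Butter", "Cheese", "Wine",
              "Bread", "Butter", "Wine",
              "Bread", "Butter",
              "Butter", "Cheese", "Wine",
              "Butter", "Cheese",
              "Cheese", "Wine",
              "Butter", "Wine"),
  stringsAsFactors = FALSE
)

# Convert the transaction ID to a factor so it can serve as the grouping attribute
my_transactions$TID <- factor(my_transactions$TID)

# split(x, f): x holds the items, f is the grouping factor
data_list <- split(my_transactions$Product, my_transactions$TID)

length(data_list)   # one list element per transaction
data_list[["1"]]    # items bought in transaction 1
data_list[["3"]]    # items bought in transaction 3
```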

6. Back to the grocery store (3)

We coerce the list we created to the transactional class by using the "as" function with the string "transactions" as its second argument. An important function we will be using is "inspect". It lets you take a closer look at the transactions. Items are now assembled into sets, with one row per transaction or basket. Recall that the dataframe "my_transactions" had one row per item.
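A minimal sketch of the coercion, assuming the arules package. The list contents are illustrative, and the names `data_list` and `data_trx` are hypothetical.

```r
# Assumes the arules package is installed; the data below is illustrative only.
library(arules)

# One list element per transaction, named by transaction ID
data_list <- list(
  "1" = c("Bread", "Butter", "Cheese", "Wine"),
  "2" = c("Bread", "Butter", "Wine"),
  "3" = c("Bread", "Butter"),
  "4" = c("Butter", "Cheese", "Wine"),
  "5" = c("Butter", "Cheese"),
  "6" = c("Cheese", "Wine"),
  "7" = c("Butter", "Wine")
)

# Coerce the list to the transactions class
data_trx <- as(data_list, "transactions")

# Print the transactions, one row per basket
inspect(data_trx)
```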

7. More inspections of transactions

If you are dealing with millions of transactions, be careful not to inspect the whole set; instead, use the "head" function to look at the first few. Likewise, the "tail" function yields the last few records of the transactional dataset. You can also access specific transactions by index. Finally, the "summary" function yields a summary of the transactional data.
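These inspection options might look like the sketch below, again assuming the arules package and illustrative data; `data_trx` and the intermediate names are hypothetical.

```r
# Assumes the arules package is installed; the data below is illustrative only.
library(arules)

data_trx <- as(list(
  "1" = c("Bread", "Butter", "Cheese", "Wine"),
  "2" = c("Bread", "Butter", "Wine"),
  "3" = c("Bread", "Butter"),
  "4" = c("Butter", "Cheese", "Wine"),
  "5" = c("Butter", "Cheese"),
  "6" = c("Cheese", "Wine"),
  "7" = c("Butter", "Wine")
), "transactions")

# First three and last two transactions only
first_trx <- head(data_trx, 3)
last_trx  <- tail(data_trx, 2)
inspect(first_trx)
inspect(last_trx)

# Specific transactions selected by index
picked_trx <- data_trx[c(1, 3)]
inspect(picked_trx)

# Overall summary of the transactional data
summary(data_trx)
```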

8. Overview of transactions

Another way to understand your transactional dataset is to visualize its item matrix. The item matrix gives an overview of all transactions and items at the same time. It is a two-dimensional matrix with items on the x axis and transactions on the y axis, so here it has 7 rows and 4 columns. A dark cell means that the item belongs to that transaction, whereas a white cell means the item is not part of the transaction. The "image" function plots the item matrix. However, be careful not to display too many transactions at once if you are working with millions of transactions. This kind of plot gives you a hint about the sparsity of the item matrix, in other words how frequently items are part of transactions. The density of the item matrix is the ratio of dark cells to the total number of cells, here 18 out of 28, which gives a density of about 64%.
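The plot and the density calculation can be sketched as follows, assuming the arules package. The data is illustrative, chosen so that 18 of the 28 cells are filled; `data_trx`, `filled_cells`, `total_cells` and `density` are hypothetical names.

```r
# Assumes the arules package is installed; the data below is illustrative only.
library(arules)

data_trx <- as(list(
  "1" = c("Bread", "Butter", "Cheese", "Wine"),
  "2" = c("Bread", "Butter", "Wine"),
  "3" = c("Bread", "Butter"),
  "4" = c("Butter", "Cheese", "Wine"),
  "5" = c("Butter", "Cheese"),
  "6" = c("Cheese", "Wine"),
  "7" = c("Butter", "Wine")
), "transactions")

# Plot the item matrix: items on the x axis, transactions on the y axis.
# With a large dataset, plot a subset instead, e.g. image(data_trx[1:5]).
image(data_trx)

# Density = filled cells / total cells
filled_cells <- sum(size(data_trx))   # size() gives the number of items per transaction
total_cells  <- prod(dim(data_trx))   # transactions x items
density      <- filled_cells / total_cells
density
```

The same density figure is also reported by the "summary" function.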

9. Let's inspect transactions!

Your turn to inspect transactions from the online retail dataset!