1. Exploring your data set
Hi, and welcome to this case study in network analysis. In this course we'll put into practice all the concepts you learned in your intro to network analysis, using real-world data sets. In this first lesson we'll use a data set made up of several daily snapshots of items purchased together on Amazon over the course of 2003. These are known as co-purchases.
2. Exploring your data
The first step when you're working with a new data set is to explore the raw data. While we'll go on to create an igraph object out of this raw data, it's useful to understand the data associated with each vertex. The columns that make up the graph itself are the 'to' and 'from' columns, each with a vertex id. Then there's associated metadata, such as the name of the product, the type of the product, sales rank, etc. We'll use this later in the lesson.
3. Creating the graph
We'll use dplyr to help create a graph directly from our data frame. The first step in our pipeline is to filter down to a single date. It's important that we look at just a single day so we don't conflate co-purchases on different days. Our next step is to select just the from and to columns, and lastly we specify that it's a directed graph. Checking the size, we can see there are around 10,000 vertices.
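To make the three steps concrete (filter to one date, keep only the from and to columns, treat each pair as a directed edge), here is a minimal plain-Python sketch. It is not the dplyr and igraph pipeline from the lesson. The "from" and "to" column names come from the lesson, while the "date" column name, the helper name build_directed_graph, and every value in rows are made up for illustration.

```python
# Hypothetical rows standing in for the co-purchase data frame.
# "from" and "to" come from the lesson; the "date" column name and
# all values below are assumptions made for this sketch.
rows = [
    {"from": 1, "to": 2, "date": "2003-03-02"},
    {"from": 2, "to": 1, "date": "2003-03-02"},
    {"from": 2, "to": 3, "date": "2003-03-02"},
    {"from": 1, "to": 4, "date": "2003-05-05"},
]


def build_directed_graph(rows, date):
    """Return (vertices, edges) for the co-purchases recorded on one date."""
    # Filter to a single day, then keep only the (from, to) pair.
    # Order matters in each tuple, which is what makes the graph directed.
    edges = {(r["from"], r["to"]) for r in rows if r["date"] == date}
    # Every vertex id that appears at either end of an edge.
    vertices = {v for edge in edges for v in edge}
    return vertices, edges


vertices, edges = build_directed_graph(rows, "2003-03-02")
print(len(vertices), "vertices,", len(edges), "edges")
```

Storing edges as ordered tuples keeps (1, 2) and (2, 1) distinct, which is what we need later when we count mutual versus asymmetric connections.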
4. Visualize the graph
The graph is obviously quite large, so we'll just look at a small subgraph. First we'll use the function "induced_subgraph" to make a new graph that contains just the first 500 vertices. Next we'll delete any vertices with a degree of 0. Finally we'll make a plot. As you can see, this might look a bit different from your platonic ideal of a graph, because only a few things tend to be purchased in a single Amazon order. That's why we see all these little clusters of connected vertices. The igraph package provides a way to count all these small subpatterns, which are called dyads (pairs of vertices) and triads (sets of three vertices).
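The two subsetting steps can be sketched in plain Python to show what they do to an edge list. This only illustrates the idea and is not igraph's implementation. The helper names and the toy edge set are assumptions, and five vertices stand in for the first 500.

```python
def induced_subgraph(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    keep = set(keep)
    return {(u, v) for (u, v) in edges if u in keep and v in keep}


def drop_isolated(vertices, edges):
    """Remove vertices with degree 0 (no incoming and no outgoing edge)."""
    touched = {v for edge in edges for v in edge}
    return set(vertices) & touched


# Toy directed edge list; values are made up for illustration.
all_edges = {(1, 2), (2, 1), (2, 3), (4, 9), (5, 8)}
first_vertices = [1, 2, 3, 4, 5]  # stand-in for "the first 500 vertices"

sub_edges = induced_subgraph(all_edges, first_vertices)
sub_vertices = drop_isolated(first_vertices, sub_edges)
print(sorted(sub_vertices), sorted(sub_edges))
```

Vertices 4 and 5 lose their only edges when we cut the graph down, because their partners fall outside the subset. That is why the degree-0 cleanup step is needed before plotting.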
5. Dyads
When we run a dyad census on our graph using dyad_census(), we'll see three outputs from igraph: counts of null, asymmetric, and mutual dyads. These correspond to the following subgraph types. Null is when there is no connection, asymmetric is when there is a single directed edge, and mutual is when there are two directed edges, one in each direction. This underlying pattern has implications for graph-level metrics like reciprocity.
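To see how the three categories are decided, here is a small plain-Python sketch that classifies every pair of vertices by hand. It mirrors the definitions above rather than igraph's own code. The function name matches the lesson, but this version is our own, and the toy graph is made up for illustration.

```python
from itertools import combinations


def dyad_census(vertices, edges):
    """Count mutual, asymmetric, and null dyads in a directed graph."""
    edges = set(edges)
    counts = {"mutual": 0, "asymmetric": 0, "null": 0}
    # Visit every unordered pair of vertices exactly once.
    for u, v in combinations(sorted(vertices), 2):
        forward = (u, v) in edges
        backward = (v, u) in edges
        if forward and backward:
            counts["mutual"] += 1       # edges in both directions
        elif forward or backward:
            counts["asymmetric"] += 1   # a single directed edge
        else:
            counts["null"] += 1         # no connection at all
    return counts


# Toy graph: 1 <-> 2, 2 -> 3, and vertex 4 connected to nothing.
census = dyad_census({1, 2, 3, 4}, {(1, 2), (2, 1), (2, 3)})
print(census)
```

With n vertices there are n(n-1)/2 pairs in total, so in a sparse graph like ours the null count dominates.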
6. Triads
Triads get even more complicated because in a directed graph there are 16 possible triad types. To understand all of these, a common three-digit code is used. The first number is a count of the pairs of vertices connected by a mutual (bidirectional) edge, the second number is a count of the pairs of vertices connected by an asymmetric edge, and the third number is a count of the pairs of unconnected vertices. The letter codes C, D, U, and T denote whether a triad is cyclic (like 10), has single edges going Down from the top (like 7 or 12) or Up from the bottom (like 8), or is transitive (like 9). Transitivity is the idea that if one vertex is connected to a second, and the second is connected to a third, then the first must also be connected to the third. It should be clear that patterns 1, 2, and 3 are essentially the dyad patterns. When we run the igraph triad census function, triad_census(), we'll get a count of each of these 16 possibilities.
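The three-digit part of the code is easy to compute by hand, so here is a plain-Python sketch that does just that for every set of three vertices. Note the simplification: it ignores the letter suffixes (C, D, U, T), so it groups the 16 types into the 10 distinct numeric codes. It is not a replacement for triad_census(). The function names and toy graph are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def man_code(triple, edges):
    """Return the three-digit code (mutual, asymmetric, null) for one triple."""
    m = a = n = 0
    for u, v in combinations(triple, 2):
        forward = (u, v) in edges
        backward = (v, u) in edges
        if forward and backward:
            m += 1
        elif forward or backward:
            a += 1
        else:
            n += 1
    return f"{m}{a}{n}"


def numeric_triad_census(vertices, edges):
    """Count triples by numeric code only (letter suffixes are not resolved)."""
    edges = set(edges)
    triples = combinations(sorted(vertices), 3)
    return Counter(man_code(t, edges) for t in triples)


# Toy graph: 1 <-> 2, 2 -> 3, and vertex 4 connected to nothing.
counts = numeric_triad_census({1, 2, 3, 4}, {(1, 2), (2, 1), (2, 3)})
print(dict(counts))
```

The three digits always sum to 3, because a triple contains exactly three pairs of vertices. Telling apart, say, 021D from 021U would additionally require looking at which vertex the edges point to or from.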
7. Let's practice
Now it's time to count the dyads and triads in the co-purchase graph and then see how those numbers relate to the graph level metrics of transitivity and reciprocity.
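As a hint for how the dyad counts connect to reciprocity, here is a short sketch. It assumes reciprocity is defined as the share of directed edges that are reciprocated. Other definitions exist, so check which one your reciprocity function uses before comparing numbers. The helper name is hypothetical.

```python
def reciprocity_from_dyads(mutual, asymmetric):
    """Share of directed edges that have a matching edge in the other direction.

    Assumes reciprocity = reciprocated edges / all edges. Each mutual dyad
    contributes two edges, each asymmetric dyad contributes one.
    """
    total_edges = 2 * mutual + asymmetric
    if total_edges == 0:
        return float("nan")  # undefined for a graph with no edges
    return 2 * mutual / total_edges


# One mutual dyad and one asymmetric dyad: 2 of the 3 edges are reciprocated.
print(reciprocity_from_dyads(mutual=1, asymmetric=1))
```

Null dyads don't appear in the formula at all, which is why a huge sparse graph and a small dense one can have the same reciprocity.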