
Capturing K clusters

1. Capturing K clusters

In the last few exercises, you explored how multiple observations can be grouped together using linkage analysis. Now you are ready to leverage this technique to group your observations into a predefined number of clusters. So let's revisit the soccer example with a few more players.

2. Grouping soccer players

In this case you have the positions of six players at the start of a game and you would like to infer which players belong to which team using hierarchical clustering.

3. Grouping soccer players

A Euclidean distance matrix was calculated for each pair of players and is now used to group the players using the complete linkage criterion. The algorithm iteratively merges the players into larger and larger groups until they all fall under a single group, like so...
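The distance matrix behind these slides can be sketched as follows. Note that the course's actual player coordinates are not shown in the transcript, so the `(x, y)` positions below are hypothetical stand-ins:

```r
# Hypothetical (x, y) starting positions for six players --
# made up for illustration, not the course's actual data
players <- data.frame(x = c(-1, -2, 8, 7, -12, -15),
                      y = c(1, -3, 6, -8, 8, 0))

# Pairwise Euclidean distances between all players
dist_players <- dist(players, method = "euclidean")
round(dist_players, 1)
```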

4. Grouping soccer players

5. Grouping soccer players

6. Grouping soccer players

7. Grouping soccer players

Once this is completed, we can work backwards to capture a desired number of clusters. At this moment, there is just one cluster.

8. Extracting 2 clusters

If we remove the last grouping like so...

9. Grouping soccer players

We have two distinct clusters.

10. Grouping soccer players

The red cluster contains players five and six, while the blue cluster contains players one through four. Just like peeling an onion, we can split this further into more parts by removing the previous linkage grouping.

11. Grouping soccer players

In this case, it was the group of players 1, 2, and 4 linked to player 3.

12. Grouping soccer players

And now we have three distinct clusters (red, blue, and green). So, the process of identifying a predefined number of clusters, which we will refer to as k, is as simple as undoing the last k - 1 steps of the linkage grouping. Now let's learn how to do this in R.

13. Hierarchical clustering in R

The positions of the players are available in the data frame called players. As before, to get the Euclidean distance between each pair of players we use the dist function. To perform the linkage steps we use the hclust function, which accepts a distance matrix, in our case dist_players, and a linkage method. The default linkage method is the complete method. This results in an hclust object containing the linkage steps, which can now be used to extract clusters.
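The steps just described can be sketched like this; since the transcript does not include the actual players data, the coordinates below are hypothetical:

```r
# Hypothetical stand-in for the course's players data frame
players <- data.frame(x = c(-1, -2, 8, 7, -12, -15),
                      y = c(1, -3, 6, -8, 8, 0))

# Euclidean distance between each pair of players
dist_players <- dist(players, method = "euclidean")

# Complete linkage (this is also hclust's default method)
hc_players <- hclust(dist_players, method = "complete")

# n observations are joined in n - 1 linkage steps,
# recorded row by row in the merge matrix (here, 5 rows)
hc_players$merge
```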

14. Extracting K clusters

In order to determine which observations belong to which cluster, we use the cutree function. In this case we want two clusters because we know that there are two teams, so we provide the function with an hclust object and specify that we want a k of two. The output of cutree is a vector indicating which cluster each observation belongs to. We can append this back to our original data frame to do further analysis with the now-clustered observations.
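A minimal sketch of the cutree step, again using hypothetical coordinates in place of the course's players data:

```r
# Hypothetical stand-in for the course's players data frame
players <- data.frame(x = c(-1, -2, 8, 7, -12, -15),
                      y = c(1, -3, 6, -8, 8, 0))
hc_players <- hclust(dist(players), method = "complete")

# k = 2 because we expect two teams:
# cutree returns one cluster label per observation
clusters_k2 <- cutree(hc_players, k = 2)
clusters_k2

# Append the assignments back to the data frame
players$cluster <- clusters_k2
head(players)
```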

15. Visualizing K Clusters

One way we can analyze the clustering result is to plot the positions of the players and color the points based on their cluster assignment. Here we do this using ggplot. Remember that this clustering incorporated several decisions: the distance metric was Euclidean, the linkage method was complete, and k was 2. Changing any of these may, and likely will, impact the resulting clusters. This is why it is crucial to analyze the results to see if they actually make sense. For example, in this case the cluster analysis was aimed at identifying the teams to which the players belong based on their positions at the start of the game. Since soccer games have the same number of players on each team, we know that the results of this clustering are incorrect and we would need to consider a different distance or linkage criterion. Incorporating an understanding of your data and your problem into clustering analysis is the key to successfully leveraging this tool.
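The plot described above could be sketched as follows; it assumes the ggplot2 package is installed and uses hypothetical player coordinates, since the actual data is not in the transcript:

```r
library(ggplot2)

# Hypothetical stand-in for the course's clustered players data
players <- data.frame(x = c(-1, -2, 8, 7, -12, -15),
                      y = c(1, -3, 6, -8, 8, 0))
players$cluster <- cutree(hclust(dist(players), method = "complete"), k = 2)

# Color each player's position by its cluster assignment;
# factor() makes the legend discrete rather than a gradient
p <- ggplot(players, aes(x = x, y = y, color = factor(cluster))) +
  geom_point(size = 3) +
  labs(color = "cluster")
p
```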

16. Let's practice!

So, let's do just that with some exercises.