Comparing more than two observations
1. Comparing more than two observations
At the end of chapter 1, you were asked to review a question that you may not have known how to answer. Let's start by revisiting this question.
2. The closest observation to a pair
You were presented with a distance matrix that contained the Euclidean distances between four soccer players. You know that the closest two players are 1 and 4, with a distance value of 10. In order to cluster more than two observations together, you need to determine which of the following is true: is observation 2 closest to the newly formed group 1-4, or is it observation 3?
3. Linkage criteria: complete
To answer this question, you must decide how to measure the distance from group 1-4 to each of these observations. One approach is to take the maximum distance from the observation to the two members of the group. To calculate this aggregated distance between observation 2 and group 1-4, we take the larger of the two distances from 2 to 1 and from 2 to 4. The distance from 2 to 1 is 11.7 and the distance from 2 to 4 is 20.6. The larger of the two values is of course 20.6, and hence it is our maximum distance. We can apply the same logic to observation 3, resulting in a maximum distance of 16.8. Using this approach, we can say that, based on the maximum distance, observation 3 is closer to group 1-4.
4. Hierarchical clustering
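The maximum-distance comparison from the previous slide can be sketched in a few lines of plain Python (chosen here for illustration only). The function name `complete_linkage` and the variable names are hypothetical. The distances for player 2 are the ones quoted in the lesson; for player 3 the lesson states only the resulting maximum, so that value is used directly.

```python
# Distances from player 2 to each member of group 1-4, as quoted in the lesson.
dist_2_to_group = {1: 11.7, 4: 20.6}


def complete_linkage(distances_to_members):
    """Complete linkage: the largest distance to any member of the group."""
    return max(distances_to_members.values())


d2 = complete_linkage(dist_2_to_group)

# The lesson gives only the result for player 3, not the two underlying distances.
d3 = 16.8

# The observation with the smaller linkage distance is the closer one.
closest = 2 if d2 < d3 else 3
print(f"player 2: {d2}, player 3: {d3}, closest to group 1-4: player {closest}")
```

With these numbers, player 2 sits at 20.6 and player 3 at 16.8, so player 3 is the closer one under this criterion.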
Hierarchical clustering is just a continuation of this approach. This clustering method iteratively groups the observations based on their pairwise distances until every observation is linked into one large group. The rule used to select the closest observation to an existing group is called the linkage criterion. In the previous example, we decided that observation 3 was the closest based on the maximum distance between it and group 1-4. The approach we used is formally called the complete linkage criterion.
5. Grouping with linkage & distance
Let's see the hierarchical clustering method in action using a visual representation.
6. Grouping with linkage & distance
The distances between the four players have already been calculated and are shown.
7. Grouping with linkage & distance
We know that players 1 and 4 have the shortest distance and will be grouped first.
8. Grouping with linkage & distance
We are now presented with three options: add player 2 to group 1-4, add player 3 to group 1-4, or start a new group for players 2 and 3. The decision will be made based on which option results in the smallest distance.
9. Grouping with linkage & distance
As before, players 2 and 3 have a distance of 18.
10. Grouping with linkage & distance
To calculate the distance between player 2 and group 1-4, we will use the complete linkage method, which takes the maximum of the distances between observation 2 and each member of group 1-4. The resulting linkage-based distance is 20.6.
11. Grouping with linkage & distance
Applying the same method to player 3, we get a linkage distance of 16.8.
12. Grouping with linkage & distance
Of these three options, the grouping of player 3 with 1 and 4 is selected because it has the smallest distance value.
13. Grouping with linkage & distance
The next round of grouping doesn't require any decision making: we simply aggregate observation 2 with group 1-3-4.
14. Grouping with linkage & distance
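The whole sequence of merges walked through on the previous slides can be sketched as a short loop in plain Python. This is a minimal illustration, not the implementation used in the course; the names `DIST`, `complete_linkage` and `agglomerate` are hypothetical. One part of the data is an assumption: the lesson states only that the larger of player 3's distances to players 1 and 4 is 16.8, so assigning 16.8 to the pair (1, 3) and using 15.8 for the pair (3, 4) is a placeholder.

```python
from itertools import combinations

# Pairwise distances between the four players.
# ASSUMPTION: the lesson gives only the maximum (16.8) of player 3's distances
# to players 1 and 4. Which pair it belongs to, and the 15.8 value for the
# other pair, are placeholders chosen for illustration.
DIST = {
    frozenset({1, 4}): 10.0,
    frozenset({1, 2}): 11.7,
    frozenset({2, 4}): 20.6,
    frozenset({2, 3}): 18.0,
    frozenset({1, 3}): 16.8,  # assumed placement
    frozenset({3, 4}): 15.8,  # assumed value
}


def complete_linkage(a, b):
    """Largest pairwise distance between a member of a and a member of b."""
    return max(DIST[frozenset({i, j})] for i in a for j in b)


def agglomerate(points):
    """Repeatedly merge the two closest groups until one group remains."""
    clusters = [frozenset({p}) for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of groups with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: complete_linkage(*pair))
        merges.append((sorted(a), sorted(b), complete_linkage(a, b)))
        # Replace the two merged groups with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


merges = agglomerate([1, 2, 3, 4])
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

Under these assumptions the loop reproduces the order described above: players 1 and 4 merge at 10, player 3 joins at 16.8, and player 2 joins last at 20.6.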
Now you have an iterative binary grouping of your four observations. The order in which these observations are grouped generates a hierarchy based on distance, and hence is called hierarchical clustering.
15. Linkage criteria
Many different linkage methods have been developed, but for this course you will focus on the three most commonly used ones: complete linkage, which, as we've learned, uses the maximum distance between two sets; single linkage, which uses the minimum distance; and average linkage, which, you guessed it, uses the average distance between two sets. As you progress through this chapter, you will have a chance to see the impact this decision can have on the final clustering.
16. Let's practice!
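Before moving to the exercises, here is a rough sketch of how the three criteria differ, using only the two distances the lesson quotes for player 2 (11.7 to player 1 and 20.6 to player 4). The dictionary name `LINKAGE` is hypothetical, and plain Python is used purely for illustration.

```python
from statistics import mean

# Distances from player 2 to each member of group 1-4, as quoted in the lesson.
distances = [11.7, 20.6]

# Each criterion is a different way of summarising the same pairwise distances.
LINKAGE = {
    "complete": max,   # largest pairwise distance
    "single": min,     # smallest pairwise distance
    "average": mean,   # mean of all pairwise distances
}

result = {name: fn(distances) for name, fn in LINKAGE.items()}
for name, value in result.items():
    print(f"{name:>8}: {value:.2f}")
```

The same pair of distances yields a different group distance under each criterion, which is why the choice can change which observation gets merged next.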
Let's proceed with some exercises.