Clustering linkage and practical matters
1. Clustering linkage and practical matters
I have a couple more details of hierarchical clustering to cover before wrapping up this chapter.

2. Linking clusters in hierarchical clustering
The first detail to cover is how the distance between clusters is determined. As soon as the first two observations are combined into a cluster, the hierarchical clustering algorithm needs a rule for measuring the distance between clusters. R offers four common linkage methods for measuring the distance, or dissimilarity, between clusters. The first method, and the default for the hclust() function, is the 'complete' method. In complete linkage, the distance is computed pairwise between every observation in cluster 1 and every observation in cluster 2, and the largest of these distances is used as the distance between the clusters. The second common linkage method is the 'single' method. Again, the pairwise distances between points in the two clusters are calculated, but the smallest such distance is used as the distance between the clusters. The third common linkage method is the 'average' method, which uses the average of the pairwise distances as the distance between the two clusters. The final method works a little differently: in the 'centroid' method, the centroid of cluster 1 and the centroid of cluster 2 are calculated, and the distance between the clusters is the distance between those centroids.
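As a rough illustration of these four definitions (a sketch, not code from the lesson), assume two tiny clusters of two-dimensional points stored in the hypothetical matrices c1 and c2:

# Two hypothetical clusters of 2-D points
c1 <- rbind(c(0, 0), c(1, 0))
c2 <- rbind(c(4, 1), c(5, 2))

# All pairwise Euclidean distances between points in c1 and points in c2
pairwise <- as.matrix(dist(rbind(c1, c2)))[1:2, 3:4]

max(pairwise)                               # complete linkage: largest pairwise distance
min(pairwise)                               # single linkage: smallest pairwise distance
mean(pairwise)                              # average linkage: mean of the pairwise distances
sqrt(sum((colMeans(c1) - colMeans(c2))^2))  # centroid linkage: distance between the centroids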
3. Linking methods: complete and average

In practice, the choice of linkage is one of those model parameters you will need to choose based on your data and the insights you are trying to extract. As a rule of thumb, 'complete' and 'average' tend to produce more balanced trees and are the most commonly used. Here we see the same data with complete linkage on the left-hand side and average linkage on the right-hand side.
4. Linking method: single

'Single' linkage tends to produce trees where observations are fused one at a time, resulting in unbalanced trees.
5. Linking method: centroid

'Centroid' linkage can create inversions, where two clusters are fused at a height below either of the individual clusters; this is undesirable behavior, and as such this method is used much less often than the others. You can see the inversion here in the clusters with the boxes around them: those clusters have been fused into the tree below the heights at which the individual clusters themselves were fused.
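The dendrograms on these slides are not reproduced in the transcript, but as a sketch, similar plots could be drawn with one hclust() model per linkage, assuming a hypothetical numeric data matrix x:

# Plot one dendrogram per linkage method, assuming a numeric matrix x (hypothetical)
d <- dist(x)

par(mfrow = c(2, 2))                                     # 2 x 2 grid of plots
plot(hclust(d, method = "complete"), main = "Complete")
plot(hclust(d, method = "average"),  main = "Average")
plot(hclust(d, method = "single"),   main = "Single")
plot(hclust(d, method = "centroid"), main = "Centroid")  # may show inversions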
6. Linkage in R

Specifying linkage in R is simply a matter of setting the 'method' argument in the call to the hclust() function. The value of the argument is a string naming the linkage method; here we show creating hierarchical clustering models with complete, average, and single linkage.
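A minimal sketch of those calls, assuming a distance matrix d already created with dist() (the object names here are illustrative):

hclust.complete <- hclust(d, method = "complete")  # complete linkage
hclust.average  <- hclust(d, method = "average")   # average linkage
hclust.single   <- hclust(d, method = "single")    # single linkage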
7. Practical matters

As a final practical matter, many machine learning methods, including k-means and hierarchical clustering, are sensitive to features that are on different scales or units of measurement. To resolve this, the data is passed through a linear transformation before clustering. This transformation subtracts the mean of a feature from each of its observations and divides each observation by the standard deviation of the feature. It is sometimes referred to as normalization and has the effect of producing features with a mean of zero and a standard deviation of one. If you know any of the features are on different scales or units of measure, then it is customary to normalize all the features. Even when the same scales and units of measure are used, it is good practice to check the variability of the means and standard deviations of the features; if the means and standard deviations vary across the features, scaling is in order.
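As a minimal sketch of the transformation itself, applied to a single hypothetical feature vector f:

f <- c(10, 20, 30, 40, 50)         # hypothetical feature values

f.scaled <- (f - mean(f)) / sd(f)  # subtract the mean, divide by the standard deviation

mean(f.scaled)                     # approximately 0
sd(f.scaled)                       # 1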
8. Practical matters

To check the means of all the features, the colMeans() function is used, passing in the data matrix. Because the features are in the columns, this returns the mean value of each feature across the given observations. To calculate the standard deviation of each feature, the apply() function is used, applying the sd() function to each column, that is, over margin 2 of the matrix.
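A sketch of those two checks, assuming a hypothetical numeric data matrix x with observations in rows and features in columns:

colMeans(x)      # mean of each feature (column)
apply(x, 2, sd)  # standard deviation of each feature; margin 2 means columns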
9. Practical matters

Producing a matrix where all features have been normalized is done by passing the original matrix to the scale() function in R. The output is a matrix of the same size, with each feature normalized. Here we show checking that the normalized matrix has column means of zero, within floating point precision, and column standard deviations of one.
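A sketch of that step and the checks, again assuming the same hypothetical matrix x:

scaled.x <- scale(x)    # normalize each feature (column)

colMeans(scaled.x)      # all approximately 0, within floating point precision
apply(scaled.x, 2, sd)  # all equal to 1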
10. Let's practice!

Alright, let's get some practice.