Basics of hierarchical clustering

1. Basics of hierarchical clustering

Hello everyone! In the previous chapter, you were introduced to the basics of two clustering algorithms. This chapter focuses on performing hierarchical clustering with SciPy. This video looks at the various parameters of the hierarchical clustering algorithm.

2. Creating a distance matrix using linkage

A critical step in hierarchical clustering is to compute the distance matrix at each stage. This is achieved through the linkage method available in scipy-dot-cluster-dot-hierarchy. This method computes the distances between clusters as we go from N clusters to 1 cluster, where N is the number of points. There are four parameters for this method. The first parameter is the observations. The second parameter, method, tells the algorithm how to calculate the proximity between two clusters. The metric is the function that decides the distance between two objects; Euclidean distance, for instance, is the straight-line distance between two points on a 2D plane. You can also use your own function here. The optimal_ordering parameter is an optional argument that changes the order of the linkage matrix; we will not use it. A minimal sketch of the call follows, after which let us explore the method argument.
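Here is a minimal sketch of the call; the sample points and variable names are illustrative, not from the course:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative observations: five points on a 2D plane
points = np.array([[1.0, 2.0],
                   [2.5, 4.5],
                   [2.0, 2.0],
                   [4.0, 1.5],
                   [4.0, 2.5]])

# observations, method, metric, optimal_ordering
distance_matrix = linkage(points, method='ward',
                          metric='euclidean', optimal_ordering=False)
print(distance_matrix)  # one merge per row: cluster 1, cluster 2, distance, size
```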

3. Which method should you use?

The second parameter, method, decides how the proximity between clusters is measured at each step. This is the parameter that we will tweak in this lesson to see the differences. The single method decides the proximity of two clusters based on their two closest objects. At the other extreme, the complete method decides the proximity of two clusters based on their two farthest objects. The average method uses the arithmetic mean of pairwise distances between cluster objects, while the centroid method uses the geometric centers, or centroids, of the clusters. The median method uses the median of cluster objects. Finally, the ward method that we used earlier computes cluster proximity as the increase in summed squares: the summed square of the joint cluster minus the individual summed squares. The ward method favors clusters that are dense towards their centers.
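As a rough illustration of how the choice of method changes the result, this sketch runs linkage with each method on the same illustrative points and prints the proximity of the final merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0, 2.0], [2.5, 4.5], [2.0, 2.0],
                   [4.0, 1.5], [4.0, 2.5]])

for m in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(points, method=m, metric='euclidean')
    # The last row of the linkage matrix records the final merge;
    # column 2 holds the proximity at which it happened.
    print(m, Z[-1, 2])
```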

4. Create cluster labels with fcluster

Once you have created the distance matrix, you can create the cluster labels through the fcluster method, which takes three arguments: the distance matrix, the number of clusters, and the criterion to form the clusters based on a certain threshold. We will use the value maxclust for the criterion argument.
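A minimal sketch, reusing the illustrative points from earlier:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 2.0], [2.5, 4.5], [2.0, 2.0],
                   [4.0, 1.5], [4.0, 2.5]])
distance_matrix = linkage(points, method='ward', metric='euclidean')

# distance matrix, number of clusters, criterion
cluster_labels = fcluster(distance_matrix, 2, criterion='maxclust')
print(cluster_labels)  # one integer label per observation
```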

5. Hierarchical clustering with ward method

Let us explore the differences between these methods by performing hierarchical clustering on a list of points on a 2D plane. This is the result using the ward method. Notice that the clusters are generally dense towards their centers.
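Here is a hedged sketch of how such a plot could be produced; the DataFrame, points, and plotting choices are illustrative assumptions, not the course's exact code:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative points on a 2D plane
df = pd.DataFrame({'x': [1.0, 2.5, 2.0, 4.0, 4.0, 1.5, 5.0, 5.5],
                   'y': [2.0, 4.5, 2.0, 1.5, 2.5, 2.5, 5.0, 4.5]})

# Ward method: merges that minimize the increase in summed squares
distance_matrix = linkage(df[['x', 'y']], method='ward', metric='euclidean')
df['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')

sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()
```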

6. Hierarchical clustering with single method

Next, we will use the single method to see how the clusters change. Recall that the single method uses the two closest objects between clusters to determine the inter-cluster proximity. Naturally, the clusters formed through this method are more dispersed. Although the top cluster, labeled 1, is roughly the same, most objects from cluster 3 have shifted to cluster 2.
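Continuing the earlier sketch, with the same imports and illustrative df, only the method argument changes:

```python
# Same sketch as above; only the method argument changes to 'single'
distance_matrix = linkage(df[['x', 'y']], method='single', metric='euclidean')
df['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')

sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()
```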

7. Hierarchical clustering with complete method

In the next and final iteration, we look at the clusters formed by the complete method. This method uses the two farthest objects between clusters to determine the inter-cluster proximity. Coincidentally, on the data points that we used, the results of the complete method are similar to those of the ward method.
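Again continuing the same sketch, with only the method argument changed:

```python
# Same sketch again; the method argument switches to 'complete'
distance_matrix = linkage(df[['x', 'y']], method='complete', metric='euclidean')
df['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')

sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()
```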

8. Final thoughts on selecting a method

Here are a few thoughts before we complete this lesson. First, there is no single right method that you can apply to every problem that you face. You will need to carefully study the data at hand to decide which method is right for your case, a process that falls outside the scope of this course.

9. Let's try some exercises

It is now time for you to try some exercises.