1. Clustering analysis: selecting the right clustering algorithm
Hello again! In this video we'll discuss how to choose a clustering algorithm.
2. Clustering algorithms
Clustering is one of the most frequently used forms of unsupervised learning. When the number of features is large compared to the number of observations, known as the curse of dimensionality, effective model training becomes more challenging. This is especially true for clustering algorithms, since they rely on distance calculations.
3. Practical applications of clustering
Some of the more common uses of clustering are customer segmentation, document classification, detection of anomalies like insurance or transaction fraud, image segmentation, and many others.
4. Distance metrics: Manhattan (taxicab) distance
The Manhattan, or taxicab, distance gets its name from the grid-like way the streets are laid out in Manhattan. The red, yellow, and blue lines in this grid each trace a path between the black origin and destination points, and their length is calculated by taking the sum of the absolute differences of the points' Cartesian coordinates. Here they happen to be equal in length, but all are longer than the green line, which represents the Euclidean distance.
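As a quick sketch, not shown in the video, the Manhattan distance between two made-up points can be computed with SciPy's cityblock function:

```python
# Manhattan (taxicab) distance: sum of absolute coordinate differences.
# The two points here are assumptions for illustration only.
from scipy.spatial.distance import cityblock

origin = (1, 2)
destination = (4, 6)

# |4 - 1| + |6 - 2| = 7
print(cityblock(origin, destination))  # 7
```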
5. Distance metrics: Euclidean distance
This shortest path, the Euclidean distance, is derived from the Pythagorean theorem. Remember, a squared plus b squared equals c squared?
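For comparison, here is the Euclidean distance between the same two made-up points, this time using SciPy's euclidean function:

```python
# Euclidean distance: straight-line distance from the Pythagorean theorem.
# The same illustrative points as in the Manhattan example above.
from scipy.spatial.distance import euclidean

origin = (1, 2)
destination = (4, 6)

# sqrt(3**2 + 4**2) = 5.0, shorter than the Manhattan distance of 7
print(euclidean(origin, destination))  # 5.0
```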
6. K-means
The first and perhaps most well-known clustering algorithm is k-means. It essentially has 3 steps. The first involves choosing the initial centroids, the locations of the centers of the clusters. Then, each observation is assigned to its nearest centroid, and new centroids are created by taking the mean of all the observations assigned to each centroid. These last 2 steps are repeated until the centroids no longer move significantly.
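A minimal sketch of these steps with scikit-learn's KMeans; the synthetic blob data and the choice of 3 clusters are assumptions for illustration:

```python
# k-means in scikit-learn: fit() runs the assign-and-update loop internally
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)        # cluster assignment for each observation

print(kmeans.cluster_centers_)        # final centroid locations
```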
7. Hierarchical agglomerative clustering
Hierarchical clustering involves successively merging or splitting observations, with the hierarchy represented as a tree known as a dendrogram. Agglomerative clustering uses a bottom-up approach in which each observation starts in its own cluster and clusters are successively merged based on a given linkage criterion. Using the dendrogram to select the number of clusters depends on both the linkage criterion and the distance threshold. Here, a horizontal red line crosses 4 vertical lines, representing the 4 clusters found in the dataset.
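A rough sketch of building a dendrogram with SciPy; the sample data, the ward linkage, and the threshold value are assumptions for illustration:

```python
# Build and plot a dendrogram, then draw a horizontal cut line
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=4, random_state=42)

Z = hierarchy.linkage(X, method='ward')  # the merge hierarchy
hierarchy.dendrogram(Z)                  # the tree of successive merges
plt.axhline(y=10, color='red')           # threshold line crossing the branches
plt.show()
```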
8. Agglomerative clustering linkage
Ward linkage, which minimizes the sum of the squared distances within all clusters, is like using Euclidean distance. The others are maximum or complete linkage, which minimizes the maximum distance between observations in pairs of clusters; average linkage, which minimizes the average of the distances between observations in pairs of clusters; and single linkage, which minimizes the distance between the observations that are closest in pairs of clusters.
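As a hedged sketch, the linkage criterion is set with the linkage parameter of scikit-learn's AgglomerativeClustering; the data and cluster count below are assumptions for illustration:

```python
# Compare the four linkage options on the same synthetic data
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=4, random_state=42)

for linkage in ('ward', 'complete', 'average', 'single'):
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, labels[:10])   # first few cluster assignments per linkage
```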
9. Selecting a clustering algorithm
As is the case with most things in machine learning, there is no single best way to select a clustering algorithm. One way, however, is to assess cluster stability, which can be done by comparing algorithms that share some similarity. For example, k-means and hierarchical clustering both use Euclidean distance and are therefore comparable. Intra-cluster distance can be computed as the mean of the distances between the points of a cluster and its centroid. Inter-cluster distance can be computed as the mean of the distances between clusters' centroids. In any well-formed cluster, the intra-cluster distance should be less than the inter-cluster distance. And to quote from the book The Elements of Statistical Learning, an appropriate dissimilarity measure is far more important in obtaining success with clustering than the choice of clustering algorithm.
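A rough sketch of that intra- versus inter-cluster distance check using k-means centroids; the data and the number of clusters are assumptions for illustration:

```python
import numpy as np
from scipy.spatial.distance import euclidean, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)

# Intra-cluster distance: mean distance from each point to its own centroid
intra = np.mean([euclidean(x, kmeans.cluster_centers_[label])
                 for x, label in zip(X, kmeans.labels_)])

# Inter-cluster distance: mean pairwise distance between centroids
inter = np.mean(pdist(kmeans.cluster_centers_))

print(intra < inter)  # well-formed clusters should satisfy this
```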
10. Clustering functions
A few of the functions you'll use from sklearn dot cluster are KMeans and AgglomerativeClustering, which return their respective clustering algorithms. A trained KMeans model has an attribute inertia underscore, which gives the sum of the squared distances of observations to their closest cluster center. For building dendrograms, scipy dot cluster dot hierarchy is imported, after which the dendrogram function can be called.
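A small sketch of the inertia underscore attribute mentioned above; the blob data is an assumption for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
print(kmeans.inertia_)  # sum of squared distances to the closest cluster center
```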
11. Let's practice!
Your turn to practice clustering!