1. Clustering analysis: choosing the optimal number of clusters
Welcome to the last lesson of Chapter 3. You've really come a long way! As promised, we're going to cover how to select the optimal k for clustering methods, particularly for k-means.
2. Methods for optimal k
The two most widely used methods for determining the optimal value of k for a given dataset are the silhouette method and the elbow method.
3. Silhouette coefficient
The silhouette method uses the silhouette coefficient, which is built from two scores for each observation: a, the mean distance between that observation and all other observations in the same cluster, and b, the mean distance between that observation and all observations in the next nearest cluster. The coefficient is then (b - a) / max(a, b).
4. Silhouette coefficient values
The value of the coefficient is between -1 and 1. A value of 1 means the observation is very near the other observations in its cluster and far away from observations in other clusters. Negative 1 is the worst score: the observation is far from the members of its own cluster, close to observations in other clusters, and may actually have been assigned to the wrong cluster. A score of 0 indicates overlap among the clusters; in other words, the observation lies on or close to the decision boundary between two clusters.
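As a minimal sketch (assuming scikit-learn is installed, with made-up toy data), you can check the per-observation coefficients yourself with silhouette_samples, which returns one coefficient per observation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Toy data: three well-separated blobs (illustrative, not from the lesson)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
coefs = silhouette_samples(X, labels)  # one coefficient per observation

# Every coefficient falls in [-1, 1]; well-separated clusters score high
print(coefs.min(), coefs.max())
```

For cleanly separated blobs like these, most coefficients land well above 0.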
5. Silhouette score
Luckily, there is a convenient function from sklearn.metrics called silhouette_score which, when called on the data matrix and the labels from a trained KMeans model, returns the mean silhouette coefficient of all observations as one simple-to-interpret score. This is an example of a fancy plot where this score is used on the classic iris dataset. If you'd like to have a look at the code used to create it, the link is at the bottom of the slide.
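A quick sketch of how silhouette_score can be used to compare candidate values of k on the iris data (this loop and the range of k values are illustrative choices, not from the slide):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette coefficient over all observations, between -1 and 1
    print(k, round(silhouette_score(X, labels), 3))
```

A higher mean silhouette suggests a better k; on iris, k=2 typically scores highest because two of the three species overlap heavily.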
6. Elbow method
The elbow method is simply a visualization technique: if the resulting plot looks like an arm, then the elbow of that arm marks the optimal k.
It uses the sum of the squared distances from each observation to its nearest cluster center, or centroid as it's also called, which you have also come to know as the inertia_ attribute of a trained KMeans model. As you can see in this elbow plot, the sum of squares continues to decrease as k increases. Intuitively, this makes sense: the more clusters there are, the closer any given observation will be to one of them. So selecting the k with the lowest sum of squares isn't going to be the best approach.
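The elbow loop can be sketched like this (assuming scikit-learn, with made-up blob data standing in for the lesson's dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: four blobs, so the elbow should appear near k=4
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sum of squared distances from each observation to its nearest centroid
    inertias.append(model.inertia_)

# Inertia keeps shrinking as k grows; the elbow is where the drop levels off.
# To visualize: plt.plot(range(1, 9), inertias, marker="o")
```

Note that inertia alone never increases with k, which is exactly why you look for the bend rather than the minimum.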
In the exercises, you'll plot the inertia for a range of k values to find the elbow and compare the result to the hierarchical agglomerative clustering dendrogram and model exercises from the last lesson.
7. Optimal k selection functions
A few of the functions, some of which will likely be review but are mentioned here just in case: from sklearn, cluster.KMeans provides the k-means clustering algorithm, and metrics.silhouette_score returns the mean silhouette coefficient, a score between -1 and 1. The .inertia_ attribute gives the sum of the squared distances from the observations to the closest centroid for a trained KMeans model. range(), of course, is a built-in function that yields the values from the first argument up to, but not including, the second. And the .append() method adds the value passed as its argument to an existing list, which you'll use specifically to collect inertia values in the elbow method exercise.
8. Let's practice!
Alright, it's your turn to practice finding the best value of k once and for all!