
Silhouette analysis: observation level performance

1. Silhouette analysis: observation level performance

In the last series of exercises, you utilized the elbow method to estimate a suitable value of k. In this lesson, you will learn about the silhouette analysis method. This approach provides a different lens through which you can understand the results of your cluster analysis. It can be used to determine how well each of your observations fits into its corresponding cluster, and it can be leveraged as an additional method for estimating the value of k.

2. Soccer lineup with K = 3

Continuing with our soccer lineup dataset, we will start with the observations already clustered using kmeans with a k of three.

3. Silhouette width

Silhouette analysis involves calculating a measurement called the silhouette width for every observation. The silhouette width consists of two parts: the within cluster distance C and the closest neighbor distance N. We'll work with player number 3 to illustrate this calculation.

4. Silhouette width

The within cluster distance for an observation is the average Euclidean distance from that observation to every other observation within the same cluster. In this case, the distances are represented by the arrows to the other 3 members of the green cluster.

5. Silhouette width

The closest neighbor distance for an observation is the average distance from that observation to the points of the closest neighboring cluster.

6. Silhouette width

It is calculated for the red cluster like so.

7. Silhouette width

Then the blue cluster. The smallest average distance to our observation is then used as the closest neighbor distance. In this case the blue cluster is clearly closer.
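The two distances described above can be sketched in a few lines. This is a minimal Python illustration, not the lesson's R code: the coordinates, the cluster names, and the helper `mean_distance` are all hypothetical and do not come from the soccer lineup data.

```python
import math

def mean_distance(point, others):
    """Average Euclidean distance from `point` to each point in `others`."""
    return sum(math.dist(point, o) for o in others) / len(others)

# Hypothetical toy clusters (NOT the lesson's soccer lineup data).
clusters = {
    "green": [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0)],
    "blue": [(3.0, 0.0), (5.0, 0.0)],
    "red": [(0.0, 10.0), (0.0, 12.0)],
}
own = "green"
observation_index = 0  # the observation we examine is the first green point
observation = clusters[own][observation_index]

# Within cluster distance C: average distance to the OTHER members of the same cluster.
same_cluster_others = [
    p for j, p in enumerate(clusters[own]) if j != observation_index
]
c = mean_distance(observation, same_cluster_others)

# Average distance to each of the other clusters.
neighbor_means = {
    name: mean_distance(observation, pts)
    for name, pts in clusters.items()
    if name != own
}

# Closest neighbor distance N: the smallest of those averages.
closest = min(neighbor_means, key=neighbor_means.get)
n = neighbor_means[closest]

print(c, closest, n)
```

With this toy data the blue cluster has the smaller average distance, so its average is the one used as N.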

8. Silhouette width: S(i)

Using the values of N and C the silhouette width can be calculated as shown here.
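The slide itself is not reproduced in this transcript. For reference, the standard definition of the silhouette width, written with the C and N defined above, is the following; that the slide shows exactly this form is an assumption.

```latex
S(i) =
\begin{cases}
1 - \dfrac{C(i)}{N(i)} & \text{if } C(i) < N(i) \\[6pt]
0 & \text{if } C(i) = N(i) \\[6pt]
\dfrac{N(i)}{C(i)} - 1 & \text{if } C(i) > N(i)
\end{cases}
```

All three cases are equal to (N(i) - C(i)) / max(C(i), N(i)), which is why the value always lies between -1 and 1.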

9. Silhouette width: S(i)

More important is the intuitive interpretation of this value. A value close to one suggests that this observation is well matched to its current cluster. A value of 0 suggests that it is on the border between two clusters and could belong to either one. A value of -1, or close to -1, suggests that this observation has a better fit with its closest neighboring cluster. What do you think the silhouette width is for player 3? It sits on the border between blue and green, so I'm guessing it's probably close to zero.
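To make the three cases concrete, here is a small Python sketch. It assumes the standard definition S = (N - C) / max(C, N); the function name `silhouette_width` and the input distances are hypothetical, chosen only to hit each case.

```python
def silhouette_width(c, n):
    """Silhouette width from within cluster distance c and closest neighbor distance n.

    Assumes the standard definition (n - c) / max(c, n).
    """
    if c == n:  # also covers the degenerate case c == n == 0
        return 0.0
    return (n - c) / max(c, n)

# Hypothetical distances, one per interpretive case.
well_matched = silhouette_width(c=1.0, n=10.0)  # own cluster is much closer
on_border = silhouette_width(c=3.0, n=3.0)      # both clusters equally close
mismatched = silhouette_width(c=10.0, n=1.0)    # neighboring cluster is much closer

print(well_matched, on_border, mismatched)
```

The first value comes out near 1, the second is exactly 0, and the third is near -1, matching the interpretation above.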

10. Calculating S(i)

We can calculate the silhouette width for each observation using the pam function from the cluster library. Note that the pam function is very similar to kmeans but is not identical. Since we are only using it to characterize our kmeans clusters, we can ignore this difference. The pam function requires a data frame and a desired number of clusters, provided by the parameter k. The silhouette widths can be accessed from the pam model object as shown here.
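The R code on the slide is not reproduced in this transcript. As a rough illustration of what the per-observation widths contain, the Python sketch below computes them directly from a set of cluster assignments. It does not call pam or perform any clustering: the function `silhouette_widths`, the one-dimensional points, and the labels are all hypothetical, and the code assumes there are at least two clusters.

```python
import math

def silhouette_widths(points, labels):
    """Return one silhouette width per observation, given cluster labels.

    Assumes at least two distinct clusters. An observation alone in its
    cluster gets a width of 0, a common convention.
    """
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Sum and count of distances from p to each cluster's members, excluding p.
        sums, counts = {}, {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i == j:
                continue
            sums[lab] = sums.get(lab, 0.0) + math.dist(p, q)
            counts[lab] = counts.get(lab, 0) + 1
        if own not in counts:
            widths.append(0.0)  # singleton cluster
            continue
        c = sums[own] / counts[own]  # within cluster distance
        n = min(sums[lab] / counts[lab] for lab in counts if lab != own)  # closest neighbor
        widths.append(0.0 if c == n else (n - c) / max(c, n))
    return widths

# Hypothetical one-dimensional data with assumed cluster assignments.
points = [(0.0,), (1.0,), (2.5,), (4.0,), (5.0,)]
labels = ["a", "a", "a", "b", "b"]

widths = silhouette_widths(points, labels)
for point, label, width in zip(points, labels, widths):
    print(point, label, round(width, 3))
```

The third point sits exactly halfway between the two groups, so its width is 0, much like the border case discussed for player 3.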

11. Silhouette plot

Or they can be visualized using the silhouette plot like so. In this plot, the bars represent the silhouette widths for each observation. Look at observation three: as we guessed, its value is close to zero.

12. Silhouette plot

Also note that the average silhouette width across the twelve observations is shown at the bottom of this plot.

13. Average silhouette width

This measurement can be retrieved from the model object as shown here, and it can be interpreted in a manner similar to the silhouette width for an observation. In this case, the average is well above zero, suggesting that most observations are well matched to their assigned cluster. Now that you have a way of measuring the effectiveness of the clustering, you can perform an analysis similar to the elbow plot and calculate the average silhouette widths for multiple values of k. The greater the average width, the better the individual observations match their clusters.
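The average silhouette width is simply the mean of the per-observation widths. A minimal Python sketch, using made-up widths rather than the lesson's values:

```python
from statistics import fmean

# Hypothetical per-observation silhouette widths (NOT the lesson's values).
widths = [0.61, 0.64, 0.0, 0.65, 0.74]

# Average silhouette width: the mean over all observations.
average_width = fmean(widths)

# Observations at or below zero are the ones worth a closer look.
poorly_matched = [i for i, w in enumerate(widths) if w <= 0]

print(round(average_width, 3), poorly_matched)
```

An average well above zero can still hide individual observations with widths near or below zero, so the per-observation values remain worth checking.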

14. Highest average silhouette width

Similar to the elbow plot, we can leverage the map_dbl function to run pam across multiple values of k and record the average silhouette width for each. Likewise, we can append these measurements to a data frame.

15. Choosing K using average silhouette width

And use ggplot to see the relationship between k and the average silhouette width.

16. Choosing K using average silhouette width

Not surprisingly, the highest average silhouette width is for a k of two, which would be the recommended value based on this method.
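The selection step can be sketched as follows. This Python illustration is not the lesson's R workflow: instead of fitting a model for each k, it starts from hypothetical, hand-written cluster assignments for k = 2, 3, and 4 on toy one-dimensional data. In a real analysis those assignments would come from running the clustering once per k. The helper `average_silhouette_width` is likewise an assumption for illustration.

```python
import math

def average_silhouette_width(points, labels):
    """Mean silhouette width over all observations (assumes at least two clusters)."""
    widths = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:
            widths.append(0.0)  # singleton cluster: width taken as 0
            continue
        c = sum(own) / len(own)                                # within cluster distance
        n = min(sum(d) / len(d) for d in by_cluster.values())  # closest neighbor distance
        widths.append(0.0 if c == n else (n - c) / max(c, n))
    return sum(widths) / len(widths)

# Hypothetical data: two well-separated groups on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]

# Hypothetical assignments for each candidate k (stand-ins for fitted models).
assignments_by_k = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
    4: [0, 0, 1, 2, 2, 3],
}

avg_width_by_k = {
    k: average_silhouette_width(points, labels)
    for k, labels in assignments_by_k.items()
}

# The recommended k is the one with the highest average silhouette width.
best_k = max(avg_width_by_k, key=avg_width_by_k.get)

for k, width in avg_width_by_k.items():
    print(k, round(width, 3))
print("best k:", best_k)
```

The toy data was built with two obvious groups, so k = 2 scoring highest here reflects that construction and says nothing about the soccer lineup data.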

17. Let's practice!

Now that you know how silhouette analysis works, let's try it out.