1. Evaluating different values of k by eye
In the last two exercises, you explored the results for two different values of k using the same data. You knew that a k of 3 was clearly incorrect because you applied subject-matter knowledge: a game of soccer has only two teams, and both teams have the same number of players.
But what happens when you don't know the right value of k in advance?
In this course, you will learn two methods that address this challenge by estimating k empirically from the data. In this video and the accompanying exercises, you will build an intuition for one of these methods: the elbow method.
2. Total within-cluster sum of squares: k = 1
The elbow method relies on calculating the total within-cluster sum of squares: the sum, across every cluster, of the squared Euclidean distances between each observation and the centroid of the cluster to which that observation is assigned.
Here, those distances are represented by the dashed lines between the centroid and each observation.
While k = 1 isn't really clustering, it gives a useful baseline for the elbow analysis, so we record the total within-cluster sum of squares for k = 1.
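To make the definition concrete, here is a minimal sketch of the k = 1 calculation done by hand and checked against kmeans. The data frame obs is synthetic and only stands in for the data on the slide; its name and values are assumptions made for illustration.

```r
# Synthetic two-group data; an assumed stand-in, not the data from the slide
set.seed(42)
obs <- data.frame(
  x = c(rnorm(6, mean = -10, sd = 2), rnorm(6, mean = 10, sd = 2)),
  y = rnorm(12, mean = 0, sd = 5)
)

# With k = 1, the single centroid is the column-wise mean of all observations
centroid <- colMeans(obs)

# Squared Euclidean distance from each observation to that centroid
sq_dist <- rowSums(sweep(as.matrix(obs), 2, centroid)^2)

# Total within-cluster sum of squares for k = 1
tot_withinss_k1 <- sum(sq_dist)

# kmeans() stores the same quantity in the tot.withinss element
model_k1 <- kmeans(obs, centers = 1)
all.equal(tot_withinss_k1, model_k1$tot.withinss)
```

For larger values of k the calculation is the same, except that each observation is measured against the centroid of its own cluster.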
3. Total within-cluster sum of squares: k = 2
We repeat this step for k = 2.
You can already see that the dashed lines are, on average, shorter, so we can expect the total within-cluster sum of squares to drop.
And, of course, it does.
4. Total within-cluster sum of squares: k = 3
The same goes for k = 3.
5. Total within-cluster sum of squares: k = 4
And for k = 4.
We can continue this calculation so long as k is less than our total number of observations.
6. Elbow plot
In this case, we have calculated it for values of k from 1 through 10.
You may notice a trend: as k increases, the total within-cluster sum of squares keeps decreasing. This is natural and expected. The more you segment your data, the more the points group into smaller, more compact clusters, until you end up with many clusters that have only one or two members.
What we are looking for is the point at which the curve begins to flatten out, affectionately referred to as the elbow. In this case, there is a precipitous drop going from a k of 1 to 2, followed by a leveling off from a k of 2 to 3 and onward.
7. Elbow plot
As such, we can claim that the elbow in this case occurs at k = 2, and we would consider using this estimated value of k.
8. Generating the elbow plot
Now that you know how the elbow plot is built, let's learn how to build it in R.
The first thing you need to know is how to calculate the total within-cluster sum of squares.
Conveniently, the kmeans function already takes care of this for you. All you need to do is extract it from the model object, where it is stored as tot.withinss.
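A minimal sketch of that extraction might look like the following. The data frame obs is synthetic and its name is an assumption; substitute your own numeric data.

```r
# Synthetic two-group data; an assumed stand-in for your own observations
set.seed(42)
obs <- data.frame(
  x = c(rnorm(6, mean = -10, sd = 2), rnorm(6, mean = 10, sd = 2)),
  y = rnorm(12, mean = 0, sd = 5)
)

# Fit a k-means model with k = 2
model <- kmeans(x = obs, centers = 2)

# Extract the total within-cluster sum of squares from the model object
model$tot.withinss
```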
9. Generating the elbow plot
Because you want to calculate this for multiple values of k, you will need to create multiple models and extract their corresponding values.
To do this, I recommend using the map_dbl function from the purrr package.
The code shown here iterates over values of k ranging from 1 to 10, building a model for each and extracting its total within-cluster sum of squares.
You can combine this vector with the corresponding vector of k values to create a data frame.
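A sketch of this step, assuming purrr is installed, could look like the following. The names obs, tot_withinss, and elbow_df are illustrative, and the data is synthetic.

```r
library(purrr)

# Synthetic two-group data; an assumed stand-in for your own observations
set.seed(42)
obs <- data.frame(
  x = c(rnorm(25, mean = -10, sd = 2), rnorm(25, mean = 10, sd = 2)),
  y = rnorm(50, mean = 0, sd = 5)
)

# For each k from 1 to 10, fit a model and keep its total within-cluster
# sum of squares; map_dbl() returns the results as a numeric vector
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = obs, centers = k)
  model$tot.withinss
})

# Pair each k with its value in a data frame
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)
elbow_df
```

Because kmeans starts from randomly chosen centers, the exact values can vary between runs unless you fix the seed as done here.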
10. Generating the elbow plot
You can then use this data frame to draw the elbow plot, with k on the x-axis and the total within-cluster sum of squares on the y-axis.
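One way to draw the plot is with ggplot2, sketched below. This assumes both purrr and ggplot2 are installed. The data is synthetic, and the names obs, elbow_df, and elbow_plot are illustrative rather than taken from the slides.

```r
library(purrr)
library(ggplot2)

# Synthetic two-group data; an assumed stand-in for your own observations
set.seed(42)
obs <- data.frame(
  x = c(rnorm(25, mean = -10, sd = 2), rnorm(25, mean = 10, sd = 2)),
  y = rnorm(50, mean = 0, sd = 5)
)

# Total within-cluster sum of squares for k = 1 through 10
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = map_dbl(1:10, function(k) {
    kmeans(x = obs, centers = k)$tot.withinss
  })
)

# Line plot of total within-cluster sum of squares against k,
# with one axis tick per value of k so the elbow is easy to read off
elbow_plot <- ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

elbow_plot
```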
11. Let's practice!
Let's try it out!