Introduction to K-means

1. Introduction to K-means

In the last chapter, you learned how to use the hierarchical clustering method to group observations. In this chapter you will learn about another popular method of clustering called k-means clustering. To learn how this method works, let's revisit an expanded version of the soccer lineup data you have been working with.

2. k-means

This data consists of twelve players on a soccer field at the start of the game, when the two teams are positioned on opposite sides of the field. We would expect clustering to be effective at identifying the teams and assigning each player to the correct one. The first step of k-means clustering is deciding how many clusters to generate. This is the k in k-means clustering. It can be chosen in advance based on our understanding of the data, or it can be estimated from the data empirically. We will discuss the estimation methods later in this chapter. In this example we can leverage what is known about our data: since soccer is played with two teams, we can use a k of 2 for the desired number of clusters. Once k is established, the algorithm can proceed.

3. k-means

The first step in the k-means algorithm is to initialize k points at random positions in the feature space. We will refer to these points as the cluster centroids. In this data we will illustrate our two centroids using a red and a blue x.
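To make the initialization step concrete, here is a minimal sketch in R. The lineup values below are made up for illustration (a hypothetical stand-in for the course data), and the name centroids is an assumption of this sketch.

```r
# Hypothetical stand-in for the lineup data: 12 players with x/y positions
set.seed(42)
lineup <- data.frame(
  x = c(runif(6, -40, -5), runif(6, 5, 40)),
  y = runif(12, -30, 30)
)

k <- 2

# Draw k random positions inside the observed range of each feature
centroids <- data.frame(
  x = runif(k, min(lineup$x), max(lineup$x)),
  y = runif(k, min(lineup$y), max(lineup$y))
)
centroids
```

Note that R's kmeans(), when centers is given as a single number, picks a random set of distinct rows of the data as the starting centers rather than arbitrary positions, so this sketch follows the description above more closely than it follows that function.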

4. k-means

For each observation, the distance between the observation and each centroid is calculated. In k-means clustering, this distance is limited to Euclidean distance only.
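The distance calculation can be sketched as follows. The four player positions and the two centroid positions are invented for illustration, and euclid() is a hypothetical helper, not a function from the course.

```r
# Made-up positions for four players and two centroids
lineup <- data.frame(x = c(-20, -10, 15, 25), y = c(5, -5, 10, -10))
centroids <- data.frame(x = c(-15, 20), y = c(0, 0))

# Euclidean distance from every observation to a single centroid
euclid <- function(points, centre) {
  sqrt((points$x - centre$x)^2 + (points$y - centre$y)^2)
}

# One row per observation, one column per centroid
dists <- sapply(seq_len(nrow(centroids)), function(j) {
  euclid(lineup, centroids[j, ])
})
dists
```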

5. k-means

The observations are initially assigned to the centroid to which they are closest.
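One way to express this assignment step in R is to take, for each observation, the index of the smallest distance. The data and the names dists and assignment are assumptions made for this sketch.

```r
# Made-up positions for four players and two centroids
lineup <- data.frame(x = c(-20, -10, 15, 25), y = c(5, -5, 10, -10))
centroids <- data.frame(x = c(-15, 20), y = c(0, 0))

# Distance matrix: one row per observation, one column per centroid
dists <- sapply(seq_len(nrow(centroids)), function(j) {
  sqrt((lineup$x - centroids$x[j])^2 + (lineup$y - centroids$y[j])^2)
})

# Index of the nearest centroid for each observation
assignment <- apply(dists, 1, which.min)
assignment
```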

6. k-means

We can see this decision boundary represented by the color space.

7. k-means

The observations now have an initial assignment to one of the two clusters.

8. k-means

The next step involves moving the centroids to the central points of the resulting clusters.
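Moving a centroid to the central point of its cluster amounts to taking the mean of each feature over the observations assigned to that cluster. A minimal sketch, using made-up positions and an assumed assignment vector:

```r
# Made-up positions and an assumed current assignment of players to clusters
lineup <- data.frame(x = c(-20, -10, 15, 25), y = c(5, -5, 10, -10))
assignment <- c(1, 1, 2, 2)

# New centroid = mean of each feature within each cluster
centroids <- aggregate(lineup, by = list(cluster = assignment), FUN = mean)
centroids
```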

9. k-means

Again, the distance from every observation to each centroid is calculated.

10. k-means

And the observations are reassigned based on which centroid they are closest to.

11. k-means

This process continues until the centroids stabilize and the observations are no longer reassigned. This is the fundamental algorithm of k-means clustering.
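Putting the steps together, the loop can be sketched roughly as below. This is an illustrative toy, not how R's built-in function is implemented: simple_kmeans() is a hypothetical name, the data is made up, and the sketch starts from k randomly chosen observations instead of arbitrary random positions so that no cluster starts out empty.

```r
simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # Assumption: start from k randomly chosen observations
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- rep(0L, nrow(data))

  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from each observation to each centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(data, 2, centroids[j, ])^2)
    })
    # Nearest centroid for each observation
    new_assignment <- max.col(-dists, ties.method = "first")

    # Stop once no observation changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Move each centroid to the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- data[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Made-up example: two well-separated groups of three players
set.seed(1)
pts <- data.frame(x = c(-20, -18, -22, 20, 18, 22), y = c(1, -1, 0, 1, -1, 0))
res <- simple_kmeans(pts, k = 2)
res$cluster
```

The squared distance is used inside the loop because it gives the same nearest centroid as the Euclidean distance while skipping the square root.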

12. kmeans()

To generate the k-means model in R you will use the kmeans() function. We will continue to work with the lineup data frame that you explored in chapter two. The kmeans() function is run with the data as the first argument and the desired number of clusters provided using the centers parameter. Here, centers is synonymous with k.
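A call along these lines might look as follows. The lineup values here are invented so that the example runs on its own, model_km2 is an assumed name, and set.seed() is added only to make the random starting centroids repeatable.

```r
# Hypothetical stand-in for the lineup data frame (x/y positions of 12 players)
lineup <- data.frame(
  x = c(-30, -25, -20, -15, -12, -8, 8, 12, 15, 20, 25, 30),
  y = c(0, 10, -10, 20, -20, 5, 5, -20, 20, -10, 10, 0)
)

# Fix the random seed so the starting centroids are reproducible
set.seed(123)

# Data as the first argument, number of clusters via centers
model_km2 <- kmeans(lineup, centers = 2)
model_km2$centers
```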

13. Assigning clusters

Once the model is run, you will want to extract the cluster assignments in order to explore their characteristics. The vector of assignments is stored in the model object and is aptly named cluster, so you can extract it directly. As before, you can append this vector to your data frame in order to further explore the results of your clustering.
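As a short sketch of this step, the example below extracts the cluster element and appends it as a new column using base R. The lineup values and the names model_km2, clust_km2 and lineup_km2 are assumptions made for illustration.

```r
# Hypothetical stand-in for the lineup data frame
lineup <- data.frame(
  x = c(-30, -25, -20, -15, -12, -8, 8, 12, 15, 20, 25, 30),
  y = c(0, 10, -10, 20, -20, 5, 5, -20, 20, -10, 10, 0)
)

set.seed(123)
model_km2 <- kmeans(lineup, centers = 2)

# The assignments are stored in the cluster element of the model object
clust_km2 <- model_km2$cluster

# Append the assignments to a copy of the data frame
lineup_km2 <- lineup
lineup_km2$cluster <- clust_km2

# Count how many players fall in each cluster
table(lineup_km2$cluster)
```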

14. Let's practice!

Now that you know how k-means works and how to use it in R, let's practice with the soccer lineup data.