Introduction to k-means clustering

1. Introduction to k-means clustering

Now that we have a conceptual understanding of unsupervised learning and its different goals, let's dig right in with one popular approach.

2. k-means clustering algorithm

K-means is a clustering algorithm, that is, an algorithm used to find homogeneous subgroups within a population. It is the first of two clustering algorithms covered in this course. The k-means algorithm works by first assuming a number of subgroups, or clusters, in the data and then assigning each observation to one of those subgroups. For example, one might hypothesize that the data shown on the screen contain 2 subgroups. The k-means algorithm would then assign all observations in the top right-hand corner to one subgroup and all observations in the bottom left-hand corner to the other subgroup. In the next video, we will go deeper into how the k-means algorithm achieves this goal.

3. k-means in R

K-means comes with the base R install, so no extra package is needed. Invoking it is simply a call to the kmeans() function, typically with three parameters. The first parameter is the data, represented as 'x' here. In k-means, as in many machine learning algorithms, the data are structured as a matrix with one observation per row and one feature per column. The next parameter is the predetermined number of groups, or clusters. This parameter is called 'centers', for reasons that will be covered in the next video. Finally, the k-means algorithm has a random component. The implication of this stochastic component is that a single run of 'kmeans' may not find the optimal solution. To reduce the impact of the random component, 'kmeans' can be run multiple times, with the 'best' outcome across all runs selected as the single result. 'nstart' is the parameter that specifies the number of times 'kmeans' will be repeated. There are other parameters to 'kmeans', and I encourage you to check those out in the R documentation when you are ready.
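
As a minimal sketch of the call described above, the following uses a small made-up dataset. The data, the seed, and the variable names 'x' and 'km' are assumptions for illustration and are not the course's data; only kmeans() and its 'centers' and 'nstart' parameters come from the video.

```r
# Hypothetical toy data: two well-separated groups of 50 points each,
# measured on 2 features. This is NOT the data shown in the video.
set.seed(1)  # fix the seed so the random parts are repeatable
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2)
)

# One observation per row, one feature per column.
dim(x)

# Ask for 2 clusters and repeat the algorithm from 20 random starts.
# kmeans() keeps the run with the lowest total within-cluster sum of squares.
km <- kmeans(x, centers = 2, nstart = 20)

km$cluster       # cluster assignment (1 or 2) for each observation
km$centers       # coordinates of the two cluster centers
km$tot.withinss  # total within-cluster sum of squares of the selected run
```

Because the cluster numbers themselves are arbitrary, the first group may be labeled 1 in one run and 2 in another; what matters is which observations end up together.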

4. First exercises

The first exercises use synthetic data that were generated from three subgroups. If you plot the data, however, there might appear to be only two subgroups. Later in this chapter, you will see how k-means can be used to estimate the number of subgroups when that number is not known a priori. You will also get experience applying 'kmeans' to a real-world, but fun, dataset.
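
To get a feel for this kind of data before the exercises, here is a rough sketch that simulates three subgroups, two of which overlap. The group means, sizes, and seed are assumptions chosen for illustration; the actual exercise data may have been generated differently.

```r
# Hypothetical stand-in for the exercise data: three subgroups of 100 points.
# Subgroups 1 and 2 sit close together, so a plot may look like two groups.
set.seed(42)
n <- 100
x <- rbind(
  matrix(rnorm(2 * n, mean = 0), ncol = 2),  # subgroup 1
  matrix(rnorm(2 * n, mean = 2), ncol = 2),  # subgroup 2, overlaps subgroup 1
  matrix(rnorm(2 * n, mean = 8), ncol = 2)   # subgroup 3, far from the others
)

# Raw scatter plot, without any cluster information.
plot(x, xlab = "feature 1", ylab = "feature 2")

# Run k-means with the known number of subgroups and color points by cluster.
km <- kmeans(x, centers = 3, nstart = 20)
km$size  # number of observations assigned to each cluster
plot(x, col = km$cluster, xlab = "feature 1", ylab = "feature 2",
     main = "k-means with 3 clusters")
```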

5. Let's practice!

With that information, let's get started on the first exercise using 'kmeans'.