K-means: determine the k

K-means needs the number of clusters as an argument. There are many ways to choose a suitable number of clusters, and which one works best can depend on the data you have.

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes (the calculation of total WCSS was explained in the previous video). When you plot the total WCSS against the number of clusters, a good choice is the number of clusters at which the total WCSS drops sharply.
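As a reminder of what total WCSS measures, here is a minimal sketch that recomputes it by hand and compares it with the value kmeans() reports. The choice of three clusters and the names wcss_by_cluster and total_wcss are only for illustration and are not part of the exercise.

```r
# Boston comes from the MASS package, as in the exercise
library(MASS)
set.seed(123)

# Three clusters is an arbitrary choice for this illustration
km <- kmeans(Boston, centers = 3)

# For each cluster, sum the squared distances of its points to the cluster center
wcss_by_cluster <- sapply(seq_len(nrow(km$centers)), function(j) {
  members <- as.matrix(Boston[km$cluster == j, , drop = FALSE])
  centered <- sweep(members, 2, km$centers[j, ])
  sum(centered^2)
})

# Total WCSS is the sum over all clusters
total_wcss <- sum(wcss_by_cluster)

# Compare with the value stored in the kmeans() result
all.equal(total_wcss, km$tot.withinss)
```

The per-cluster values correspond to km$withinss, and their sum to km$tot.withinss, which is the quantity the exercise code extracts for each k.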

K-means might produce different results every time, because it chooses the initial cluster centers at random. Calling set.seed() before kmeans() fixes the state of the random number generator, so the results are reproducible.
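A small sketch of the effect of set.seed(), assuming the Boston data from MASS and an arbitrary choice of three clusters: resetting the seed to the same value before each call makes kmeans() start from the same initial centers, so both runs give the same assignment.

```r
library(MASS)

# First run with a fixed seed
set.seed(123)
first <- kmeans(Boston, centers = 3)

# Second run after resetting the same seed
set.seed(123)
second <- kmeans(Boston, centers = 3)

# Both runs assign every observation to the same cluster
identical(first$cluster, second$cluster)
```

Without the second set.seed() call the random number generator would continue from where the first run left off, and the two results could differ. The nstart argument of kmeans() is another option: it tries several random starts and keeps the best one.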

This exercise is part of the course

Helsinki Open Data Science

Exercise instructions

Set the max number of clusters (k_max) to be 10
Execute the code to calculate total WCSS. This might take a while.
Visualize the total WCSS when the number of clusters goes from 1 to 10. The optimal number of clusters is where the value of total WCSS changes radically. In this case, two clusters would seem optimal.
Run kmeans() again with two clusters and visualize the results
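Besides reading the plot, the "radical change" can also be looked at numerically. The following is only a sketch under the exercise's own setup (Boston from MASS, k from 1 to 10); the names drops and drop_table are hypothetical helpers, not part of the course code.

```r
library(MASS)
set.seed(123)

k_max <- 10

# Total WCSS for k = 1, ..., k_max, as in the exercise code
twcss <- sapply(1:k_max, function(k) kmeans(Boston, k)$tot.withinss)

# Decrease in total WCSS when going from k - 1 to k clusters
drops <- -diff(twcss)

# One row per k, showing how much total WCSS fell by adding that cluster
drop_table <- data.frame(k = 2:k_max, drop = drops)
drop_table
```

A large drop followed by much smaller ones points to the same k that stands out in the line plot. This is a heuristic rather than a formal test, so it is worth checking against the plot.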

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- "change me!"

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(Boston, centers = "change me!")

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
