Practical matters: working with real data
Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.
The first challenge with the Pokemon data is that there is no predetermined number of clusters. You will determine the appropriate number yourself, keeping in mind that in real data the elbow in the scree plot is often less sharp than in synthetic data. Use your judgment when choosing the number of clusters.
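When the elbow is soft, it can help to look at the numbers behind the scree plot. The following is a minimal sketch, not part of the exercise: it uses randomly generated stand-in data (the matrix `x` is hypothetical, not the Pokemon data) and computes the fractional reduction in total within-cluster sum of squares gained by each additional cluster.

```r
set.seed(42)
# Stand-in data (hypothetical): three loose groups in 6 dimensions
x <- rbind(
  matrix(rnorm(300 * 6, mean = 0), ncol = 6),
  matrix(rnorm(300 * 6, mean = 3), ncol = 6),
  matrix(rnorm(200 * 6, mean = 6), ncol = 6)
)

# Total within-cluster sum of squares for k = 1..15
wss <- sapply(1:15, function(k) {
  kmeans(x, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# Fractional reduction in wss gained by going from k to k + 1 clusters
rel_drop <- -diff(wss) / head(wss, -1)
round(data.frame(k_from = 1:14, k_to = 2:15, rel_drop = rel_drop), 3)
```

A point where `rel_drop` falls off and stays small is one reasonable candidate for the elbow, but the final choice still comes down to judgment.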
The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.
An additional note: this exercise uses the iter.max argument to kmeans(). As you've seen, kmeans() is an iterative algorithm, repeating until some stopping criterion is reached. The default maximum number of iterations for kmeans() is 10, which is not enough for the algorithm to converge on this data, so we'll set iter.max to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g., 3). This is another example of what can happen when you work with real data.
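To compare a low and a high iteration cap side by side, a sketch along these lines may help. It assumes random stand-in data rather than the Pokemon data, and `fit_with_warnings` is a hypothetical helper written for this illustration, not part of base R. It records any warnings that kmeans() raises, along with the `iter` component of the fitted object, which reports how many iterations were used.

```r
# Stand-in data (hypothetical): 800 rows, 6 numeric columns
set.seed(1)
x <- matrix(rnorm(800 * 6), ncol = 6)

# Hypothetical helper: run kmeans() and collect any warning messages
fit_with_warnings <- function(data, centers, iter.max) {
  msgs <- character(0)
  fit <- withCallingHandlers(
    kmeans(data, centers = centers, nstart = 1, iter.max = iter.max),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(fit = fit, warnings = msgs)
}

low  <- fit_with_warnings(x, centers = 3, iter.max = 3)
high <- fit_with_warnings(x, centers = 3, iter.max = 50)

# Iterations used and any warnings captured for each run
low$fit$iter
low$warnings
high$fit$iter
high$warnings
```

Whether the low-cap run actually produces a warning depends on the data and the random starting centers, so the captured messages may be empty on some runs.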
This exercise is part of the course Unsupervised Learning in R.
Exercise instructions
The pokemon dataset, which contains observations of 800 Pokemon characters on 6 dimensions (i.e., features), is available in your workspace.
- Using kmeans() with nstart = 20, determine the total within sum of square errors for different numbers of clusters (between 1 and 15).
- Pick an appropriate number of clusters based on the results from the first instruction and assign that number to k.
- Create a k-means model using k clusters and assign it to the km.out variable.
- Create a scatter plot of Defense vs. Speed, showing cluster membership for each observation.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Initialize total within sum of squares error: wss
wss <- 0
# Loop over 1 to 15 possible clusters
for (i in ___) {
  # Fit the model: km.out
  km.out <- kmeans(___, centers = ___, nstart = ___, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- ___
}
# Produce a scree plot
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
# Select number of clusters
k <- ___
# Build model with k clusters: km.out
km.out <- kmeans(___, centers = ___, nstart = ___, iter.max = 50)
# View the resulting model
km.out
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.out$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")
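
When presenting clusters to other people, marking the cluster centers and adding a legend can make a plot like the one above easier to read. The following is a sketch only, under stated assumptions: `pokemon_demo` is a hypothetical stand-in with random values and just two columns (the exercise clusters on all six features), and `k <- 3` is an arbitrary value chosen for illustration, not a suggested answer.

```r
set.seed(7)
# Stand-in for pokemon (hypothetical random values; two features only)
pokemon_demo <- data.frame(
  Defense = round(runif(800, 5, 230)),
  Speed   = round(runif(800, 5, 180))
)

k <- 3  # arbitrary choice for illustration, not a recommended answer
km_demo <- kmeans(pokemon_demo, centers = k, nstart = 20, iter.max = 50)

# Scatter plot colored by cluster membership
plot(pokemon_demo[, c("Defense", "Speed")],
     col = km_demo$cluster,
     main = paste("k-means clustering with", k, "clusters (stand-in data)"),
     xlab = "Defense", ylab = "Speed")

# Overlay the cluster centers and add a legend
points(km_demo$centers[, c("Defense", "Speed")], pch = 8, cex = 2, lwd = 2)
legend("topright", legend = paste("Cluster", 1:k), col = 1:k, pch = 1)

# Number of observations assigned to each cluster
table(km_demo$cluster)
```

The legend colors line up with the points because both use the integer cluster labels 1 to k as indices into the current color palette.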