Practical matters: scaling
Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.
In this exercise, you will go back to working with "real" data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.
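To see why differing feature distributions matter, here is a minimal sketch on made-up data. The matrix toy and its column names are hypothetical and are not part of the course material. It compares Euclidean distances before and after scaling when one feature has a much larger spread than the other.

```r
# Hypothetical toy data: two features with very different spreads
set.seed(1)
toy <- cbind(height_cm = rnorm(5, mean = 170, sd = 10),
             weight_g  = rnorm(5, mean = 70000, sd = 15000))

# Unscaled distances are driven almost entirely by the wide-ranging column
d_raw         <- dist(toy)
d_weight_only <- dist(toy[, "weight_g", drop = FALSE])
cor(as.vector(d_raw), as.vector(d_weight_only))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)
colMeans(toy_scaled)
apply(toy_scaled, 2, sd)

# After scaling, both features contribute comparably to the distances
d_scaled <- dist(toy_scaled)
```

Because hierarchical clustering works from these pairwise distances, a feature with a much larger spread would otherwise dominate the resulting tree.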
This exercise is part of the course Unsupervised Learning in R.
Exercise instructions
The data is stored in the pokemon object in your workspace.
- Observe the mean of each variable in pokemon using the colMeans() function.
- Observe the standard deviation of each variable using the apply() and sd() functions. Since the variables are the columns of your matrix, make sure to specify 2 as the MARGIN argument to apply().
- Scale the pokemon data using the scale() function and store the result in pokemon.scaled.
- Create a hierarchical clustering model of the pokemon.scaled data using the complete linkage method. Manually specify the method argument and store the result in hclust.pokemon.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# View column means
# View column standard deviations
# Scale the data
# Create hierarchical clustering model: hclust.pokemon
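One way the skeleton above might be completed is sketched below. Since the course's pokemon object is not available here, the sketch builds a stand-in matrix with made-up values and hypothetical column names; in the exercise workspace you would skip that step and use the provided pokemon object. Note that hclust() expects a dissimilarity structure, so the scaled matrix is passed through dist() first.

```r
# Stand-in for the course's pokemon matrix (made-up values, not the real data)
set.seed(42)
pokemon <- cbind(HitPoints = rnorm(20, mean = 70, sd = 25),
                 Attack    = rnorm(20, mean = 75, sd = 30),
                 Speed     = rnorm(20, mean = 65, sd = 28))

# View column means
colMeans(pokemon)

# View column standard deviations (MARGIN = 2 applies sd() to each column)
apply(pokemon, 2, sd)

# Scale the data so each column has mean 0 and standard deviation 1
pokemon.scaled <- scale(pokemon)

# Create hierarchical clustering model: hclust.pokemon
# dist() computes Euclidean distances between rows by default
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
```

Complete linkage is also the default method for hclust(), but the exercise asks for it to be specified explicitly, which keeps the choice visible in the code.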