Effects of scale
You have learned that when a variable is on a larger scale than the other variables in your data, it may disproportionately influence the distances calculated between your observations. Let's see this in action by observing a sample of data from the trees data set.
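To make the effect concrete before working with the trees sample, here is a minimal sketch on made-up data. The data frame toy and its columns small and large are hypothetical and are not part of the exercise. It only illustrates how the Euclidean distance between two rows ends up almost equal to their gap in the larger-scale column.

```r
# Hypothetical toy data: 'small' and 'large' are made-up columns
# on very different scales (not taken from the trees data set)
toy <- data.frame(
  small = c(1, 2, 9),
  large = c(1000, 1500, 1010)
)

# Euclidean distance between every pair of rows (the default method of dist())
dist_toy <- dist(toy)
print(dist_toy)

# Distance between rows 1 and 2, next to their gap in the 'large' column alone
d12 <- as.matrix(dist_toy)[1, 2]
gap_large <- abs(toy$large[1] - toy$large[2])
print(c(distance = d12, gap_in_large = gap_large))
```

In this sketch the small column contributes almost nothing to the distance between rows 1 and 2, even though the two rows differ in it.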
You will leverage the scale() function, which by default centers and scales each column of the data.
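As a rough illustration of what the default behaviour of scale() does, the sketch below applies it to a small made-up data frame. The name toy and its columns are hypothetical. With the defaults center = TRUE and scale = TRUE, each column has its mean subtracted and is then divided by its standard deviation.

```r
# Hypothetical toy data (made-up values, not the exercise data)
toy <- data.frame(
  small = c(1, 2, 9),
  large = c(1000, 1500, 1010)
)

# scale() with its defaults: subtract each column's mean,
# then divide by each column's standard deviation
scaled_toy <- scale(toy)
print(scaled_toy)

# After scaling, every column has mean (approximately) 0 and standard deviation 1
col_means <- colMeans(scaled_toy)
col_sds <- apply(scaled_toy, 2, sd)
print(col_means)
print(col_sds)

# The values used for centering and scaling are stored as attributes of the result
print(attr(scaled_toy, "scaled:center"))
print(attr(scaled_toy, "scaled:scale"))
```

Note that scale() returns a numeric matrix rather than a data frame; dist() accepts either.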
Our variables are the following:
- Girth - tree diameter in inches
- Height - tree height in inches
This exercise is part of the course Cluster Analysis in R.
Exercise instructions
- Calculate the distance matrix for the data frame three_trees and store it as dist_trees.
- Create a new variable scaled_three_trees where the three_trees data is centered & scaled.
- Calculate the distance matrix for scaled_three_trees and store it as dist_scaled_trees.
- Output both the dist_trees and dist_scaled_trees matrices and observe which pair of observations has the smallest distance in each (hint: the closest pair changes).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Calculate distance for three_trees
dist_trees <- ___
# Scale three trees & calculate the distance
scaled_three_trees <- ___
dist_scaled_trees <- ___
# Output the results of both matrices
print('Without Scaling')
___
print('With Scaling')
___