Effects of scale
You have learned that when a variable is on a larger scale than the other variables in your data, it may disproportionately influence the distances calculated between your observations. Let's see this in action by observing a sample of data from the trees data set.
You will leverage the scale() function, which by default centers each column on its mean and scales it by its standard deviation.
Our variables are the following:
- Girth - tree diameter in inches
- Height - tree height in inches
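As a rough illustration of what scale() does with its default arguments, the sketch below applies it to a small stand-in data frame. The name example_trees and its values are assumptions made for illustration only; they are not taken from the course's three_trees object.

```r
# Hypothetical stand-in data (assumed values): girth and height, both in inches
example_trees <- data.frame(
  Girth  = c(8.3, 8.6, 10.5),
  Height = c(840, 780, 864)
)

# scale() with defaults: subtract each column's mean,
# then divide by that column's standard deviation
scaled_example <- scale(example_trees)

# The same transformation written out by hand for one column
manual_girth <- (example_trees$Girth - mean(example_trees$Girth)) /
  sd(example_trees$Girth)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(scaled_example), 10))
print(apply(scaled_example, 2, sd))

# The hand-computed column matches the scale() result
print(all.equal(as.numeric(scaled_example[, "Girth"]), manual_girth))
```

Once both columns are on this common footing, a difference of one unit means "one standard deviation" for either variable, so neither dominates a distance calculation simply because of its units.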
This exercise is part of the course
Cluster Analysis in R
Instructions
- Calculate the distance matrix for the data frame three_trees and store it as dist_trees.
- Create a new variable scaled_three_trees where the three_trees data is centered & scaled.
- Calculate and print the distance matrix for scaled_three_trees and store this as dist_scaled_trees.
- Output both dist_trees and dist_scaled_trees matrices and observe the change of which observations have the smallest distance between the two matrices (hint: they have changed).
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Calculate distance for three_trees
dist_trees <- ___
# Scale three trees & calculate the distance
scaled_three_trees <- ___
dist_scaled_trees <- ___
# Output the results of both matrices
print('Without Scaling')
___
print('With Scaling')
___
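One possible way to fill in the template is sketched below. It assumes a stand-in three_trees data frame with illustrative values, since the course's actual object is not shown here, and it uses dist() with its default Euclidean method. The closest_raw and closest_scaled helpers are additions for this sketch, not part of the exercise.

```r
# Hypothetical stand-in for three_trees (assumed values, both columns in inches)
three_trees <- data.frame(
  Girth  = c(8.3, 8.6, 10.5),
  Height = c(840, 780, 864)
)

# Calculate distance for three_trees (Euclidean by default)
dist_trees <- dist(three_trees)

# Scale three trees & calculate the distance
scaled_three_trees <- scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)

# Output the results of both matrices
print('Without Scaling')
print(dist_trees)
print('With Scaling')
print(dist_scaled_trees)

# Position of the smallest distance in each result.
# A dist object stores pairs in the order (1,2), (1,3), (2,3).
closest_raw <- which.min(as.vector(dist_trees))
closest_scaled <- which.min(as.vector(dist_scaled_trees))
print(c(raw = closest_raw, scaled = closest_scaled))
```

With these stand-in values, the unscaled distances are driven almost entirely by Height, and the closest pair is trees 1 and 3. After scaling, Girth carries comparable weight and the closest pair becomes trees 1 and 2.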