Effects of scale

You have learned that when a variable is on a larger scale than the other variables in your data, it may disproportionately influence the distances calculated between your observations. Let's see this in action by observing a sample of data from the trees data set.

You will use the scale() function, which by default centers each column to a mean of 0 and scales it to a standard deviation of 1.
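As a minimal sketch of what those defaults do, the snippet below compares scale() against a manual "subtract the mean, divide by the standard deviation" calculation. The data frame toy and its values are hypothetical and exist only for illustration.

```r
# Hypothetical toy data: two columns on very different scales
toy <- data.frame(small = c(1, 2, 3), large = c(100, 250, 400))

# Default behaviour: center = TRUE, scale = TRUE
scaled_toy <- scale(toy)

# Manual equivalent: subtract each column's mean, divide by its standard deviation
manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

# Check that both approaches agree
all.equal(as.vector(scaled_toy), as.vector(manual))

# After scaling, every column has mean 0 and standard deviation 1
colMeans(scaled_toy)
apply(scaled_toy, 2, sd)
```

Note that scale() returns a matrix rather than a data frame, which dist() accepts directly.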

Our variables are the following:

Girth - tree diameter in inches
Height - tree height in feet

Calculate the distance matrix for the data frame three_trees and store it as dist_trees.
Create a new variable scaled_three_trees in which the three_trees data is centered and scaled.
Calculate the distance matrix for scaled_three_trees and store it as dist_scaled_trees.
Print both dist_trees and dist_scaled_trees and compare which pair of observations has the smallest distance in each matrix (hint: the closest pair changes after scaling).
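One possible way to work through these steps is sketched below. The exercise supplies three_trees for you; because the actual sample is not shown here, the row selection in this sketch is an assumption made only so that the code runs on its own.

```r
# ASSUMPTION: the real three_trees is provided by the exercise.
# These three rows of the built-in trees data are picked only for illustration.
three_trees <- trees[c(1, 4, 31), c("Girth", "Height")]

# Euclidean distance matrix on the raw values
dist_trees <- dist(three_trees)

# Center and scale each column (mean 0, standard deviation 1)
scaled_three_trees <- scale(three_trees)

# Euclidean distance matrix on the scaled values
dist_scaled_trees <- dist(scaled_three_trees)

# Print both matrices to compare which pair of trees is closest
print("Without scaling")
dist_trees

print("With scaling")
dist_scaled_trees
```

dist() uses Euclidean distance by default, so any change in which pair is closest comes from the rescaling alone.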
