Clustering on PCA results
In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.
Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also uncorrelates the variables, sometimes improving the performance of other modeling techniques.
Let's see if PCA improves or degrades the performance of hierarchical clustering.
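To find the minimum number of components you will need below, one option is to inspect the cumulative proportion of variance stored in the PCA object. Here is a minimal sketch, assuming wisc.pr was created with prcomp() as in the earlier exercises:

# Cumulative proportion of variance explained by each component
summary(wisc.pr)$importance["Cumulative Proportion", ]

# Index of the first component at which cumulative variance reaches 90%
which(summary(wisc.pr)$importance["Cumulative Proportion", ] >= 0.9)[1]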
This exercise is part of the course Unsupervised Learning in R.
Exercise instructions
wisc.pr, diagnosis, wisc.hclust.clusters, and wisc.km are still available in your workspace.
- Using the minimum number of principal components required to describe at least 90% of the variability in the data, create a hierarchical clustering model with complete linkage. Assign the results to wisc.pr.hclust.
- Cut this hierarchical clustering model into 4 clusters and assign the results to wisc.pr.hclust.clusters.
- Using table(), compare the results from your new hierarchical clustering model with the actual diagnoses. How well does the newly created model with four clusters separate out the two diagnoses?
- How well do the k-means and hierarchical clustering models you created in previous exercises do in terms of separating the diagnoses? Again, use the table() function to compare the output of each model with the vector containing the actual diagnoses.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- ___(dist(wisc.pr$___[, ___:___]), method = ___)
# Cut model into 4 clusters: wisc.pr.hclust.clusters
# Compare to actual diagnoses
# Compare to k-means and hierarchical
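For reference, a hedged solution sketch follows. The choice of the first 7 principal components is an assumption based on reaching roughly 90% cumulative variance in the earlier exercises; replace it with whatever your own variance check returns.

# Create a hierarchical clustering model on the first 7 PCs: wisc.pr.hclust
# (7 is assumed here; use the number of PCs covering at least 90% of the variance)
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

# Compare the PCA-based clusters to the actual diagnoses
table(wisc.pr.hclust.clusters, diagnosis)

# Compare the earlier k-means and hierarchical models to the diagnoses
table(wisc.km$cluster, diagnosis)
table(wisc.hclust.clusters, diagnosis)

In each table, rows are cluster assignments and columns are the actual diagnoses, so a model that separates the diagnoses well will concentrate the benign and malignant cases in different rows.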