Selecting the number of clusters

In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn't available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.

When performing supervised learning—that is, when you're trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.

This exercise is part of the course Unsupervised Learning in R.

Exercise instructions

wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust are available in your workspace.

  • Use cutree() to cut the tree so that it has 4 clusters. Assign the output to the variable wisc.hclust.clusters.
  • Use the table() function to compare the cluster membership to the actual diagnoses.
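The steps above can be sketched on synthetic data. This is a minimal, self-contained illustration of the same workflow, not the workspace objects from the exercise: the data, labels, and variable names below are invented for demonstration.

```r
# Synthetic data: two well-separated groups with known labels (stand-ins
# for the real wisc.data features and diagnosis vector)
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
labels <- rep(c("B", "M"), each = 25)

# Hierarchical clustering on Euclidean distances
hc <- hclust(dist(x), method = "complete")

# Cut the dendrogram into 4 clusters, as in the exercise
clusters <- cutree(hc, k = 4)

# Cross-tabulate cluster membership against the known labels
table(clusters, labels)
```

Each row of the resulting table is a cluster and each column a label; a cluster dominated by one label suggests the clustering captures that group well.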

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)