PCA review and next steps

1. PCA review and next steps

Before moving on, let me quickly review the analysis thus far and get you on the way to the final steps.

2. Review thus far

So far in this chapter you have completed two steps typical of unsupervised analysis: downloading the data and performing some basic exploratory analysis. These are steps you will have to do in any machine learning work. You also completed a rather detailed principal component analysis and learned a bit about some latent, or unseen, variables that might exist in the observations.

3. Next steps

As a reminder, the next (and final) steps in this particular analysis are to complete two types of clustering on the data, and to combine PCA and clustering together. The first exercise, on hierarchical clustering, also has you compare the results of the clustering to the diagnosis -- if you were doing supervised learning, this step would give you insight into whether the clusters would be useful features. Next, there is a comparison of the results of the two types of clustering; this type of work is done to contrast the two algorithms and see whether they produce similar or different sub-groupings. Finally, you'll combine PCA and clustering. PCA is often used as a preprocessing step for other types of machine learning -- used that way, it acts as a form of regularization that helps avoid overfitting the data. In a coming exercise, you will see how PCA affects the results of clustering.
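As a rough sketch of that last step, PCA output can be fed directly into a clustering function. This is only an illustration, not the exercise code: the data matrix name (`wisc.data`) and the choice of how many components to keep are assumptions here.

```r
# Illustrative sketch only; 'wisc.data' stands in for the numeric data matrix
# used in this chapter, and the number of components kept is arbitrary.
pr.out <- prcomp(wisc.data, scale. = TRUE)

# Cluster on the first few principal components instead of the raw variables
km.pr <- kmeans(pr.out$x[, 1:7], centers = 2, nstart = 20)
table(km.pr$cluster)
```

Clustering on a handful of components rather than all of the original variables is what provides the regularization effect mentioned above.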

4. Review: hierarchical clustering in R

Just some quick reminders on hierarchical clustering: The R function for hierarchical clustering is hclust. hclust takes a matrix of the pairwise distances between observations as its input. You will continue to use Euclidean distance for this exercise.
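Putting those reminders together, the typical pattern looks like this. The variable name `data.scaled` is a placeholder for whatever scaled data matrix you are working with:

```r
# 'data.scaled' is a placeholder for the scaled data matrix.
# dist() computes the pairwise Euclidean distances hclust() expects.
data.dist <- dist(data.scaled, method = "euclidean")
hclust.out <- hclust(data.dist)

plot(hclust.out)           # view the dendrogram
cutree(hclust.out, k = 4)  # cut the tree into 4 clusters
```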

5. Review: k-means in R

And to do k-means in R, use the kmeans function. The kmeans function takes a matrix of the data, the same matrix you prepared earlier in this chapter. kmeans also requires the number of clusters to be chosen before the algorithm is run; this is specified using the centers parameter. Recall that the kmeans algorithm has a stochastic, or random, aspect. To improve the chances of finding a global minimum, kmeans is run repeatedly, keeping the 'best' result -- as measured by total within-cluster sum of squares -- from all the runs. The number of times kmeans is run is specified by the nstart parameter.
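In code, those two parameters look like this. Again, `data.scaled` is a placeholder name, and the choice of 2 clusters is an assumption for illustration:

```r
# 'data.scaled' is a placeholder for the data matrix prepared earlier.
# centers = number of clusters; nstart = number of random restarts,
# of which kmeans keeps the run with the lowest total within-cluster
# sum of squares.
km.out <- kmeans(data.scaled, centers = 2, nstart = 20)

km.out$tot.withinss    # total within-cluster sum of squares of the best run
table(km.out$cluster)  # cluster sizes
```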

6. Let's practice!

Ok, let's get started.