1. Introduction to the case study
Thus far you have learned how to do two types of clustering, k-means and hierarchical, and one type of dimensionality reduction, principal component analysis.
2. Objectives
This chapter is a little different: I am going to guide you through a complete analysis using the unsupervised learning techniques you have just learned. I have three reasons for doing this: one, to reinforce what you have already learned; two, to add a few steps not covered before, such as getting and preparing the data and checking whether the results of unsupervised learning would make good features for supervised learning; and three, to emphasize the creativity required to be successful at unsupervised learning.
3. Example use case
The dataset you will be using in this analysis was published in a paper by Bennett and Mangasarian. Their data consists of measurements of cell nuclei from human breast masses. Each observation, or row, represents a single mass, or group of cells, and contains ten features. Each feature is a summary statistic of the measurements taken from the cells in that mass.
There is also a target variable, or label, in the dataset. The label would be used if you were building models with supervised learning -- it will not be used for modeling during this analysis.
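To make the first step concrete, here is a minimal sketch of how the data might be loaded and prepared in R. The file name and column names below are assumptions for illustration; the exercises will point you to the actual data.

# Load the data; "wisc_data.csv" is a hypothetical local copy of the dataset
wisc_df <- read.csv("wisc_data.csv", stringsAsFactors = FALSE)

# Keep only the numeric measurements as a matrix for prcomp(), kmeans(), and dist();
# the "id" and "diagnosis" column names are assumed here
wisc_data <- as.matrix(wisc_df[, !(names(wisc_df) %in% c("id", "diagnosis"))])

# Store the label separately -- it is not used for modeling in this analysis
diagnosis <- as.numeric(wisc_df$diagnosis == "M")  # assumed coding: "M" = malignant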
4. Analysis
At a high level you will complete six steps during this analysis: downloading and preparing the data for modeling; doing some high-level exploratory analysis; performing principal component analysis and using visualizations and other mechanisms to interpret the results; completing two types of clustering; understanding and comparing the two types; and finally, combining principal component analysis as a preprocessing step for clustering.
During the coding exercises, you will be guided through each step.
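As a rough preview of the clustering steps -- a minimal sketch only, assuming the prepared feature matrix is called wisc_data and that two clusters are of interest -- the later steps might look something like this in R:

# k-means on the scaled data; the choice of 2 centers is an assumption
km_out <- kmeans(scale(wisc_data), centers = 2, nstart = 20)

# hierarchical clustering on Euclidean distances, cut to the same number of clusters
hc_out <- hclust(dist(scale(wisc_data)), method = "complete")
hc_clusters <- cutree(hc_out, k = 2)

# cross-tabulate the two sets of cluster assignments to compare them
table(km_out$cluster, hc_clusters)

# PCA as a preprocessing step: cluster on the first few principal components
pr_out <- prcomp(wisc_data, center = TRUE, scale. = TRUE)
hc_pr <- hclust(dist(pr_out$x[, 1:7]), method = "complete")  # number of PCs kept is an assumption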
5. Review: PCA in R
The exercises immediately following this video include completing principal component analysis on the data. As a reminder, the function in R for principal component analysis is prcomp(), which takes as input a matrix of the data, with one observation per row and one feature per column, plus options for centering and scaling the data. As covered earlier, if the features use different scales or units of measure, centering and scaling the data before performing PCA can improve the results of the analysis. And don't forget that calling the summary() function on the output of prcomp() provides important information about the amount of variability explained by each principal component.
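For example, under the same assumption that the prepared feature matrix is called wisc_data, that reminder translates to roughly the following:

# PCA with the features centered and scaled
pr_out <- prcomp(wisc_data, center = TRUE, scale. = TRUE)

# summary() reports the standard deviation and proportion of variance
# explained by each principal component
summary(pr_out)

# the proportion of variance explained can also be computed directly
pve <- pr_out$sdev^2 / sum(pr_out$sdev^2)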
6. Unsupervised learning is open-ended
The analysis you are going to step through is only one path that could have been taken. As you complete this chapter, give some thought to what other approaches you might take when presented with an unsupervised learning analysis.
7. Let's practice!
I hope you'll have fun with the next exercises. We'll help you with hints and templates along the way.