1. Introduction to PCA
2. Two methods of clustering
Thus far in this course you have learned about two techniques for performing clustering to find similar subgroups within an overall population.
The next type of unsupervised machine learning to be covered is dimensionality reduction. Dimensionality reduction has two main goals: to find structure within features and to aid in visualization.
3. Dimensionality reduction
In this course, I will be covering one particular and popular method of dimensionality reduction -- principal components analysis.
Principal components analysis has three goals.
First, PCA will find a linear combination of the original features. A linear combination just means multiplying some or all of the features by weights and adding the results together. These new features are called principal components.
Second, in those new features, PCA will maintain as much variance as it can from the original data for a given number of principal components.
Third, the new features are uncorrelated, or orthogonal to each other.
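As a quick sketch of all three goals in R -- using the built-in mtcars data purely for illustration, not the data from the slides:

# Fit PCA to the built-in mtcars data (all columns are numeric).
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Goal 1: the weights of each linear combination are stored in the
# rotation matrix, one column per principal component.
pca$rotation[, 1:2]

# Goal 2: summary() shows how much of the original variance each
# component retains.
summary(pca)

# Goal 3: the component scores in pca$x are uncorrelated -- the
# off-diagonal correlations are numerically zero.
round(cor(pca$x), 3)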
4. PCA intuition
As with clustering methods, let's build up some intuition about how principal components analysis achieves these goals by studying the simplest example possible.
Here we show a data set with two dimensions, or features -- x on the horizontal axis and y on the vertical axis.
Each point on the plane represents one observation in the data set.
The goal of PCA is to find a lower-dimensional representation of this data that retains as much of the original variance as possible. Because the original data has two dimensions, the lower-dimensional representation will have just one dimension, or feature. That is, PCA will help us map this data from 2 features to 1 feature while maintaining as much of the data's variability as possible.
5. PCA intuition
The first step is to fit a line through the data. Unlike an ordinary regression line, which minimizes vertical distances to the points, this line is chosen to minimize the residual error when the points are projected onto it.
This line is the first principal component of this data.
6. PCA intuition
Now we have a new dimension along the line. Each point is then projected onto the line. These projected values on the new dimension are sometimes referred to as the component scores, or the factor scores.
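Here is a minimal sketch of that projection in R, with simulated data standing in for the points on the slide:

set.seed(1)
# Simulate two correlated features, like the two-dimensional example.
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
dat <- cbind(x = x, y = y)

pca <- prcomp(dat)

# The component (factor) scores live in pca$x; the first column is
# each observation projected onto the first principal component.
head(pca$x[, 1])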
7. Visualization of high dimensional data
Principal components analysis is also often used to aid in the visualization of high-dimensional data. With more than 3 or 4 features, it is difficult to develop an effective visualization for communication, and such plots can put a high cognitive load on their consumers.
8. Visualization
Here I am showing the well-known iris data set. The iris data set consists of four physical measurements of a population of three different types of iris flowers.
On the left-hand side, each of the four variables is scatter-plotted against each of the other three variables. The different flower types are plotted in different colors.
The right-hand side is the result of using PCA to map the four original variables to one variable. There, the component scores of the original data are shown, using only the first principal component from the iris data set. The first principal component maintains 92% of the variability of the original data.
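A figure along these lines can be sketched in base R; note that the 92% figure corresponds to running prcomp() on the unscaled iris measurements, and the exact plots on the slide may have been produced differently:

# Left-hand side: pairwise scatter plots of the four measurements,
# with the three species shown in different colors.
pairs(iris[, 1:4], col = iris$Species)

# Right-hand side: map the four variables to principal components.
pca <- prcomp(iris[, 1:4])  # unscaled; PC1 retains roughly 92%
summary(pca)

# Plot the first-component scores, colored by species.
plot(pca$x[, 1], col = iris$Species,
     ylab = "First principal component score")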
9. PCA in R
Creating a PCA model in R uses the prcomp() function. The first parameter to this function is the original data, with one observation per row and one feature per column, as in other machine learning applications.
The scale parameter indicates whether the data should be scaled to a standard deviation of one before performing PCA -- more on this in a later video. The center parameter indicates whether the data should be centered around zero before performing PCA. I highly recommend leaving this parameter as TRUE.
Finally, calling summary() on the output of prcomp() reports the variance explained by each principal component, the cumulative proportion of variance explained as each principal component is added, and the standard deviations of the principal components.
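Putting the pieces together, a typical call might look like the following -- the iris measurements here are just a stand-in for your own data:

# One observation per row, one feature per column.
pr.out <- prcomp(iris[, 1:4], scale = TRUE, center = TRUE)

# Standard deviation, proportion of variance, and cumulative
# proportion of variance for each principal component.
summary(pr.out)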
Again, you are encouraged to look at the R documentation for the prcomp() function when you are ready to dig deeper into its options and results.
10. Let's practice!
OK, enough talk -- time to create your first principal components model!