Principal components analysis (PCA)
1. Principal components analysis (PCA)
Principal components analysis (or PCA) is one of my favorite preprocessing steps for linear regression models. You'll notice that I used it as an example in many of the previous videos.

2. Principal components analysis
PCA is incredibly useful because it combines all the low-variance and correlated variables in your dataset into a single set of high-variance, perpendicular predictors. As we saw before, low-variance variables can be problematic for cross-validation, but can also contain useful information. It's better to find a systematic way to use that information than to throw it away. Furthermore, perpendicular predictors are useful because they are perfectly uncorrelated. Linear regression models have trouble with correlation between variables (also known as collinearity), and PCA elegantly removes this issue from the equation.

3. PCA: a visual representation
PCA searches for high-variance linear combinations of the input data that are perpendicular to each other. The first PCA component is the axis of highest variance in the original dataset. The second component has the second-highest variance, and so on. This diagram illustrates how PCA works. We have two correlated variables, x and y. When plotted together, we can see their relationship. PCA transforms the data with respect to this correlation and finds a new variable (the long diagonal arrow pointing up and to the right) that reflects the shared correlation of x and y. After finding the first PCA component, the second component is constrained to be perpendicular to it; that's the second arrow, going up and to the left. In other words, the first PCA component reflects the similarity between x and y, while the second emphasizes their difference. This idea is easy to illustrate in two dimensions, but it extends to any number of dimensions.

4. Example: blood-brain data
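Here's a minimal sketch of that two-variable picture, using base R's prcomp(). The simulated x and y are an illustrative assumption, not data from the course; they just need to be correlated:

# Simulate two correlated variables, like x and y in the diagram
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)

# PCA on the centered, standardized data
pca <- prcomp(cbind(x, y), center = TRUE, scale. = TRUE)

# PC1 captures the shared variation (the long diagonal arrow);
# PC2 captures the remaining, perpendicular variation
summary(pca)   # proportion of variance explained by each component
pca$rotation   # loadings: the directions of the new axes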
Let's take a look at the blood-brain dataset, which contains lots of predictors, many of which are low-variance. We can use the nearZeroVar function from the caret package to identify these variables.

5. Example: blood-brain data
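As a sketch of that step: the BloodBrain data ships with caret, loading a data frame of predictors (bbbDescr) and an outcome vector (logBBB).

library(caret)
data(BloodBrain)   # loads bbbDescr (predictors) and logBBB (outcome)

# Column indices of near-zero-variance predictors
low_var_cols <- nearZeroVar(bbbDescr)
length(low_var_cols)   # how many predictors are flagged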
We can start by just removing the zero-variance predictors from the dataset with the "zv" argument, prior to modeling. This yields some warnings, but no errors, and our models run successfully.

6. Example: blood-brain data
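A sketch of this model, assuming 10-fold cross-validation (the exact trainControl settings aren't specified in the video):

# Linear model, dropping only zero-variance predictors
model_zv <- train(
  x = bbbDescr, y = logBBB,
  method = "lm",
  preProcess = "zv",
  trControl = trainControl(method = "cv", number = 10)
)
model_zv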
Next, we can try removing low-variance variables with the "nzv" argument. This gets rid of all the warnings and yields slightly better accuracy.

7. Example: blood-brain data
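The same sketch, removing near-zero-variance predictors instead:

# Linear model, dropping near-zero-variance predictors as well
model_nzv <- train(
  x = bbbDescr, y = logBBB,
  method = "lm",
  preProcess = "nzv",
  trControl = trainControl(method = "cv", number = 10)
)
model_nzv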
Finally, we can do PCA on the full dataset, removing only the zero-variance predictors, which contain no information. This gives the best results, because we include the low-variance predictors in the model, but combine them together in an intelligent way using PCA.
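And a sketch of that final model, chaining "zv" and "pca" (caret centers and scales the data automatically when "pca" is requested):

# Drop zero-variance predictors, then combine the rest with PCA
model_pca <- train(
  x = bbbDescr, y = logBBB,
  method = "lm",
  preProcess = c("zv", "pca"),
  trControl = trainControl(method = "cv", number = 10)
)
model_pca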
8. Let’s practice!