
Practical issues with PCA

1. Practical issues with PCA

Before wrapping up PCA, there are some practical issues with using it on real-world data that are helpful to know.

2. Practical issues with PCA

Three items need to be considered to complete a successful principal components analysis. The first is scaling the data, which I'll cover in more detail in just a moment. The second is what to do with observations that have missing data in one or more of the features. There are many ways to address this issue; one of the simplest is to not include, or drop, observations with missing data. A more complex approach is to estimate, or impute, the missing values. While I will not go into more detail on this item, I want you to be aware of these strategies. The third practical matter is how to handle features that are categorical, that is, features that are not numerical. The first strategy is the simplest: do not include the categorical features in modeling. The second strategy is more involved and requires using one of many methods to encode the categorical features as numbers. This is more detail than I want to cover in this introductory course, but I want to make sure you are aware of these options if the situation presents itself.
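As a rough illustration of the strategies just mentioned, here is a minimal sketch in base R. The `cars` data frame and its values are made up for this example, mean imputation is only one of many possible imputation methods, and `model.matrix` is just one way to encode categories as numbers.

```r
# Hypothetical toy data: two numeric features, one categorical, some NAs.
cars <- data.frame(
  mpg  = c(21.0, NA, 22.8, 18.7),
  hp   = c(110, 110, NA, 175),
  gear = factor(c("four", "four", "three", "three"))
)

# Categorical features, simplest strategy: keep only the numeric columns.
numeric_only <- cars[, sapply(cars, is.numeric), drop = FALSE]

# Categorical features, more involved strategy: encode as 0/1 indicator columns.
encoded <- model.matrix(~ gear - 1, data = cars)

# Missing data, strategy 1: drop observations with any missing value.
dropped <- na.omit(numeric_only)

# Missing data, strategy 2: impute each missing value with its column mean.
imputed <- numeric_only
for (col in names(imputed)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- mean(imputed[[col]], na.rm = TRUE)
}
```

Dropping rows is simple but discards data; imputation keeps every observation at the cost of introducing estimated values.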

3. mtcars dataset

Now let's dig into the importance of scaling. Here we will look at the mtcars dataset. This dataset has information about various car models, things like miles per gallon, horsepower, and number of cylinders. The features are measured in different units.

4. Scaling

As with clustering, when features are measured in different units or scales, it is often necessary to normalize the data: center each feature by subtracting its mean, then scale it by dividing by its standard deviation. Here, the means and the standard deviations of the features vary quite a bit, indicating that centering and scaling the data are in order before performing PCA.
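The check described here can be sketched in a few lines of base R. The variable names are my own; `mtcars` ships with R, and `scale()` is the base function that centers and scales each column.

```r
# Per-feature means and standard deviations of mtcars.
feature_means <- colMeans(mtcars)
feature_sds   <- apply(mtcars, 2, sd)
round(feature_means, 2)
round(feature_sds, 2)

# Center (subtract each column mean) and scale (divide by each column sd).
scaled <- scale(mtcars, center = TRUE, scale = TRUE)

# After scaling, every feature has mean ~0 and standard deviation 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```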

5. Importance of scaling data

One way to understand the importance of scaling the data is to compare biplots of the mtcars data with and without scaling. In the example on the left-hand side, without scaling, displacement and horsepower have the largest loadings in the first two principal components, with all the other features being overwhelmed. This is because those two features have the most variance in the data, but that is true only because they are measured in different units from the other features. On the right-hand side is a biplot of the same original data, but with scaling performed before doing PCA. This biplot now shows a more even distribution of the loading vectors.
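A comparison along these lines can be reproduced roughly as follows. This is a sketch rather than the exact code behind the slide; the object names and plot titles are my own.

```r
# PCA on mtcars without and with scaling (both centered).
pr_unscaled <- prcomp(mtcars, center = TRUE, scale. = FALSE)
pr_scaled   <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Loadings on the first principal component, for comparison.
round(pr_unscaled$rotation[, 1], 2)
round(pr_scaled$rotation[, 1], 2)

# Side-by-side biplots: unscaled on the left, scaled on the right.
op <- par(mfrow = c(1, 2))
biplot(pr_unscaled, main = "Without scaling")
biplot(pr_scaled, main = "With scaling")
par(op)
```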

6. Scaling and PCA in R

Unlike the clustering algorithms, prcomp in R can perform centering and scaling directly as part of the principal components algorithm. Two parameters control this: center and scale (written scale. with a trailing dot in the function's signature). Setting each to TRUE or FALSE turns the corresponding step on or off. By default, prcomp centers the data but does not scale it.
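To make this concrete, the sketch below compares letting prcomp do the centering and scaling with scaling the data beforehand using `scale()`. The object names are illustrative; both routes should give the same component standard deviations.

```r
# Let prcomp center and scale internally.
pr_builtin <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Equivalent route: scale first, then run prcomp with both steps turned off.
pr_manual <- prcomp(scale(mtcars), center = FALSE, scale. = FALSE)

# The standard deviations of the principal components match.
all.equal(pr_builtin$sdev, pr_manual$sdev)

# Proportion of variance explained by each component.
summary(pr_builtin)
```

Doing the scaling inside prcomp keeps the centering and scaling values on the fitted object (`pr_builtin$center` and `pr_builtin$scale`), which is convenient when projecting new observations with `predict()`.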

7. Let's practice!

The coming exercises will help you practice what you've learned.