Choosing the right number of principal components

1. Choosing the right number of principal components

So far, 16 components have been extracted, which equals the number of variables. Since the goal of PCA is dimension reduction, we have to decide how many components to keep. On the one hand, we don't want to keep too many components; on the other hand, we want to retain a sufficiently large share of the original variance. Since there are different approaches to this decision, it's often useful to apply several of them and settle on an answer somewhere in the middle.

2. No. relevant components: explained variance

One way to decide on the number of components is to require that a minimum share of the overall variance is explained. We read this off the output of the `summary()` function applied to the `prcomp` object. One rather arbitrary threshold is a proportion of about 70%. In this case, the minimum number of components for which the `Cumulative Proportion` exceeds 0.7 is five.
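As a minimal sketch of this criterion, the cumulative proportion can also be read programmatically instead of by eye. The lesson's dataset is not shown here, so the built-in `mtcars` data is used as a stand-in (an assumption), and the names `pca`, `cum_prop` and `n_keep` are illustrative only.

```r
# Stand-in data: mtcars is an assumption, not the dataset used in this lesson
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# summary() on a prcomp object returns an "importance" matrix whose rows are
# "Standard deviation", "Proportion of Variance" and "Cumulative Proportion"
cum_prop <- summary(pca)$importance["Cumulative Proportion", ]
print(cum_prop)

# Smallest number of components whose cumulative proportion exceeds 0.7
n_keep <- unname(which(cum_prop > 0.7)[1])
print(n_keep)
```

Changing the `0.7` in the comparison is all it takes to try a different threshold, which is worth doing given how arbitrary the 70% cut-off is.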

3. No. relevant components: Kaiser-Guttman criterion

Another criterion is the Kaiser-Guttman criterion. Here, you only keep components with an eigenvalue larger than 1. The justification sounds reasonable: when the variables are standardized, each one contributes a variance of 1, so an eigenvalue smaller than 1 means that the component covers less variance than a single variable contributes. Consequently, such a component does not really help to reduce dimensionality. Here, the Kaiser-Guttman criterion suggests keeping six components. Let's see if the next method is in line with that.
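The criterion can be sketched in a few lines, since the eigenvalues are the squared standard deviations stored in the `prcomp` object. Again, `mtcars` is only a stand-in for the lesson's data (an assumption), and `eigenvalues` and `n_kaiser` are illustrative names.

```r
# Stand-in data: mtcars is an assumption, not the dataset used in this lesson.
# scale. = TRUE standardizes the variables, so each contributes a variance of 1.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# The eigenvalues are the variances of the components, i.e. sdev squared
eigenvalues <- pca$sdev^2
print(round(eigenvalues, 3))

# Kaiser-Guttman: count the components with an eigenvalue larger than 1
n_kaiser <- sum(eigenvalues > 1)
print(n_kaiser)
```

With standardized variables the eigenvalues sum to the number of variables, which is why 1 is the natural benchmark for "one variable's worth" of variance.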

4. No. relevant components: screeplot

The screeplot is a graphical method for deciding on the number of relevant components. It is drawn by the `screeplot()` function from the `stats` package. We use the `box()` function to draw a box around the plot, and then draw a horizontal line at a variance of one using the `abline()` function. The screeplot displays the variances (that is, the eigenvalues) of all components in descending order. You look for an elbow, the point where the curve flattens out, and drop all components to the right of it. In this screeplot, we would keep six components (or maybe only two). To sum up, the explained-variance criterion suggested five components, Kaiser-Guttman six, and the screeplot six as well, so let's go for six!
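The plotting steps described above might look roughly like this. The `mtcars` data is a stand-in for the lesson's dataset (an assumption), and the `type = "lines"`, `main` and `lty` settings are choices made for this sketch rather than something the lesson specifies.

```r
# Stand-in data: mtcars is an assumption, not the dataset used in this lesson
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Plot the variances (eigenvalues) of all components in descending order;
# npcs is set explicitly because screeplot() shows at most 10 by default
screeplot(pca, npcs = length(pca$sdev), type = "lines", main = "Screeplot")

# Draw a box around the plot
box()

# Horizontal reference line at a variance of one (the Kaiser-Guttman cut-off)
abline(h = 1, lty = 2)
```

Drawing the line at one makes it possible to compare the elbow with the Kaiser-Guttman cut-off in a single picture.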

5. Suggested number of components by criterion

6. The biplot

There is a nice plot that visualizes how the variables and the components relate to each other: the so-called biplot. The axes of the plot are two principal components, the arrows represent the variables, and the numbers represent single observations. For this plot, we chose components 1 and 2 via the `choices` argument and scaled the text size down by a factor of 0.7 to get a better overview (that's the `cex` argument). We see that the first principal component covers the most variance of the observations: they are scattered most along the x-axis. The data cloud is already less scattered along the direction of the second principal component, the y-axis. If we could picture a plot with even more dimensions, this pattern would repeat with each additional component. If an arrow is nearly parallel to a component's axis, the variable loads highly on that component.
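A minimal sketch of such a biplot call, using the `choices` and `cex` arguments mentioned above. The `mtcars` data is a stand-in for the lesson's dataset (an assumption); its rows have names, so the observations appear as car names rather than numbers. The `loadings` object is an illustrative addition for checking the arrows against the numbers.

```r
# Stand-in data: mtcars is an assumption, not the dataset used in this lesson
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Biplot of components 1 and 2, with text scaled down by a factor of 0.7
biplot(pca, choices = 1:2, cex = 0.7)

# Loadings of each variable on the two plotted components; a large absolute
# value on a component goes with an arrow nearly parallel to that axis
loadings <- pca$rotation[, 1:2]
print(round(loadings, 2))
```

Passing a different pair to `choices`, such as `c(1, 3)`, plots other components against each other.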

7. Hands on!

Now let's move on to some exercises.
