1. Choosing the number of components
In this video, we will focus on the important task of choosing the number of components to retain.
2. Summary of princomp object
Recall that the princomp() function gave us 9 PCs for the 9-dimensional mtcars.sub dataset. Also, notice that the leading PCs explain a major chunk of the variation present in the dataset. So, if we want to use fewer than 9 variables, one option is to choose the first few PCs that explain the majority of the variation in the data.
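As a reminder of where that summary comes from, here is a minimal sketch of the fit. The construction of mtcars.sub and the cor = TRUE setting are assumptions made for illustration; this video does not restate how the subset or the PCA object was built.

```r
# Assumption: mtcars.sub is mtcars without the two binary columns (vs, am),
# which leaves 9 numeric variables. The video does not spell this out.
mtcars.sub <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]

# Assumption: PCA on the correlation matrix (cor = TRUE), so that variables
# measured on very different scales contribute comparably.
cars.pca <- princomp(mtcars.sub, cor = TRUE)

# Prints three rows per component: standard deviation, proportion of
# variance, and cumulative proportion.
summary(cars.pca)
```

With 9 input variables, princomp() returns 9 components, one column per component in the printed summary.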
3. Using the scree plot
There are two methods for choosing the number of components to retain. The first is based on the proportion of variation explained by each component, which can be visualized with the screeplot() function.
The ideal pattern is a steep curve, followed by a bend, and then a straight line. The steep curve shows that the first few PCs are far more important than the trailing PCs that make up the flat, straight portion.
Thus, we should retain all the components in the steep part of the curve and ignore the PCs corresponding to the flat line.
Here, we could choose 4 components, since the line before that component is steep, and after component 4 the line is relatively flat.
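A scree plot like the one described can be drawn along these lines. This is only a sketch: the definition of mtcars.sub and the cor = TRUE argument are assumptions, and the exact shape of the curve depends on them.

```r
# Assumption: same hypothetical setup as before (mtcars without vs and am,
# PCA on the correlation matrix).
mtcars.sub <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]
cars.pca <- princomp(mtcars.sub, cor = TRUE)

# type = "lines" draws the variances as a connected line, which makes the
# bend easier to spot than the default bar plot.
# The y-axis shows each component's variance (sdev^2); dividing by the total
# variance would rescale the axis but not change the shape of the curve.
screeplot(cars.pca, type = "lines", main = "Scree plot of cars.pca")
```

We then look for the component at which the line stops dropping sharply and flattens out.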
4. Cumulative variance explained
The second method is designed to achieve a predetermined value for the cumulative proportion of variation explained by the chosen components.
Recall that when we applied the summary() function to cars.pca, it gave us the cumulative proportion of variation explained in the third row.
Let’s find out how many components are needed to explain 90 percent of the variation. Here the answer is 3 since 2 components explain less than 90 percent and 3 components explain more than 90 percent of the variation.
It is tedious to read off these numbers from the R output, so instead, for this second method, we will use a graphical technique based on the output of the cumulative proportion row.
5. Calculating cumulative proportional variance
Unfortunately, there is no easy way to extract the cumulative proportion of variation explained from the output of the summary() function. So we need to calculate these values directly from cars.pca, the object returned by princomp().
First, we calculate the variance each component explains by squaring cars.pca$sdev, which contains the standard deviation of each component. Then we calculate the proportional variance by dividing each variance by the sum of all the variances.
Applying the cumsum() function to the proportional variance, we obtain the cumulative proportion of variance, which should match the third row of the summary output.
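The three steps just described can be sketched as follows. The variable names pc.var, pc.pvar and cum.pvar are illustrative, and the way mtcars.sub and cars.pca are built here is an assumption.

```r
# Assumption: mtcars.sub drops vs and am; PCA uses the correlation matrix.
mtcars.sub <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]
cars.pca <- princomp(mtcars.sub, cor = TRUE)

# Step 1: variance explained by each component = squared standard deviation.
pc.var <- cars.pca$sdev^2

# Step 2: proportion of the total variance explained by each component.
pc.pvar <- pc.var / sum(pc.var)

# Step 3: running total of those proportions; the last value is 1.
cum.pvar <- cumsum(pc.pvar)

# Compare with the "Cumulative Proportion" row printed by summary(cars.pca).
round(cum.pvar, 3)
```

Because the components are ordered by decreasing variance, the cumulative values rise quickly at first and then level off toward 1.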
If we want to find out how many components are needed to explain the same 90 percent of the variation, we can plot the cumulative variance values and use the abline() function to add a horizontal line to the plot at 0.9.
6. Calculating cumulative proportional variance
The first point on the cumulative proportional variation explained graph that lies above the 0.9 line corresponds to 3 components. This tells us that we need at least three components to explain 90 percent of the variation.
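The plot and the horizontal line can be sketched as below, together with a which() lookup that finds the first point above the line without reading it off the graph. The lookup is an addition for illustration and is not part of the video; the setup of mtcars.sub and cars.pca is again an assumption, so the count this sketch reports is not guaranteed to match the 3 quoted in the video.

```r
# Assumption: mtcars.sub drops vs and am; PCA uses the correlation matrix.
mtcars.sub <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]
cars.pca <- princomp(mtcars.sub, cor = TRUE)

# Cumulative proportion of variance, computed from the standard deviations.
pc.var <- cars.pca$sdev^2
cum.pvar <- cumsum(pc.var / sum(pc.var))

# Plot the cumulative proportion against the number of components.
plot(cum.pvar, type = "b",
     xlab = "Number of components",
     ylab = "Cumulative proportion of variance")

# Horizontal reference line at the 90 percent target.
abline(h = 0.9, lty = 2)

# Index of the first component whose cumulative proportion reaches 0.9.
n.comp <- which(cum.pvar >= 0.9)[1]
n.comp
```

Changing the 0.9 in both abline() and which() to another target, such as 0.8, repeats the same check for a different threshold.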
7. Let's practice using these techniques!
Now it's your turn to use these techniques to choose the number of components.