
PCA applications

1. PCA applications

When you use PCA for dimensionality reduction, you decide how much of the explained variance you're willing to sacrifice. However, one downside of PCA is that the remaining components can be hard to interpret.

2. Understanding the components

To improve your understanding of the components, it can help to look at the components_ attribute. This tells us to what extent each component's vector is affected by a particular feature. The features with the biggest positive or negative effects on a component can then be used to assign a meaning to that component. In the example shown here, the effects of the features on the first component are positive and equally strong at 0-point-71. So the first component is affected just as much by hand length as by foot length. However, the second component is negatively affected by hand length. So people who score high for the second component have short hands compared to their feet.
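A minimal sketch of inspecting this attribute, assuming a DataFrame df with two illustrative columns, hand_length and foot_length (placeholder names for the two features in the example):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Select the two example features (placeholder column names)
X = df[['hand_length', 'foot_length']]

# Scale first, then fit PCA on the standardized data
X_std = StandardScaler().fit_transform(X)
pca = PCA()
pca.fit(X_std)

# Each row is one component's vector of feature effects,
# e.g. roughly [[0.71, 0.71], [-0.71, 0.71]] in this two-feature case
print(pca.components_)
```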

3. PCA for data exploration

When we apply this technique to the combined male-female ANSUR dataset, we find that the first component is mostly affected by overall body height. To verify this, I've added body height categories to the data. When we plot the first two components and color the points with these categories, we can see that they pretty much align with the X-axis or first principal component. We therefore learn that the most important source of variance in this dataset has something to do with how tall a person is.

4. PCA in a pipeline

Let's look at the code to create this plot. Since we always scale the data before applying PCA, we can combine both operations in a pipeline. We pass the two operations to the Pipeline() class in the form of two tuples inside a list. Within each tuple, we give the operation a name: 'scaler' and 'reducer' in this example. We can then fit and transform the data in one go.
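A sketch of such a pipeline, assuming the numeric ANSUR features are stored in a DataFrame called ansur_df (a placeholder name):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# A list of (name, operation) tuples defines the pipeline steps
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reducer', PCA()),
])

# Scale and apply PCA to the numeric ANSUR features in one go
pc = pipe.fit_transform(ansur_df)
```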

5. Checking the effect of categorical features

Our ANSUR dataset has a number of categorical features. PCA is not the preferred algorithm to reduce the dimensionality of categorical datasets, but we can check whether these categorical features align with the most important sources of variance in the data.

6. Checking the effect of categorical features

We can add the first two principal components to our DataFrame and plot them with Seaborn's scatterplot(). To create the plot we saw earlier, we set the hue parameter to 'Height_class'. Now that we know that tall individuals are on the left and shorter individuals on the right, let's have a look at how Gender is associated with the variance.
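A sketch of these plotting steps, reusing the pc array from the pipeline above; the column names 'Gender' and 'BMI_class' are assumed placeholders for the categories discussed next:

```python
import seaborn as sns

# Add the first two principal components to the DataFrame
ansur_df['PC 1'] = pc[:, 0]
ansur_df['PC 2'] = pc[:, 1]

# Color the points by height category; swapping hue for 'Gender'
# or 'BMI_class' produces the plots discussed below
sns.scatterplot(data=ansur_df, x='PC 1', y='PC 2', hue='Height_class')
```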

7. Checking the effect of categorical features

It turns out females are mostly on the right, shorter side of our point cloud.

8. Checking the effect of categorical features

When we use the BMI class to color the points, we see that this feature is mostly aligned with the second principal component, although the first component also has an effect.

9. PCA in a model pipeline

To go beyond data exploration, we can add a model to the pipeline. In this case, we've added a random forest classifier and will predict gender on the 94 numeric features of the ANSUR dataset. Notice that we've told the PCA class to only calculate 3 components with the n_components parameter. We can access the different steps in the pipeline by using the names we gave each step as keys, just like a Python dictionary. When we use 'reducer' as the key, the PCA object is returned.
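A hedged sketch of such a model pipeline; X_train and y_train are assumed placeholders for the training split of the 94 numeric features and the gender labels:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Scale, reduce to 3 components, then classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reducer', PCA(n_components=3)),
    ('classifier', RandomForestClassifier()),
])

# Fit the full pipeline on the training data
pipe.fit(X_train, y_train)

# Step names act like dictionary keys; this returns the fitted PCA object
print(pipe['reducer'])
```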

10. PCA in a model pipeline

Once the pipeline has been fitted to the data, we can access attributes such as the explained_variance_ratio_ like so. When we sum these, we see that the first three components only explain 74% of the variance in the dataset. However, when we check the classification accuracy on the test set, we get an impressive 98-point-6%!
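A sketch of these last steps, assuming the fitted pipe from above and placeholder test splits X_test and y_test:

```python
# Variance explained by the three retained components
var = pipe['reducer'].explained_variance_ratio_
print(var.sum())                  # roughly 0.74 in the transcript's example

# Classification accuracy of the full pipeline on the test set
print(pipe.score(X_test, y_test)) # roughly 0.986 in the transcript's example
```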

11. Let's practice!

Now it's your turn to apply PCA for data exploration and modeling.