
Feature extraction

1. Feature extraction

Welcome to the final chapter of this course where we'll be looking into feature extraction! This comes down to calculating new features based on the existing ones while trying to lose as little information as possible.

2. Feature selection vs. extraction

In the previous two chapters we looked into feature selection where some features were simply dropped completely.

3. Feature selection vs. extraction

Feature extraction is different in the sense that it creates new features, which are combinations of the original ones. There are powerful algorithms that calculate the new features so that as much information as possible is preserved, but before we get into those, let's look at simpler feature extraction. When you have a good understanding of the features in your dataset, you can sometimes combine multiple features into a new one that makes the originals obsolete.

4. Feature generation - BMI

Take for instance the body mass index or BMI we've worked with before. It's a measure of whether a person is under- or overweight regardless of their height, and it can be calculated by dividing a person's weight by the square of their height.
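In pandas, this kind of feature generation is a one-liner. A minimal sketch, assuming hypothetical column names `weight_kg` and `height_m`:

```python
import pandas as pd

# Hypothetical body-measurement data (column names and values are assumptions)
df = pd.DataFrame({
    "weight_kg": [70.0, 55.0, 90.0],
    "height_m": [1.75, 1.60, 1.85],
})

# BMI = weight divided by the square of height
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```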

5. Feature generation - BMI

If we build a diabetes model on this data the height and weight features by themselves might be obsolete once we have the BMI.

6. Feature generation - BMI

And we could drop them from the dataset with the drop method to reduce dimensionality.
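A sketch of that drop, again assuming hypothetical column names:

```python
import pandas as pd

# Hypothetical data where BMI has already been computed (names are assumptions)
df = pd.DataFrame({
    "weight_kg": [70.0, 55.0],
    "height_m": [1.75, 1.60],
    "bmi": [22.9, 21.5],
})

# Drop the now-redundant source columns to reduce dimensionality
reduced_df = df.drop(["weight_kg", "height_m"], axis=1)
```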

7. Feature generation - averages

Imagine that our body measurement dataset had measurements of both left and right leg lengths. For most applications it would be sufficient to reduce these two features to a single leg length feature. We could create such a feature with the DataFrame's .mean() method, with the axis argument set to 1 so that the mean is taken across columns rather than rows.

8. Feature generation - averages

and then once again drop the original features.
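Both steps together might look like this, with hypothetical column names standing in for the real ones:

```python
import pandas as pd

# Hypothetical left/right leg-length measurements (names and values are assumptions)
df = pd.DataFrame({
    "leg_left_cm": [81.0, 78.5, 84.2],
    "leg_right_cm": [80.6, 78.9, 84.0],
})

# Average the two highly similar features into one; axis=1 averages across columns
df["leg_cm"] = df[["leg_left_cm", "leg_right_cm"]].mean(axis=1)

# Drop the originals to reduce dimensionality
df = df.drop(["leg_left_cm", "leg_right_cm"], axis=1)
```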

9. Cost of taking the average

Taking the average of two features comes at the cost of losing some information. In this case the cost is small since the features are so similar, but let's zoom in on this data to see exactly what is lost.

10. Cost of taking the average

We can now see the differences between both features more clearly.

11. Cost of taking the average

When we add a line for where the two features are equal it becomes easy to identify people with different leg lengths.

12. Cost of taking the average

The cost of taking the average is then that the average leg length for the three people in the red rectangle is the same. You lose the information on the difference in leg lengths.

13. Intro to PCA

Now let's take a step back and look at a different data sample with hand lengths versus feet lengths. Instead of taking the mean of both features we'll explore an alternative technique.

14. Intro to PCA

For this technique it's important to scale the features first, so that their values are easier to compare. We do this with sklearn's StandardScaler(). The strongest pattern in this dataset is that people with big feet also tend to have big hands.
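Scaling with StandardScaler gives each feature zero mean and unit variance. A minimal sketch, assuming hypothetical hand- and foot-length columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical hand- and foot-length data (names and values are assumptions)
df = pd.DataFrame({
    "hand_length_cm": [17.0, 19.5, 18.2, 20.1],
    "foot_length_cm": [24.0, 27.5, 25.8, 28.3],
})

# Standardize each column to zero mean and unit variance so values are comparable
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
```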

15. Intro to PCA

What we could do is add a reference point to the very center of the point cloud,

16. Intro to PCA

and then point a vector in the direction of this strongest pattern. People with a positive value for this vector have relatively long hands and feet, and people with a negative value have relatively short ones.

17. Intro to PCA

We could add a second vector, perpendicular to the first one, to account for the rest of the variance in this dataset. People with a positive value for this second vector have relatively long feet compared to their hand length, and people with a negative value have relatively big hands. Every point in this dataset can be described by scaling and then summing these two perpendicular vectors. We've essentially created a new reference system aligned with the variance in the data. The coordinates each point has in this new reference system are called principal components, and they are the foundation of principal component analysis, or PCA, which is the main topic of this chapter.
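In scikit-learn, fitting those perpendicular vectors and computing the coordinates is what the PCA class does. A sketch on hypothetical hand/foot data:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical hand- and foot-length data (names and values are assumptions)
df = pd.DataFrame({
    "hand_length_cm": [17.0, 19.5, 18.2, 20.1, 16.5],
    "foot_length_cm": [24.0, 27.5, 25.8, 28.3, 23.9],
})

# Scale first so both features contribute comparably
scaled = StandardScaler().fit_transform(df)

# Fit PCA: the components are the perpendicular vectors aligned with the variance,
# and the transformed values are each point's coordinates along them
pca = PCA()
pc = pca.fit_transform(scaled)

# The first component should capture the dominant "big feet go with big hands" pattern
ratios = pca.explained_variance_ratio_
```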

18. Let's practice!

Now it's your turn to generate some features.
