Feature extraction
1. Feature extraction
Welcome to the final chapter of this course, where we'll be looking into feature extraction! This comes down to calculating new features based on the existing ones while trying to lose as little information as possible.
2. Feature selection vs. extraction
In the previous two chapters we looked into feature selection, where some features were simply dropped completely.
3. Feature selection vs. extraction
Feature extraction is different in the sense that it creates new features, which are in fact combinations of the original ones. There are powerful algorithms that will calculate the new features in a way that preserves as much information as possible, but before we get into those, let's look at simpler feature extraction. When you have a good understanding of the features in your dataset, you can sometimes combine multiple features into a new feature that makes the original ones obsolete.
4. Feature generation - BMI
Take for instance the body mass index, or BMI, we've worked with before. It's a measure of whether a person is under- or overweight regardless of their height, and it can be calculated by dividing a person's weight by the square of their height.
5. Feature generation - BMI
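As a minimal sketch of the calculation just described, the snippet below derives a BMI column with pandas. The sample values, the column names `weight_kg` and `height_m`, and the choice of kilograms and meters are assumptions made for illustration; they are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical sample; column names and units (kg, m) are assumptions.
df = pd.DataFrame({
    "weight_kg": [81.5, 72.6, 92.9],
    "height_m": [1.78, 1.70, 1.85],
})

# BMI = weight divided by the square of the height.
df["BMI"] = df["weight_kg"] / df["height_m"] ** 2

print(df.round(2))
```

The division is applied element-wise, so every row gets its own BMI value without an explicit loop.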
If we build a diabetes model on this data, the height and weight features by themselves might be obsolete once we have the BMI.
6. Feature generation - BMI
And we could drop them from the dataset with the .drop() method to reduce dimensionality.
7. Feature generation - averages
Imagine that our body measurement dataset had measurements of both left and right leg lengths. For most applications it would be sufficient to reduce these two features into a single leg length feature. We could create such a feature with the DataFrame's .mean() method with the axis argument set to 1
8. Feature generation - averages
and then once again drop the original features.
9. Cost of taking the average
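The averaging and dropping steps described above might look like this in pandas. This is a sketch on made-up numbers: the column names `left_leg_cm`, `right_leg_cm` and `leg_cm` are hypothetical, and the left-right difference is computed only to make visible what the average leaves out.

```python
import pandas as pd

# Hypothetical measurements in cm; column names are assumptions.
df = pd.DataFrame({
    "left_leg_cm": [82.0, 80.0, 84.0],
    "right_leg_cm": [82.0, 84.0, 80.0],
})

leg_cols = ["left_leg_cm", "right_leg_cm"]

# Row-wise mean (axis=1) combines the two columns into a single feature.
df["leg_cm"] = df[leg_cols].mean(axis=1)

# Keep the left-right difference aside to see what the average discards.
difference = df["left_leg_cm"] - df["right_leg_cm"]

# Drop the original features to reduce dimensionality.
reduced = df.drop(leg_cols, axis=1)

print(reduced)
print(difference)
```

In this made-up sample all three people end up with the same averaged leg length even though their left and right measurements differ, which is exactly the kind of information the average throws away.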
Taking the average of two features comes with the cost of losing some information. In this case the cost is small since the features are so similar, but let's zoom in on this data to identify it.
10. Cost of taking the average
We can now see the differences between both features more clearly.
11. Cost of taking the average
When we add a line for where the two features are equal, it becomes easy to identify people with different leg lengths.
12. Cost of taking the average
The cost of taking the average is then that the average leg length for the three people in the red rectangle is the same. You lose the information on the difference in leg lengths.
13. Intro to PCA
Now let's take a step back and look at a different data sample with hand lengths versus foot lengths. Instead of taking the mean of both features, we'll explore an alternative technique.
14. Intro to PCA
For this technique it's important to scale the features first, so that their values are easier to compare. We do this with sklearn's StandardScaler(). The strongest pattern in this dataset is that people with big feet also tend to have big hands.
15. Intro to PCA
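The scaling step mentioned above could be sketched as follows. The data here is synthetic and generated only for illustration (the column names, means and spreads are assumptions, not the course dataset); the point is that after `StandardScaler` each feature has a mean of about zero and unit variance, while the correlation between the features is unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated foot and hand lengths (cm), for illustration only.
rng = np.random.default_rng(0)
foot = rng.normal(26.0, 1.5, size=200)
hand = 0.7 * foot + rng.normal(0.0, 0.5, size=200)
df = pd.DataFrame({"foot_length": foot, "hand_length": hand})

# Standardize each column: subtract its mean, divide by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.mean().round(3))        # per-feature mean after scaling
print(scaled.std(ddof=0).round(3))   # per-feature standard deviation after scaling
print(scaled.corr().round(2))        # the big-feet, big-hands pattern shows up as a positive correlation
```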
What we could do is add a reference point to the very center of the point cloud,
16. Intro to PCA
and then point a vector in the direction of this strongest pattern. People with a positive value for this vector have relatively long hands and feet, and people with a negative value have relatively short ones.
17. Intro to PCA
We could add a second vector perpendicular to the first one to account for the rest of the variance in this dataset. People with a positive value for this second vector have relatively long feet compared to their hand length, and people with a negative value have relatively long hands. Every point in this dataset could be described by multiplying each of these two perpendicular vectors by a value and then summing the results. We've essentially created a new reference system aligned with the variance in the data. The coordinates that each point has in this new reference system are called principal components, and they are the foundation of principal component analysis, or PCA, which is the main topic of this chapter.
18. Let's practice!
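Before practicing, here is a rough sketch of the reference system described above, using scikit-learn's `PCA` on synthetic hand and foot data. The data and variable names are assumptions made for illustration. The sketch checks that the two vectors are perpendicular and that each point can be rebuilt by multiplying the vectors by that point's coordinates and summing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated foot and hand lengths (cm), for illustration only.
rng = np.random.default_rng(0)
foot = rng.normal(26.0, 1.5, size=200)
hand = 0.7 * foot + rng.normal(0.0, 0.5, size=200)

# Scale first, then fit PCA with as many components as there are features.
X = StandardScaler().fit_transform(np.column_stack([foot, hand]))
pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # coordinates of each point in the new reference system

# The two unit vectors that span the new reference system.
first, second = pca.components_

print("first vector:", first.round(2))
print("second vector:", second.round(2))
print("perpendicular:", np.isclose(first @ second, 0.0))
print("share of variance:", pca.explained_variance_ratio_.round(2))

# Rebuild every point: center + coordinate 1 * first vector + coordinate 2 * second vector.
rebuilt = pca.mean_ + coords[:, [0]] * first + coords[:, [1]] * second
print("reconstruction matches:", np.allclose(rebuilt, X))
```

The sign of each vector is a convention of the solver, so a positive coordinate may point toward short rather than long hands and feet. The direction of the axis is what matters, not its sign.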
Now it's your turn to generate some features.