1. Feature selection
Once we've settled on a feature set for modeling, it's important to take a closer look at those features. Do we need all of them, and do we understand how each one will impact the model?
2. What is feature selection?
Feature selection is the process of choosing which features from an existing feature set to use for modeling. Because it draws only from existing features, it's different from feature engineering, which creates new features. The overarching goal of feature selection is to improve the model's performance. Perhaps our existing feature set is much too large, or some of the features we're working with are unnecessary.
There are different ways we can perform feature selection. It's possible to do it in an automated way. scikit-learn has several methods for automated feature selection, such as choosing a variance threshold and using univariate statistical tests, but we won't cover these here. Most of the methods we'll cover in this chapter are more on the manual side, because it's important to truly understand our dataset before using it to train a model.
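For reference, here's a minimal sketch of what those automated selectors look like in scikit-learn, using the built-in iris dataset as a stand-in; the threshold and k values are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Drop features whose variance falls below a chosen threshold.
vt = VarianceThreshold(threshold=0.2)
X_high_variance = vt.fit_transform(X)

# Keep the k features with the strongest univariate relationship to y,
# scored here with the ANOVA F-test.
skb = SelectKBest(score_func=f_classif, k=2)
X_best = skb.fit_transform(X, y)

print(X.shape, X_high_variance.shape, X_best.shape)
```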
3. When to select features
In this chapter, we'll cover three specific scenarios for feature selection. Sometimes, it helps to get rid of noise in our model. Maybe we have redundant features, like keeping latitude and longitude alongside city and state as geographical features, which can add noise. Or maybe we have features that are strongly statistically correlated, which violates the assumptions of certain models, like linear regression's assumption of independent features, and hurts model performance. If we're working with text vectors, we'll want to use those tf-idf vectors to determine which set of words to train our model on. And finally, if our feature set is large, it may be beneficial to use dimensionality reduction to combine features and shrink the feature space while still capturing as much of the original variance as possible.
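A quick sketch of two of these checks, using a hypothetical pandas DataFrame with illustrative column names and values: inspecting pairwise correlations to spot a redundant feature, then applying PCA for dimensionality reduction.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical feature set; temp_c is redundant (derived from temp_f).
df = pd.DataFrame({
    "latitude":  [40.7, 34.1, 41.9, 29.8],
    "longitude": [-74.0, -118.2, -87.6, -95.4],
    "temp_f":    [68.0, 75.0, 55.0, 88.0],
    "temp_c":    [20.0, 23.9, 12.8, 31.1],
})

# Inspect pairwise Pearson correlations to spot strongly related columns.
print(df.corr())

# Drop one column from a highly correlated pair.
df_reduced = df.drop(columns=["temp_c"])

# Dimensionality reduction: project onto fewer components that capture
# most of the original variance.
pca = PCA(n_components=2)
components = pca.fit_transform(df_reduced)
print(pca.explained_variance_ratio_)
```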
4. Let's practice!
Time to test your knowledge about feature selection.