1. Removing redundant features
Feature selection's main goal is to remove unnecessary features from our dataset that might create noise when modeling, so let's talk about redundant features.
2. Redundant features
One of the easiest ways to determine if a feature is unnecessary is to check whether it is redundant in some way: for example, if the same information exists in another form in a different feature, or if two features are very strongly correlated. Sometimes, when you create features through feature engineering, you end up duplicating existing features in some way. Some redundant features can be identified manually, simply by understanding the features in our dataset. It should be noted that, like the machine learning process in general, feature selection is an iterative process. We might try removing some features only to find that it doesn't improve our model's performance, and we may have to reassess our selection choices.
3. Scenarios for manual removal
There are a variety of scenarios in which manually removing features makes sense. The first is if our dataset contains repeated information in its feature set. For example, we may see columns for city, state, latitude, and longitude in the same dataset. Perhaps, for our modeling task, latitude and longitude are specific enough, or perhaps we only need the high-level state information. Or, a dataset might record whether an animal is a dog or a cat as well as its specific breed. We might want to drop one or the other, depending on the end goal.
Another scenario occurs through feature engineering. If we applied feature engineering to extract numbers from a text feature, it's unlikely that we'd need to keep the original text feature. If we took an average to use as an aggregate statistic, it's likely that we could drop the values that generated that aggregate statistic.
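As a quick illustration, here is a minimal sketch of dropping redundant columns with pandas' drop method. The column names and values are made up for this example; the idea is simply that once an engineered feature exists, the columns it duplicates can be removed.

```python
import pandas as pd

# Hypothetical dataset: location columns at two levels of granularity,
# plus a text review and a numeric score already extracted from it
df = pd.DataFrame({
    "city": ["Boston", "Denver"],
    "state": ["MA", "CO"],
    "lat": [42.36, 39.74],
    "long": [-71.06, -104.99],
    "review_text": ["Great product, 5 stars", "Only 2 stars"],
    "review_score": [5, 2],  # engineered from review_text
})

# Keep only the granularity we need: latitude/longitude and the extracted score
df_reduced = df.drop(columns=["city", "state", "review_text"])
print(df_reduced.head())
```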
4. Correlated features
Another clear situation in which we'd want to drop features is when they are highly statistically correlated, meaning they move together directionally. Linear models in particular assume that features are independent of one another, and strongly correlated features can introduce bias into the model. Let's use Pearson's correlation coefficient to check a feature set for correlation. The Pearson correlation coefficient measures this directionality: a score close to 1 for a pair of features means they strongly move together in the same direction, a score close to 0 means the features are not correlated, and a score close to -1 means they are strongly negatively correlated, that is, one feature increases in value while the other decreases. We can calculate the Pearson correlation coefficient for each pair of features using pandas.
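To make those three cases concrete, here is a small sketch (with invented toy data, not from the lesson) using pandas' Series.corr, which computes the Pearson coefficient by default:

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5])
b = pd.Series([2, 4, 6, 8, 10])  # moves with a -> correlation of 1.0
c = pd.Series([5, 4, 3, 2, 1])   # moves opposite a -> correlation of -1.0
d = pd.Series([3, 1, 4, 1, 3])   # no linear relationship to a -> correlation of 0.0

print(a.corr(b))  # 1.0
print(a.corr(c))  # -1.0
print(a.corr(d))  # 0.0
```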
5. Correlated features
Here we have a dataset with some numerical values. To check the correlations within a dataset, we call the corr method on the DataFrame. This returns the correlation scores between each pair of features in the dataset. We can see that features A and B score close to 1, so we should likely drop one of those features.
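The snippet below is a minimal sketch of that workflow using made-up column names and values: call corr on the DataFrame to get the pairwise Pearson scores, then drop one member of any highly correlated pair.

```python
import pandas as pd

# Hypothetical numeric dataset: B is roughly a multiple of A, C is unrelated
df = pd.DataFrame({
    "A": [1.0, 2.1, 2.9, 4.2, 5.1],
    "B": [2.1, 4.0, 6.2, 8.1, 9.9],
    "C": [7.5, 3.2, 9.1, 0.4, 5.8],
})

# Pearson correlation scores for every pair of columns
print(df.corr())

# A and B score close to 1, so we keep only one of them
df = df.drop(columns=["B"])
```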
6. Let's practice!
Now it's your turn to check for redundant features.