Tree-based feature selection
1. Tree-based feature selection
Some models will perform feature selection by design to avoid overfitting.
2. Random forest classifier
One of those is the random forest classifier. It's an ensemble model that passes different random subsets of the features to a number of decision trees. To make a prediction, it aggregates the predictions of the individual trees. The example forest shown here contains four decision trees. While simple in design, random forests are often highly accurate and avoid overfitting even with the default scikit-learn settings.
3. Random forest classifier
If we train a random forest classifier on the 93 numeric features of the ANSUR dataset to predict gender, its test set accuracy is 99%. This means it managed to escape the curse of dimensionality and didn't overfit on the many features in the training set.
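A minimal sketch of that experiment, assuming the 93 numeric ANSUR features live in a DataFrame X and the gender labels in a Series y (both placeholder names):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a test set to check for overfitting (X and y are placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Default scikit-learn settings already work well for random forests
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Test set accuracy
print(accuracy_score(y_test, rf.predict(X_test)))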
4. Random forest classifier
In this illustration of what the trained model could look like, the first decision tree in the forest used the neck circumference feature in its first decision node and hand length later on to determine whether a person was male or female. By averaging how often features are used to make decisions inside the different decision trees, and taking into account whether these are important decisions near the root of a tree or less important decisions in its smaller branches, the random forest algorithm calculates feature importance values.
5. Feature importance values
These values can be extracted from a trained model with the feature_importances_ attribute. Just like the coefficients produced by the logistic regressor, these feature importance values can be used to perform feature selection, since they will be close to zero for unimportant features. An advantage of feature importance values over coefficients is that they are comparable between features by default, since they always sum up to one. This means we don't have to scale our input data first.
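For instance, reusing the rf model fitted above:

# Importance of each feature, in the same order as the training columns
print(rf.feature_importances_)

# The values are comparable across features because they sum to one
print(rf.feature_importances_.sum())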
6. Feature importance as a feature selector
We can use the feature importance values to create a True/False mask for features that meet a certain importance threshold. Then, we can apply that mask to our feature DataFrame to implement the actual feature selection.
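A sketch of that masking step, using an illustrative threshold of 0.10 and the placeholder feature DataFrame X from before:

# True/False mask for features whose importance exceeds the threshold
mask = rf.feature_importances_ > 0.10

# Apply the mask to the DataFrame to keep only the important columns
X_reduced = X.loc[:, mask]
print(X_reduced.columns)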
7. RFE with random forests
Remember that dropping one weak feature can make other features relatively more or less important. If you want to play it safe and minimize the risk of dropping the wrong features, you should not drop all the least important features at once but rather remove them one by one. To do so, we can once again wrap a Recursive Feature Eliminator, or RFE(), around our model. Here, we've reduced the number of features to six with no reduction in test set accuracy. However, training the model once for each feature we want to drop can result in too much computational overhead.
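A sketch of that wrapper, eliminating one feature per iteration until six remain (same placeholder data as before):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Drop the single least important feature on every iteration
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=6, verbose=1)
rfe.fit(X_train, y_train)

# Test set accuracy of the final model on the six selected features
print(accuracy_score(y_test, rfe.predict(X_test)))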
8. RFE with random forests
To speed up the process, we can pass the step parameter to RFE(). Here, we've set it to 10 so that on each iteration the 10 least important features are dropped. Once the final model is trained, we can use the feature eliminator's .support_ attribute as a mask to print the remaining column names.
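With the step argument added (still assuming the same placeholder data):

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Drop the 10 least important features on each iteration
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=6, step=10, verbose=1)
rfe.fit(X_train, y_train)

# .support_ is a True/False mask over the original columns
print(X.columns[rfe.support_])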
9. Let's practice!
Now it's your turn to perform feature selection with random forests.