1. Feature selection and engineering
Datasets often contain features that provide little or no predictive power and should be dropped prior to modeling.
2. Dropping unnecessary features
Examples of features that can be dropped are unique identifiers such as phone numbers, social security numbers, and account numbers. pandas DataFrames have a drop method that you can use to remove columns.
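As a minimal sketch of dropping an identifier column, the snippet below builds a tiny stand-in DataFrame. The data and the column names (`Phone`, `Day_Mins`, `Churn`) are assumptions made for illustration and may not match the actual telco dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
telco = pd.DataFrame({
    "Phone": ["382-4657", "371-7191", "358-1921"],
    "Day_Mins": [265.1, 161.6, 243.4],
    "Churn": ["no", "no", "yes"],
})

# drop returns a new DataFrame by default; columns= targets columns rather than rows.
telco_reduced = telco.drop(columns=["Phone"])

print(telco_reduced.columns.tolist())
```

Because drop does not modify the original DataFrame unless you reassign the result, `telco` still contains the `Phone` column after this call.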
3. Dropping correlated features
Features that are highly correlated with other features can also be dropped, as they provide little additional information to the model.
4-12. Feature correlation
The corr method allows you to explore the correlation between the features in your dataset. In our telco DataFrame, notice how Day Minutes, Evening Minutes, Night Minutes, and International Minutes are highly correlated with Day Charge, Evening Charge, Night Charge, and International Charge, respectively. Intuitively, it makes sense that these features should be correlated, and from a modeling standpoint, we can often improve the performance of our models by removing these redundant features.
This process of choosing which features to use in your model is known as feature selection.
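The correlation check described above can be sketched as follows. The toy data, the column names, and the 0.95 cutoff are all assumptions chosen for illustration. The charge column is built to be roughly proportional to the minutes column, mimicking the relationship described in the telco data.

```python
import pandas as pd

# Hypothetical toy data; Day_Charge is roughly 0.17 * Day_Mins by construction.
telco = pd.DataFrame({
    "Day_Mins": [265.1, 161.6, 243.4, 299.4, 166.7],
    "Day_Charge": [45.07, 27.47, 41.38, 50.90, 28.34],
    "CustServ_Calls": [1, 1, 0, 2, 3],
})

# Pairwise Pearson correlation between the numeric columns.
corr_matrix = telco.corr()
print(corr_matrix.round(2))

# Collect each pair of distinct columns whose absolute correlation
# exceeds an assumed threshold.
threshold = 0.95
cols = corr_matrix.columns
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr_matrix.loc[a, b]) > threshold
]

# Keep the first feature of each pair and drop the second.
to_drop = sorted({b for _, b in high_pairs})
telco_selected = telco.drop(columns=to_drop)
```

Which member of a correlated pair to keep is a judgment call. Here the second one is dropped purely for simplicity.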
13. Feature engineering
Besides selecting which features to use, you'll often also need to create new features to help improve model performance. This is known as feature engineering. Consulting with the business and subject matter experts can lead to additional features, and should be a step in every data science workflow. Churn modeling is no exception. Together with feature selection, feature engineering is a critical step that can add a lot of value to your final model.
14. Examples of feature engineering
One example of a new feature you could create is Total Minutes, which combines Day Minutes, Evening Minutes, Night Minutes, and International Minutes. Or you could create a new feature that is the ratio between Minutes and Charge. There's really no limit, and a lot of feature engineering comes down to understanding your domain and dataset really well.
You can create new features using pandas techniques that you are already familiar with. As an example, you could create a new feature called Day Cost, which is the ratio between Day Minutes and Day Charge.
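A minimal sketch of both engineered features mentioned above, Total Minutes and Day Cost, might look like this. The column names and values are assumptions for illustration, and the Day Cost definition follows the description given here (Day Minutes divided by Day Charge).

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
telco = pd.DataFrame({
    "Day_Mins": [265.1, 161.6],
    "Eve_Mins": [197.4, 195.5],
    "Night_Mins": [244.7, 254.4],
    "Intl_Mins": [10.0, 13.7],
    "Day_Charge": [45.07, 27.47],
})

# Total minutes: row-wise sum across the four minutes columns.
minute_cols = ["Day_Mins", "Eve_Mins", "Night_Mins", "Intl_Mins"]
telco["Total_Mins"] = telco[minute_cols].sum(axis=1)

# Day cost: ratio between day minutes and day charge, computed element-wise.
telco["Day_Cost"] = telco["Day_Mins"] / telco["Day_Charge"]

print(telco[["Total_Mins", "Day_Cost"]])
```

If a customer has a Day Charge of zero, the division yields inf or NaN, so such rows would likely need handling before modeling.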
15. Let's practice!
In the exercises, you'll have the opportunity to put feature selection and feature engineering into practice. Enjoy!