
Feature engineering and overfitting

1. Feature engineering and overfitting

Feature engineering uses domain knowledge and common sense to describe an object with numbers. Although adding more features can improve performance, it can also increase the risk of overfitting. In this lesson, you will learn more about this interesting trade-off.

2. Feature extraction from non-tabular data

Sometimes, the raw data cannot fit into the form of a table. For example, consider electrocardiogram (or ECG) traces for a number of individuals. Each ECG trace is a time series, possibly of variable length, that cannot fit in one cell of a table. Instead, in the dataset shown here, experts extracted over 250 one-dimensional numerical summaries from each ECG. These range from simple summaries like heart rate to very complex properties of the signal with weird names like T-wave-amp, all of which can be useful in detecting a medical condition known as arrhythmia.
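The sketch below illustrates the idea in miniature; it is not the arrhythmia dataset. It takes a few variable-length signals (random numbers standing in for ECG traces) and summarizes each one into a fixed set of numeric features that do fit in a table.

```python
import numpy as np
import pandas as pd

# Three variable-length signals standing in for ECG traces (illustrative data)
rng = np.random.RandomState(0)
traces = [rng.randn(500), rng.randn(730), rng.randn(610)]

rows = []
for signal in traces:
    rows.append({
        "mean": signal.mean(),        # average amplitude
        "std": signal.std(),          # variability of the signal
        "max": signal.max(),          # peak amplitude
        "n_samples": len(signal),     # length of the recording
    })

# Each variable-length trace becomes one fixed-length row of summaries
features = pd.DataFrame(rows)
print(features)
```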

3. Label encoding for categorical variables

Even if the data are tabular, some of the columns might be non-numeric. Here is an example from the credit scoring dataset: the purpose of the loan takes values such as "buy a new car", "education" or "retraining". LabelEncoder will map these values onto a range of numbers.
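As a rough sketch of what this looks like in code (the column name "purpose" and the category strings are stand-ins for the credit scoring data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy version of the "purpose" column from the credit scoring data
credit = pd.DataFrame({
    "purpose": ["buy_new_car", "education", "retraining", "education"]
})

le = LabelEncoder()
credit["purpose_encoded"] = le.fit_transform(credit["purpose"])

print(credit)
print(le.classes_)   # the integer assigned to each category is its index here
```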

4. Label encoding for categorical variables

But the classifier is then confused. It thinks that the categories have a natural ordering. For example, a decision tree might try to split the range in two. If it splits at 4, it is putting loans for business together with loans for a microwave oven!

5. One hot encoding for categorical variables

A different approach is to use one-hot encoding, implemented by the pandas function pd.get_dummies(). This creates one new dummy variable for each category, taking the value 1 for each example that falls in that category and 0 otherwise. You can see the first row of the data on the left, printed vertically for readability. No artificial ordering is introduced.
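A minimal sketch, again with a stand-in "purpose" column:

```python
import pandas as pd

# Toy version of the "purpose" column
credit = pd.DataFrame({
    "purpose": ["buy_new_car", "education", "buy_microwave_oven"]
})

# One dummy column per category; each row has exactly one 1
dummies = pd.get_dummies(credit["purpose"], prefix="purpose")
print(dummies.iloc[0])   # first row, printed vertically
```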

6. Keyword encoding for categorical variables

How about capturing semantic similarity? Notice that similar categories share keywords: for example, all consumer loans feature the keyword "buy". You can count common keywords using CountVectorizer from the feature_extraction module. First, replace underscores with spaces for easier tokenization. Then, apply the encoder using its .fit_transform() method. Finally, convert the resulting matrix to a DataFrame, naming the columns using the .get_feature_names() method of the CountVectorizer object.
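A hedged sketch of this pipeline is below; the category strings are illustrative, and note that the column-naming method is called .get_feature_names_out() in current scikit-learn versions (the older name used in this lesson was .get_feature_names()).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy categories; similar loans share keywords like "buy"
purpose = pd.Series(["buy_new_car", "buy_microwave_oven", "education"])

# Replace underscores with spaces so each keyword becomes its own token
purpose = purpose.str.replace("_", " ")

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(purpose)   # sparse matrix of keyword counts

# Convert to a DataFrame, naming the columns after the keywords
keywords = pd.DataFrame(counts.toarray(),
                        columns=vectorizer.get_feature_names_out())
print(keywords)
```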

7. Dimensionality and feature engineering

Note that as we improve our feature engineering pipeline, the dimension of our DataFrame increases! The question arises: how many features is too many?

8. Dimensionality and overfitting

Well, with more columns, the algorithm has more opportunity to mistake coincidental patterns for real signal. We can test this by adding columns to the data containing purely random numbers totally unrelated to the class. As we add more columns on the horizontal axis, overfitting kicks in! Accuracy improves in-sample but deteriorates out-of-sample.
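You can reproduce this effect in miniature with a sketch like the one below; the dataset and classifier are illustrative stand-ins, but the gap between in-sample and out-of-sample accuracy tends to widen as noise columns are added.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset with a handful of informative columns
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
rng = np.random.RandomState(0)

for n_noise in [0, 50, 200]:
    # Append columns of pure noise, totally unrelated to the class
    noise = rng.randn(X.shape[0], n_noise)
    X_aug = np.hstack([X, noise])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    print(n_noise,
          accuracy_score(y_tr, clf.predict(X_tr)),   # in-sample accuracy
          accuracy_score(y_te, clf.predict(X_te)))   # out-of-sample accuracy
```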

9. Feature selection

A popular solution is to add features freely, and then select the "best" ones using some feature selection technique. Let's try the trick from the previous slide, and augment the credit scoring dataset with 100 fake variables.
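A minimal sketch of that augmentation step, using stand-in features X and labels y in place of the credit scoring data:

```python
import numpy as np
import pandas as pd

# Stand-ins for the credit scoring features and labels
rng = np.random.RandomState(42)
X = pd.DataFrame(rng.rand(1000, 20), columns=[f"real_{i}" for i in range(20)])
y = pd.Series(rng.randint(0, 2, size=1000))

# Add 100 fake columns of pure noise
fake = pd.DataFrame(rng.rand(1000, 100),
                    columns=[f"fake_{i}" for i in range(100)])
X_aug = pd.concat([X, fake], axis=1)
print(X_aug.shape)   # (1000, 120)
```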

10. Feature selection

Then, we use the SelectKBest algorithm from the feature_selection module to select the 20 highest-scoring columns. We use the chi2 scoring method. The feature selector has a .fit() method to fit it to the data, and a .get_support() method that returns a Boolean mask over the columns (or their indices, if you pass indices=True). Thankfully, only a handful of fake columns remain in the selected features.
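In code, using the same kind of non-negative stand-in data as above (chi2 requires non-negative features and a class label), the selection step looks roughly like this:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in for the augmented credit data: 20 "real" and 100 "fake" columns
rng = np.random.RandomState(42)
X_aug = pd.DataFrame(rng.rand(1000, 120),
                     columns=[f"real_{i}" for i in range(20)] +
                             [f"fake_{i}" for i in range(100)])
y = rng.randint(0, 2, size=1000)

# Keep the 20 columns with the highest chi2 score
selector = SelectKBest(score_func=chi2, k=20)
selector.fit(X_aug, y)

# get_support() returns a Boolean mask; get_support(indices=True) gives positions
selected = X_aug.columns[selector.get_support()]
print(selected)
```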

11. Tradeoffs everywhere!

So remember this: every decision you make in your pipeline might affect other aspects of it, and in particular the risk of overfitting. The following exercises confirm this insight.
