1. Feature engineering
In this chapter, we're going to talk about a very important part of the preprocessing workflow: feature engineering.
2. What is feature engineering?
Real-world data is often not neat and tidy, and in addition to preprocessing steps like standardization, we'll likely have to extract and expand information from existing features.
Feature engineering is the creation of new features from existing ones. It adds information to the dataset that can improve prediction or clustering tasks, or give insight into relationships between features. DataCamp has dedicated courses on feature engineering, so in this chapter, we'll just focus on the key components for preprocessing.
There are automated ways to create new features, but for now, we're going to cover manual methods of feature engineering. These methods require in-depth knowledge of the dataset we're working with, and which features are worth creating depends heavily on the particular dataset you're analyzing.
The goal for this chapter is to demonstrate some scenarios where feature engineering can be useful.
3. Feature engineering scenarios
There are a variety of scenarios where we might want to engineer features from existing data. An extremely common one is text data. For example, if we're building some kind of natural language processing model, we'll have to turn the words in our dataset into a numeric vector, such as a count of how often each word appears. Another scenario also involves string data: maybe we have a column of people's favorite colors. To feed this information into a scikit-learn model, we'll have to encode it numerically.
4. Feature engineering scenarios
Another common example is timestamps. We might see a full timestamp that includes the time down to the second or millisecond, which is often far too granular for a prediction task, so we can create a new column that contains just the day or the month component. Some columns contain a list of values, such as test scores or running times, and it may be more useful to work with their average. These are all examples of situations where we'd want to generate new features from existing columns.
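As a rough illustration of the timestamp and list scenarios, the sketch below uses pandas on a made-up DataFrame. The column names `timestamp` and `run_times`, and the values in them, are assumptions for the example only.

```python
import pandas as pd

# Hypothetical toy data (assumption, not the course dataset).
df = pd.DataFrame({
    "timestamp": ["2023-01-15 08:30:12", "2023-02-20 17:05:47", "2023-02-28 23:59:59"],
    "run_times": [[25.1, 24.8, 26.0], [30.2, 29.9], [22.0, 21.5, 21.9, 22.2]],
})

# Timestamps: parse the strings, then keep only the coarser components.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day

# Lists: collapse each list of running times into a single average.
df["mean_run_time"] = df["run_times"].apply(lambda times: sum(times) / len(times))

print(df[["month", "day", "mean_run_time"]])
```

The original columns are left in place here, so we can decide later whether to drop them or keep them alongside the new features.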
5. Let's practice!
Let's take a look at a dataset to determine where feature engineering might be useful.