1. Feature engineering
Once we know the properties of the input data and have a reliable validation scheme, it's time to start building prediction models.
2. Solution workflow
Recall from the previous chapter the solution workflow for the competitions. We've already covered the first three blocks. Let's now consider the modeling stage.
3. Modeling stage
This stage is the longest one in the competition, and kind of feels like a marathon.
4. Modeling stage
During the modeling loop we pre-process data, create new features, enhance models, apply different tricks and iterate over and over again.
The majority of the ideas and experiments will not work, but the goal is to find the subset of actions that improves both local validation and Public Leaderboard scores.
5. Modeling stage
So, after any change we should look at the validation score. If we observe an improvement on local validation, we keep the change; otherwise, we discard it.
The important rule is to tweak only a single thing at a time, because changing multiple things does not allow us to detect what actually works and what doesn't.
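The keep-or-discard rule can be sketched as a simple loop. This is only an illustration: the greedy_select() helper, the evaluate callback and the toy scores are hypothetical stand-ins for a real local validation run, and we assume a metric where higher is better unless stated otherwise.

```python
def greedy_select(changes, evaluate, higher_is_better=True):
    """Try candidate changes one at a time; keep only those that improve validation."""
    accepted = []
    best_score = evaluate(accepted)  # baseline score with no changes applied
    for change in changes:
        trial = accepted + [change]  # exactly one new thing per experiment
        score = evaluate(trial)
        improved = score > best_score if higher_is_better else score < best_score
        if improved:
            accepted, best_score = trial, score  # keep the change
        # otherwise the change is discarded and we move on
    return accepted, best_score


# Toy stand-in for a validation run: every change has a fixed, made-up effect
toy_effects = {"log_price": 0.02, "drop_id": 0.0, "noisy_feature": -0.01}


def toy_evaluate(applied):
    return 0.70 + sum(toy_effects[name] for name in applied)


kept, score = greedy_select(list(toy_effects), toy_evaluate)
print(kept, round(score, 2))
```

In a real competition, evaluate would retrain the model and return the cross-validation score, so each iteration costs one full validation run.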
6. Feature engineering
This particular chapter is devoted to feature engineering, the process of creating new features. New features give our Machine Learning models additional information and consequently help them better predict the target variable.
7. Feature engineering
The ideas for new features can come from prior experience working with similar data.
Another source is EDA. Having looked at the data, we could potentially generate ideas for new valuable features.
One more source is domain knowledge of the problem we're solving. It allows us to use ideas and approaches that work for this particular domain.
8. Feature types
There are a number of different feature types. The most popular include:
Numerical features. These are ordinary numbers, measures and counts. For example, price, number of bedrooms and so on.
Categorical features. These describe a group the observation belongs to. For example, country names, marital status and so on.
Date features include various date and time information.
Coordinates describe geospatial data.
Text features contain different descriptions, addresses and so on.
Finally, images include some visual data for each observation.
9. Creating features
There are some situations when we need to generate features for train and test independently and for each validation split in the k-fold cross-validation. However, in the majority of cases features are created for train and test sets simultaneously.
For this purpose, we concatenate train and test DataFrames into a single DataFrame using pandas' concat() function.
Then we generate some new features.
Finally, we split the DataFrame back into train and test. We could use the .isin() method to select the rows with the original train and test ids, respectively.
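The concatenate, generate and split steps could look roughly like this. The column names (id, bedrooms, target), the values and the many_bedrooms feature are made up for illustration, and the sketch assumes that ids are unique across train and test.

```python
import pandas as pd

# Toy stand-ins for the competition train and test sets (values are made up)
train = pd.DataFrame({"id": [1, 2, 3], "bedrooms": [1, 2, 3], "target": [0, 1, 1]})
test = pd.DataFrame({"id": [4, 5], "bedrooms": [2, 4]})

# Remember which ids belong to which set before stacking
train_ids = train["id"]
test_ids = test["id"]

# Stack train and test; the target column is NaN for the test rows
data = pd.concat([train, test], ignore_index=True)

# Generate a new feature once for all rows (placeholder feature)
data["many_bedrooms"] = (data["bedrooms"] >= 3).astype(int)

# Split back into train and test using the original ids
train = data[data["id"].isin(train_ids)].copy()
test = data[data["id"].isin(test_ids)].copy()

print(train.shape, test.shape)
```

Keeping the ids in separate variables before concatenation avoids relying on row order when splitting back.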
10. Arithmetical features
The simplest engineered features are arithmetical features. We just take two numerical features, apply arithmetical operations to them and obtain new features.
Let's consider a subset of the Two Sigma Connect dataset with only the number of bathrooms and bedrooms in the apartments, together with the price.
Then, for example, we could generate such features as the price per bedroom, or the overall number of bedrooms and bathrooms, and so on.
11. Datetime features
Another type of data we will talk about in this lesson is datetime.
Let's look at the demand forecasting data. It contains item sales for each date.
To generate features from the date, we first convert the date column to a datetime type using pandas' to_datetime() function.
Then, we could use the .dt attribute and obtain any date feature we'd like.
12. Datetime features
For example, we could start with the year number, using the .dt accessor followed by the .year attribute.
Then, for example, the month number: January is encoded as 1, February as 2 and so on up to December, encoded as 12.
We can also get the number of the week within the year.
There are also various day features, such as the number of the day within the year, the month and the week. Note that day of the week encodes Monday as 0, Tuesday as 1, proceeding to Sunday as 6.
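Putting the date features together, a minimal sketch could look like this. The rows are made up to resemble the demand forecasting data, and the new column names are our own. For the week number we assume a recent pandas version (1.1 or later) and use .dt.isocalendar().week; older versions also exposed a .dt.weekofyear attribute.

```python
import pandas as pd

# Made-up rows resembling the demand forecasting data
df = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-12-31"],
    "sales": [13, 11, 14],
})

# Convert the string column to a datetime type
df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year            # e.g. 2017
df["month"] = df["date"].dt.month          # January = 1 ... December = 12
df["week"] = df["date"].dt.isocalendar().week  # ISO week number
df["dayofyear"] = df["date"].dt.dayofyear  # day number within the year
df["day"] = df["date"].dt.day              # day number within the month
df["dayofweek"] = df["date"].dt.dayofweek  # Monday = 0 ... Sunday = 6

print(df)
```

One thing to keep in mind is that ISO weeks start on Monday, so the first days of January can fall into the last week of the previous year. Here 2017-01-01 is a Sunday and gets week number 52.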
13. Let's practice!
All right, let's get some practical experience creating new numerical and date features!