1. Feature Generation
In this video, we will learn a lot about the nuts and bolts of feature engineering. Just because it's called 'machine learning' doesn't mean that it can figure everything out on its own. So we will use some tricks to help it out by creating new features that will better capture patterns in the data. This video will cover feature generation and show you how using the new features can improve a model.
2. Why generate new features?
Why generate new features if the information is already available in the dataset? Combining features together can capture subtle dependent effects between them that impact the outcome variable. These can be represented by multiplying, summing, differencing or dividing two or more variables.
3. Why generate new features?
To see the impact of generating these features, suppose you have two attributes, length and width, along with the price of a single-story home. If these are your only two features, how can you best create a model to predict price? On their own, neither looks like a very strong predictor.
4. Combining Two Features
Taking the previous example a step further, we can think about how a person might buy a home. If we use the intuition that people often consider the area of a home, we can create a new feature, Total Square Footage, by multiplying the width and length. The results are much better, with an R-squared of 0.81! Applying your reasoning and understanding of the problem can help you build powerful predictors.
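As a rough illustration of why the area feature helps, the sketch below compares R-squared for single-feature linear fits on a handful of made-up homes. The data, the `r_squared` helper, and the assumption that price is exactly proportional to area are all hypothetical, so the numbers will not match the R-squared quoted above. The point is only that the product of length and width can explain price far better than either one alone.

```python
def r_squared(x, y):
    """R-squared of a one-feature least-squares line (equals the squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical homes: (length, width) in feet.
homes = [(20, 50), (50, 20), (30, 30), (40, 25), (25, 60), (60, 30)]
length = [l for l, _ in homes]
width = [w for _, w in homes]

# Generated feature: area is the product of the two raw features.
area = [l * w for l, w in homes]

# Assumption for this sketch: price is simply 100 per square foot.
price = [100 * a for a in area]

r2_length = r_squared(length, price)
r2_width = r_squared(width, price)
r2_area = r_squared(area, price)

print(f"length: {r2_length:.2f}  width: {r2_width:.2f}  area: {r2_area:.2f}")
```

Real prices are noisier than this, so a generated feature will rarely fit perfectly, but the comparison against the raw features is made the same way.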
5. Other Ways to Combine Two Features
Our dataset doesn't include WIDTH and LENGTH because no one would ever actually look for a house that way. It doesn't include Total Square Footage either, but we can create it with withColumn by adding SQFTBELOWGROUND and SQFTABOVEGROUND together.
We can build another feature, PRICEPERTSQFT, using our previous feature TSQFT. This is now a combination of three independent variables. There is no limit to how deep you can go, but the result becomes hard to interpret beyond three.
We can create DAYSONMARKET as a difference between LISTDATE and OFFMARKETDATE. We will cover how to get LISTDATE and OFFMARKETDATE into the datetime format in the next section but for now, know that new features can be generated many ways!
6. What's the limit?
There is a major push in the data science community to automate some of the generation of features. If this is of interest to you, I'd recommend you check out the Python libraries Featuretools and tsfresh.
I will caution you that simply multiplying every pair of features makes the number of new features grow roughly with the square of the number you started with. This can cause an explosion of features that is unwieldy to model, and some of them may appear predictive by pure coincidence, which leads to overfitting. Many of the features may also convey similar information and won't be needed.
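To get a feel for that growth, here is a small sketch that enumerates the distinct pairs for a few features and counts them for larger sets. The feature list is illustrative (`LISTPRICE` is an assumed name), and the count covers each unordered pair once, which gives n(n-1)/2 products for n features.

```python
from itertools import combinations
from math import comb

# Illustrative feature names; LISTPRICE is an assumed column name.
features = ["SQFTBELOWGROUND", "SQFTABOVEGROUND", "LISTPRICE", "DAYSONMARKET"]

# Every distinct pair that could be multiplied into a new feature.
pairs = list(combinations(features, 2))
for left, right in pairs:
    print(f"{left} * {right}")

# The same count for larger feature sets, without listing the pairs.
pair_counts = {n: comb(n, 2) for n in (10, 100, 1000)}
print(pair_counts)
```

Even at a hundred starting features, the pairwise products alone number in the thousands, which is why it usually pays to choose combinations using domain reasoning instead of generating all of them.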
Lastly, there is no limit to how many features you can combine but the interpretability certainly takes a steep dive after three. Beyond this is the realm of deep feature generation, a topic for another course!
7. Go forth and combine!
In this video, you learned that you can generate powerful new features that capture complex relationships between existing ones. You also saw that feature combinations are everywhere and that many are already in our dataset. It's your turn to take what you learned and build and evaluate new features generated from what's available!