
What is feature engineering?

1. What is feature engineering?

Hi. My name is Jorge Zazueta, and I will be your instructor for this course. My work involves building models to understand social phenomena and solve business problems. Let us explore what feature engineering is and why it is crucial in machine learning and modeling in general.

2. What is feature engineering?

Feature engineering is the art and science of creating, transforming, extracting, and selecting variables to improve model performance and interpretability. Sometimes, feature engineering can have a greater impact on final accuracy than model selection, and a well-planned feature engineering process can save a lot of time. Consider the table to the right. It contains 100 observations of the height of an object thrown into the air, recorded at 100 points in time. Let's try modeling height using simple linear regression.

3. Why engineer features?

A naive way to approach our current problem is to create a linear regression model of height as a function of time, using the base R lm() function and setting our formula to predict height in terms of time from the height dataset. We then create a new dataset, df, by binding the columns of height and our predicted values. With df, we can use ggplot to plot our predictions against the data and see how miserably our model fails! From the graph to the right, it is evident that the behavior of the data is non-linear, rendering our model useless. We have two choices from here: we can try a different model, such as a Support Vector Machine or a Random Forest, or we can create a new variable that better reflects the relationship between these two variables.
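The naive approach described above can be sketched roughly as follows. The course's height dataset is not reproduced in this transcript, so the data here is simulated, and the launch speed, noise level, and the names lr_model and lr_pred are assumptions made only for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the course's height dataset (assumed values):
# 100 time points for an object thrown upward at 20 m/s, plus a little noise.
set.seed(42)
time <- seq(0, 4, length.out = 100)
height <- data.frame(
  time = time,
  height = 20 * time - 0.5 * 9.81 * time^2 + rnorm(100, sd = 0.5)
)

# Naive model: height as a linear function of time
lr_model <- lm(height ~ time, data = height)

# Bind the original columns with the predicted values
df <- cbind(height, lr_pred = predict(lr_model))

# Plot the observed points and the straight-line fit
ggplot(df, aes(x = time)) +
  geom_point(aes(y = height)) +
  geom_line(aes(y = lr_pred), color = "blue")
```

On data shaped like this, the straight line cuts through the arc of points and misses both the rise and the fall, which is the failure the plot is meant to expose.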

4. Using mutate()

Someone familiar with physics would tell us that the height of a projectile depends not only on time but on the square of time, as indicated in this formula. Armed with this idea, we can create a new feature, time squared, and build our model around it. Enter the mutate() function from the dplyr package. mutate() takes a data frame as its first argument, followed by the definition of a new variable to be added to the data frame. In our case, we are defining the new variable time_2 as the square of the original time variable and saving the result to df_2.
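The slide's formula is not reproduced in this transcript. The standard constant-acceleration relation is h(t) = h0 + v0*t - (1/2)*g*t^2, which is where the squared-time term comes from. A minimal sketch of the mutate() step follows, assuming a simulated df with time and height columns (the numeric values are hypothetical, not the course's data).

```r
library(dplyr)

# Hypothetical stand-in for the course's data (assumed values)
set.seed(42)
df <- data.frame(time = seq(0, 4, length.out = 100))
df$height <- 20 * df$time - 0.5 * 9.81 * df$time^2 + rnorm(100, sd = 0.5)

# Add the engineered feature: the square of time
df_2 <- df %>% mutate(time_2 = time^2)

# Inspect the first rows to confirm the new column is present
head(df_2)
```

mutate() leaves the existing columns untouched and appends time_2, so df itself stays available for comparison with the naive model.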

5. Predict using the engineered feature

Now that we have the squared time variable in our data frame, we can create a new linear regression model that recognizes the dependency on time_2. As we did before, we can plot our new predictions along with the original data to visually assess our improved model. This new fit is an impressive result! We achieved it not with a more sophisticated model, but simply by engineering a new, more informative feature from our raw data and adding it manually with mutate().
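A sketch of the improved model, under the same assumptions of simulated data and hypothetical names (lr_model_2, lr_pred_2). The transcript does not show the exact formula, so including both time and time_2 as predictors is an assumption based on the physics of the problem.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the course's data (assumed values),
# with the engineered time_2 feature already added
set.seed(42)
df_2 <- data.frame(time = seq(0, 4, length.out = 100)) %>%
  mutate(
    height = 20 * time - 0.5 * 9.81 * time^2 + rnorm(100, sd = 0.5),
    time_2 = time^2
  )

# Still a linear regression, but now with time squared as a predictor
lr_model_2 <- lm(height ~ time + time_2, data = df_2)

# Attach the new predictions to the data
df_2 <- df_2 %>% mutate(lr_pred_2 = predict(lr_model_2))

# Plot the observed points and the new fitted curve
ggplot(df_2, aes(x = time)) +
  geom_point(aes(y = height)) +
  geom_line(aes(y = lr_pred_2), color = "blue")

# Share of variance explained by the new model
summary(lr_model_2)$r.squared
```

The model is still linear in its coefficients. Only the inputs changed, which is why plain lm() can now follow the curved trajectory.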

6. Let's practice!

Let us try it out.