
Feature engineering

1. Feature engineering

In this chapter, we will explore the recipes package and feature engineering, which is the process of transforming data into a format that is suitable for machine learning algorithms.

2. Feature engineering with the recipes package

Feature engineering is accomplished with the recipes package. It is designed to help with all stages of feature engineering, which include assigning variable roles to the columns of our data, defining preprocessing tasks and data transformations, training our data transformations, and applying them to new data sources.

3. Specifying variable types and roles

The first step in feature engineering is assigning each column in our data to either an outcome or predictor role and determining its data type, which can be either numeric or categorical. The recipe() function is used for this task.

4. Data preprocessing steps

The next step involves defining a sequence of data preprocessing steps, which can include missing data imputation, centering and scaling numeric variables, creating new variables from ratios of existing variables, and many more possibilities. These transformations are encoded with unique step_*() functions.
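As a rough sketch of how such a sequence might look, the example below chains mean imputation, a ratio variable, and centering and scaling. The data frame is a made-up stand-in, and the column names pages_per_visit and time_per_page are assumptions for illustration, not columns confirmed by the course.

```r
library(recipes)

# Hypothetical toy data; only purchased and total_time are named in the course
toy_leads <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, NA, 860, 10, 1500),
  pages_per_visit = c(2.5, 1.0, 3.0, 4.2, 1.0, 2.0)
)

toy_rec <- recipe(purchased ~ ., data = toy_leads) %>%
  # Fill missing total_time values with the training mean
  step_impute_mean(total_time) %>%
  # Create a new variable as a ratio of two existing variables
  step_mutate(time_per_page = total_time / pages_per_visit) %>%
  # Center and scale every numeric predictor
  step_normalize(all_numeric_predictors())

# Train the steps and return the preprocessed training data
toy_baked <- bake(prep(toy_rec, training = toy_leads), new_data = NULL)
toy_baked
```

Each step_*() call adds one operation to the recipe, and the operations run in the order they were added.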

5. Training preprocessing steps

After preprocessing steps are defined, they need to be trained and estimated with data. This includes things such as calculating the mean and standard deviation of numeric columns for centering and scaling data and storing formulas for creating new columns. The prep() function is used for this task.

6. Applying recipes to new data

The final step of feature engineering is to apply the trained data transformations to the training and test datasets as well as new sources of data for future predictions. This is an important step, as machine learning algorithms require the same data format as was used during model training to predict new values. The bake() function from recipes is used for this task.

7. Simple feature engineering pipeline

To demonstrate a simple feature engineering pipeline, let's build a recipe to log transform the total_time variable in the lead scoring dataset. This is a common transformation for variables with large values because it compresses the range of data values and can reduce variability.
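To see the compression effect in isolation, here is a small base R sketch on made-up values (these numbers are not taken from the lead scoring dataset):

```r
# Hypothetical total_time values spanning several orders of magnitude
total_time <- c(10, 100, 1000, 10000)

range(total_time)   # 10 10000

# After a base-10 log transform the same values span a much narrower range
log_time <- log10(total_time)
range(log_time)     # 1 4
```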

8. Building a recipe object

First we pass our model formula, purchased ~ . (purchased tilde dot), to the recipe() function. This will assign the purchased column as the outcome variable and all other columns as predictor variables. Then we pass the leads_training data to the data argument. This will be used to determine the data types of each column in our data. Then we pass our recipe object to the step_log() function, provide the total_time column, and select a base of 10. Printing a recipe object will display the number of outcome and predictor variables as well as the encoded preprocessing operations.
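The code for this slide might look roughly like the following. The leads_training data frame here is a small invented stand-in so the example runs on its own, and the object name leads_log_rec and the columns pages_per_visit and lead_source are assumptions.

```r
library(recipes)

# Hypothetical stand-in for leads_training; values are invented
leads_training <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, 100, 860, 10, 1500),
  pages_per_visit = c(2.5, 1.0, 3.0, 4.2, 1.0, 2.0),
  lead_source = factor(c("email", "organic", "email",
                         "direct", "organic", "direct"))
)

# purchased is the outcome, every other column is a predictor
leads_log_rec <- recipe(purchased ~ ., data = leads_training) %>%
  # Log transform total_time with base 10
  step_log(total_time, base = 10)

# Printing shows the variable counts and the encoded operations
leads_log_rec
```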

9. Explore variable roles and types

When a recipe object is passed to the summary() function, a tibble with variable information is returned. The type column lists the variable data types, which are either numeric or nominal (for categorical variables). The role column captures variable roles for modeling based on the provided model formula.
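A minimal sketch of this call, again on an invented stand-in for leads_training (the object names are assumptions). The exact labels shown in the type column can differ between recipes versions.

```r
library(recipes)

# Hypothetical stand-in for leads_training; values are invented
leads_training <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, 100, 860, 10, 1500),
  lead_source = factor(c("email", "organic", "email",
                         "direct", "organic", "direct"))
)

leads_log_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_log(total_time, base = 10)

# One row per column, with its type and its role in the model formula
rec_summary <- summary(leads_log_rec)
rec_summary
```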

10. Training a recipe object

Next, we train our recipe by passing it to the prep() function. The training argument of prep() specifies the data on which to train data preprocessing steps. This should always be the training data. Printing a trained recipe will display which operations were successfully trained.
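Training might be sketched as below, using an invented stand-in for leads_training. The name leads_log_rec_prep is an assumption chosen for illustration.

```r
library(recipes)

# Hypothetical stand-in for leads_training; values are invented
leads_training <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, 100, 860, 10, 1500)
)

leads_log_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_log(total_time, base = 10)

# Train every step in the recipe on the training data
leads_log_rec_prep <- prep(leads_log_rec, training = leads_training)

# Printing a trained recipe reports which operations were trained
leads_log_rec_prep
```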

11. Transforming the training data

To apply our recipe to existing or new data, we must pass it to the bake() function. The new_data argument of bake() specifies the data to which the trained recipe is applied. Since leads_training was used to train our recipe, prep() retained the preprocessed training data by default (retain = TRUE), so setting new_data to NULL will return the preprocessed training data. Notice that total_time is now on a logarithmic scale.
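A sketch of retrieving the preprocessed training data, using an invented stand-in for leads_training (object names are assumptions):

```r
library(recipes)

# Hypothetical stand-in for leads_training; values are invented
leads_training <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, 100, 860, 10, 1500)
)

leads_log_rec_prep <- recipe(purchased ~ ., data = leads_training) %>%
  step_log(total_time, base = 10) %>%
  prep(training = leads_training)

# new_data = NULL returns the training data retained by prep()
baked_training <- bake(leads_log_rec_prep, new_data = NULL)

# total_time now holds base-10 logarithms of the original values
baked_training
```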

12. Transforming new data

To transform the test dataset, pass it to the new_data argument. The trained recipe will apply all steps to the new data source.
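The same trained recipe can be applied to held-out data, roughly as follows. Both data frames are invented stand-ins, and the name leads_test is an assumption for the test dataset.

```r
library(recipes)

# Hypothetical stand-ins for the training and test datasets
leads_training <- data.frame(
  purchased = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_time = c(1273, 25, 100, 860, 10, 1500)
)
leads_test <- data.frame(
  purchased = factor(c("no", "yes", "no")),
  total_time = c(50, 2000, 400)
)

# Train on the training data only
leads_log_rec_prep <- recipe(purchased ~ ., data = leads_training) %>%
  step_log(total_time, base = 10) %>%
  prep(training = leads_training)

# Apply the trained steps to the test data
baked_test <- bake(leads_log_rec_prep, new_data = leads_test)
baked_test
```

Because the recipe was trained on the training data, the test data is only transformed and never used to estimate any step.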

13. Let's get baking!

Let's get baking!
