
Creating new features using domain knowledge

1. Creating new features using domain knowledge

Often, we can extract better information from raw data by representing it differently.

2. The importance of domain knowledge

Domain knowledge helps us identify and create features that are relevant to a particular model or task. Feature engineering is about creating new input features from existing ones, and the questions that guide it come from the field itself. In finance: what are the critical determinants of bankruptcy? In medicine: which pre-existing conditions are relevant to a specific treatment? In marketing: which features distinguish the behavior of one consumer group from another?

3. Creating variables based on professional experience

We want to predict hotel cancellations from a dataset containing the following features. The variables "StaysInWeekendNights" and "StaysInWeekNights" are informative as they are, while "arrival_date" is a raw date. We can, however, use it to create new features such as the day of the week, the week of the year, the month, and indicators for a few holidays.
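To make the idea concrete, here is a minimal sketch of deriving those features by hand from a raw date. It uses the lubridate package and three made-up arrival dates; neither the dates nor the column names come from the course data set, and the Christmas flag is just one illustrative holiday.

```r
library(lubridate)

# Hypothetical arrival dates, for illustration only
arrival_date <- as.Date(c("2016-12-23", "2016-12-25", "2016-07-04"))

features <- data.frame(
  arrival_date = arrival_date,
  # Day of the week as an ordered factor (weeks start on Sunday here)
  dow = wday(arrival_date, label = TRUE, week_start = 7),
  # Week of the year: complete 7-day periods since January 1, plus one
  week = week(arrival_date),
  # Month as a labelled factor
  month = month(arrival_date, label = TRUE),
  # A single holiday indicator: is the arrival on December 25?
  is_christmas = format(arrival_date, "%m-%d") == "12-25"
)

print(features)
```

A model can now treat "arrives on a Friday" or "arrives in December" as explicit signals instead of having to infer them from a raw date value.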

4. The tidymodels framework

"tidymodels" is a collection of modeling and machine learning packages using "tidyverse" principles. We will focus on a basic workflow throughout our exploration of feature engineering. We start by loading the data. Then, we declare our model. This step will prove beneficial as we develop more nuanced models. Next, we split the data into training and test sets and set up a recipe. Here is where feature engineering is done. Recipes gather variable creation steps, among other things. All this is bundled in a workflow we can fit and assess.

5. Setting up our data for analysis

Let's transform all character features into factors by using mutate(). We can use the combo "across(where())" along with "is_character" to apply the conversion to all character features. Now, we split our data into test and train sets using the tidymodels framework. We first create a "split" object and set "strata" equal to our target variable: "IsCanceled." Stratifying ensures that both data sets keep similar proportions of the target variable's values. The default split assigns 3/4 of the data to training. We can change it by specifying the value of the parameter "prop" in the "initial_split" function. Note that the proportions are similar!
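The conversion and the stratified split can be sketched as follows. The `hotels` tibble is a small synthetic stand-in invented for this example (the real course data is not reproduced here), and the sketch uses base R's `is.character` as the predicate and `prop = 0.8` as an arbitrary non-default value.

```r
library(tidymodels)

# Hypothetical stand-in for the hotel data: four bookings for every day of 2016
set.seed(123)
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
n <- length(dates) * 4
hotels <- tibble(
  Agent = sample(c("A1", "A2", "A3"), n, replace = TRUE),
  MarketSegment = sample(c("Direct", "Online", "Groups"), n, replace = TRUE),
  StaysInWeekNights = sample(0:5, n, replace = TRUE),
  arrival_date = rep(dates, times = 4),
  IsCanceled = ifelse(runif(n) < plogis(-1.5 + 0.4 * StaysInWeekNights), "yes", "no")
)

# Convert every character column to a factor in one step
hotels <- hotels %>%
  mutate(across(where(is.character), as.factor))

# Stratified split on the target; prop overrides the 3/4 default
set.seed(456)
split <- initial_split(hotels, prop = 0.8, strata = IsCanceled)
train <- training(split)
test <- testing(split)

# Compare the share of each target value in both sets
train_props <- train %>% count(IsCanceled) %>% mutate(prop = n / sum(n))
test_props <- test %>% count(IsCanceled) %>% mutate(prop = n / sum(n))
print(train_props)
print(test_props)
```

Because the split is stratified, the `prop` column should be nearly identical in the two printed tables.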

6. Building a workflow

We are ready to build a workflow. We start by declaring our model as a logistic regression. We then define a recipe by specifying a formula and a data set, and use pipes to add preprocessing and feature engineering steps. Recipes will be a key tool in our feature engineering exploration. The update_role function assigns the variable "Agent" the new role "ID," so we can keep it in the data frame for reference while the modeling process ignores it. step_date takes the "arrival_date" variable and creates new features: "day of the week," "week," and "month." This allows the model to identify days like Fridays or months like December explicitly. step_holiday creates a variable for each US holiday. Since we don't need arrival_date anymore, we can remove it using step_rm. Finally, we create dummy variables for all nominal predictors using step_dummy and the helper function all_nominal_predictors(). Printing lr_recipe, we get a summary of all our steps.
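A recipe along these lines might look like the sketch below. It runs on a synthetic `hotels` tibble made up for this example, and it passes only two holidays to step_holiday to keep the output small; the exact arguments used in the course are not shown in the transcript, so treat the `features` and `holidays` values as assumptions.

```r
library(tidymodels)

# Hypothetical stand-in for the hotel data: four bookings for every day of 2016
set.seed(123)
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
n <- length(dates) * 4
hotels <- tibble(
  Agent = sample(c("A1", "A2", "A3"), n, replace = TRUE),
  MarketSegment = sample(c("Direct", "Online", "Groups"), n, replace = TRUE),
  StaysInWeekNights = sample(0:5, n, replace = TRUE),
  arrival_date = rep(dates, times = 4),
  IsCanceled = ifelse(runif(n) < plogis(-1.5 + 0.4 * StaysInWeekNights), "yes", "no")
) %>%
  mutate(across(where(is.character), as.factor))

set.seed(456)
split <- initial_split(hotels, strata = IsCanceled)
train <- training(split)

# Declare the model: logistic regression (parsnip's default engine is glm)
lr_model <- logistic_reg()

lr_recipe <- recipe(IsCanceled ~ ., data = train) %>%
  # Keep Agent in the data for reference, but not as a predictor
  update_role(Agent, new_role = "ID") %>%
  # Derive day of week, week of year, and month from the raw date
  step_date(arrival_date, features = c("dow", "week", "month")) %>%
  # One indicator column per listed holiday (assumed subset)
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  # The raw date is no longer needed
  step_rm(arrival_date) %>%
  # Dummy-encode every nominal predictor, including the new dow and month
  step_dummy(all_nominal_predictors())

print(lr_recipe)

# Estimate the recipe on the training data and inspect the engineered columns
baked <- lr_recipe %>% prep() %>% bake(new_data = NULL)
print(names(baked))
```

Calling prep() and bake() is not required before fitting a workflow; it is used here only to inspect which columns the steps produce.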

7. Building a workflow

The "workflow" function bundles our model and recipe into a workflow object using "add_model" and "add_recipe". We can directly fit the workflow to the training data using "fit". Having a workflow object lets us reuse it, for example by training it on a different data set, without defining all of our transformations again. This also ensures that the same steps are applied consistently.

8. Building a workflow

We can view a summary of our model using "tidy".
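Bundling, fitting, and summarizing can be sketched end to end as below. As before, `hotels` is a synthetic data set invented for illustration, and the step_date and step_holiday arguments are assumptions rather than the course's exact settings.

```r
library(tidymodels)

# Hypothetical stand-in for the hotel data: four bookings for every day of 2016
set.seed(123)
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
n <- length(dates) * 4
hotels <- tibble(
  Agent = sample(c("A1", "A2", "A3"), n, replace = TRUE),
  MarketSegment = sample(c("Direct", "Online", "Groups"), n, replace = TRUE),
  StaysInWeekNights = sample(0:5, n, replace = TRUE),
  arrival_date = rep(dates, times = 4),
  IsCanceled = ifelse(runif(n) < plogis(-1.5 + 0.4 * StaysInWeekNights), "yes", "no")
) %>%
  mutate(across(where(is.character), as.factor))

set.seed(456)
split <- initial_split(hotels, strata = IsCanceled)
train <- training(split)

lr_model <- logistic_reg()

lr_recipe <- recipe(IsCanceled ~ ., data = train) %>%
  update_role(Agent, new_role = "ID") %>%
  step_date(arrival_date, features = c("dow", "week", "month")) %>%
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  step_rm(arrival_date) %>%
  step_dummy(all_nominal_predictors())

# Bundle the model and the recipe into a single workflow object
lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

# Fitting the workflow applies the recipe and then trains the model
lr_fit <- fit(lr_workflow, data = train)

# One row per model term: estimate, standard error, statistic, p-value
coefs <- tidy(lr_fit)
print(coefs)
```

Since the synthetic target was generated from StaysInWeekNights alone, that term should stand out in the printed table, while the date-derived terms should look like noise.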

9. Assessing model performance

We can use the "augment" function on the test data frame to assess our model's performance. "augment" adds predicted classes and class probabilities to the data, which we can use to compute metrics such as roc_auc and accuracy. We can plot the ROC curve using roc_curve and "autoplot".
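The assessment step can be sketched as follows, again on a synthetic `hotels` tibble made up for this example. One assumption worth flagging: the factor levels here are "no" and "yes" in that order, so the sketch sets `event_level = "second"` to score the "yes" probabilities; with differently coded data that argument may need to change.

```r
library(tidymodels)

# Hypothetical stand-in for the hotel data: four bookings for every day of 2016
set.seed(123)
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
n <- length(dates) * 4
hotels <- tibble(
  Agent = sample(c("A1", "A2", "A3"), n, replace = TRUE),
  MarketSegment = sample(c("Direct", "Online", "Groups"), n, replace = TRUE),
  StaysInWeekNights = sample(0:5, n, replace = TRUE),
  arrival_date = rep(dates, times = 4),
  IsCanceled = ifelse(runif(n) < plogis(-1.5 + 0.4 * StaysInWeekNights), "yes", "no")
) %>%
  mutate(across(where(is.character), as.factor))

set.seed(456)
split <- initial_split(hotels, strata = IsCanceled)
train <- training(split)
test <- testing(split)

lr_recipe <- recipe(IsCanceled ~ ., data = train) %>%
  update_role(Agent, new_role = "ID") %>%
  step_date(arrival_date, features = c("dow", "week", "month")) %>%
  step_holiday(arrival_date, holidays = c("ChristmasDay", "NewYearsDay")) %>%
  step_rm(arrival_date) %>%
  step_dummy(all_nominal_predictors())

lr_fit <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_recipe(lr_recipe) %>%
  fit(data = train)

# Add .pred_class, .pred_no, and .pred_yes columns to the test data
preds <- augment(lr_fit, new_data = test)

# Area under the ROC curve, treating "yes" (the second level) as the event
auc <- roc_auc(preds, truth = IsCanceled, .pred_yes, event_level = "second")
# Share of test rows whose predicted class matches the truth
acc <- accuracy(preds, truth = IsCanceled, estimate = .pred_class)
print(bind_rows(auc, acc))

# ROC curve as a ggplot object; print(roc_plot) draws it
roc_plot <- roc_curve(preds, truth = IsCanceled, .pred_yes, event_level = "second") %>%
  autoplot()
```

Both metrics are computed on the test set only, so they reflect how the engineered features perform on data the model has not seen.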

10. Let's practice!

It is time to put these ideas to work.