
Increasing the information content of raw data

1. Increasing the information content of raw data

Let us look at some ways to enhance raw data.

2. Dealing with raw data

Raw data rarely comes ready to use. One of the most common issues is missing values. Beyond the information these instances fail to provide, missing values can prevent some algorithms from running, and merely deleting incomplete rows can mean throwing away valuable information. There are many reasons why values can be missing, and a complete treatment of this problem is beyond the scope of this course. However, when values are missing completely at random, meaning there is no systematic reason why certain values are absent from the dataset, we can deal with them via simple imputation techniques. Even when values are not missing, we may want to replace them with a different representation that improves model performance and interpretability. A typical case is nominal values, which we can represent better numerically using dummy variables.
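As a toy illustration of dummy variables, here is a minimal R sketch using hypothetical data; model.matrix is base R's way of doing what a recipe step will automate for us later.

# Hypothetical data: one nominal variable with two levels
df <- data.frame(marital = factor(c("Married", "Single", "Married")))

# One dummy column per non-reference level; drop the intercept column
model.matrix(~ marital, data = df)[, -1, drop = FALSE]
# "Single" rows get 1, "Married" (the reference level) gets 0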

3. Dealing with raw data

Let us take a look at a concrete dataset.

4. The loans dataset

We are interested in automating loan viability assessment. The loans dataset has 614 instances of loans assessed by humans along with several characteristics of the applicants, including gender, marital status, number of dependents, loan amount, applicant income, and six others. From the table, we can see that some values are missing. Look, for example, at the Loan Amount in the first row.
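If you would like to inspect the data yourself, here is a quick sketch, assuming the loans data frame is already loaded in your session:

library(dplyr)

# Compact overview: one line per column with its type and first values
glimpse(loans)

# Number of missing values in each column
colSums(is.na(loans))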

5. Missing values

The "naniar" package provides us with an excellent visual way to identify missing values at a glance, using the "vis_miss" function with "loans" as an argument. While this gives us a good idea of the structure of the missing values, it needs to be more specific.

6. Missing values

By making the appropriate selection, we can zoom in to see only the columns with missing values. Values can be missing for various reasons that we must carefully understand before taking action. In our case, they seem to be missing completely at random (MCAR), so we can rely on traditional imputation methods.
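One way to make that selection (my assumption about the exact code; the slide may do it differently):

library(dplyr)
library(naniar)

# Keep only the columns that contain at least one missing value,
# then visualize just those
loans %>%
  select(where(~ any(is.na(.x)))) %>%
  vis_miss()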

7. Missing values and dummy variables

In building our recipe, we can take care of missing values. K Nearest Neighbors is a commonly used imputation method that considers the K closest neighbors of the instance with the missing value (the default is five) and assigns the average for continuous variables or the majority vote for nominal ones. "step_dummy" creates new dummy variables for the selected nominal predictors, all of them in our case. While we are at it, we will also update the role of "Loan_ID" to "ID" so we can use it for reference later without it interfering with the model's computations. Printing the "lr_recipe" object gives an excellent summary of our feature engineering steps.
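Assembled into code, the recipe might look like the sketch below; the outcome name Loan_Status and the training split loans_train are assumptions on my part:

library(tidymodels)

lr_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%  # Loan_Status: assumed outcome
  update_role(Loan_ID, new_role = "ID") %>%             # keep Loan_ID for reference only
  step_impute_knn(all_predictors(), neighbors = 5) %>%  # KNN imputation, 5 neighbors by default
  step_dummy(all_nominal_predictors())                  # dummy-code every nominal predictor

# Printing the recipe summarizes the feature engineering steps
lr_recipe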

8. Finding the right recipe step

We have covered a few steps that we can use to engineer our features and will cover many more in subsequent chapters. But the tidymodels framework provides us with an astonishingly large number of options beyond this course's scope. To explore potential steps to improve your model, you can take advantage of the "search recipe steps" tool in the online documentation.
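You can also list the step functions available in your installed version of recipes from the console, for example:

library(recipes)

# Every function in scope whose name starts with "step_"
apropos("^step_")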

9. Fitting and assessing our model

We are ready to fit our workflow to the training data and assess the model's performance. We can predict loan viability with nearly 80% accuracy and a ROC AUC of 0.73.
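A sketch of those last steps, assuming a workflow named lr_workflow, splits loans_train and loans_test, and that "Y" is the event level of the Loan_Status factor:

library(tidymodels)

# Fit the workflow (recipe + model) on the training data
lr_fit <- fit(lr_workflow, data = loans_train)

# Combine class predictions, class probabilities, and the truth
lr_preds <- predict(lr_fit, loans_test) %>%
  bind_cols(predict(lr_fit, loans_test, type = "prob")) %>%
  bind_cols(loans_test %>% select(Loan_Status))

# Accuracy and ROC AUC on the test set
accuracy(lr_preds, truth = Loan_Status, estimate = .pred_class)
roc_auc(lr_preds, truth = Loan_Status, .pred_Y)  # ".pred_Y" assumes "Y" is the event level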

10. Let's practice!

It is time to give it a shot.
