
Increasing the information content of raw data

1. Increasing the information content of raw data

Let us look at some ways to enhance raw data.

2. Dealing with raw data

Raw data rarely comes ready to use. One of the most common issues is missing values. Beyond the information these instances fail to provide, missing values can prevent some algorithms from running, and merely deleting incomplete rows can mean throwing away valuable information. There are many reasons why values can be missing, and a complete treatment of this problem is beyond the scope of this course. However, when values are missing completely at random, meaning there is no systematic reason why certain values are absent from the dataset, we can deal with them via simple imputation techniques. Even when values are not missing, we may want to replace them with a different representation that improves model performance and interpretability. A typical case is nominal values, which we can represent better numerically using dummy variables.
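As a toy illustration of dummy variables, here is a minimal R sketch using hypothetical data; model.matrix is base R's way of doing what a recipe step will automate for us later.

# Hypothetical data: one nominal variable with two levels
df <- data.frame(marital = factor(c("Married", "Single", "Married")))

# One dummy column per non-reference level; drop the intercept column
model.matrix(~ marital, data = df)[, -1, drop = FALSE]
# "Single" rows get 1, "Married" (the reference level) gets 0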

3. Dealing with raw data

Let us take a look at a concrete dataset.

4. The loans dataset

We are interested in automating loan viability assessment. The loans dataset has 614 instances of loans assessed by humans along with several characteristics of the applicants, including gender, marital status, number of dependents, loan amount, applicant income, and six others. From the table, we can see that some values are missing. Look, for example, at the Loan Amount in the first row.
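If you would like to inspect the data yourself, here is a quick sketch, assuming the loans data frame is already loaded in your session:

library(dplyr)

# Compact overview: one line per column with its type and first values
glimpse(loans)

# Number of missing values in each column
colSums(is.na(loans))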

5. Missing values

The "naniar" package provides us with an excellent visual way to identify missing values at a glance, using the "vis_miss" function with "loans" as an argument. While this gives us a good idea of the structure of the missing values, it needs to be more specific.

6. Missing values

By making the appropriate selection, we can zoom in to see only the columns with missing values. Values can be missing for various reasons that we must carefully understand before taking action. In our case, they seem to be missing completely at random (MCAR), so we can rely on traditional imputation methods.
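One way to make that selection (my assumption about the exact code; the slide may do it differently):

library(dplyr)
library(naniar)

# Keep only the columns that contain at least one missing value,
# then visualize just those
loans %>%
  select(where(~ any(is.na(.x)))) %>%
  vis_miss()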

7. Missing values and dummy variables

In building our recipe, we can take care of missing values. K Nearest Neighbors is a commonly used imputation method that considers the K closest neighbors of the instance with the missing value (the default is five) and assigns the average for continuous variables or the majority vote for nominal ones. "step_dummy" creates new dummy variables for the selected nominal predictors, all of them in our case. While we are at it, we will also update the role of "Loan_ID" to "ID" so we can use it for reference later without it interfering with the model's computations. Printing the "lr_recipe" object gives an excellent summary of our feature engineering steps.
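Assembled into code, the recipe might look like the sketch below; the outcome name Loan_Status and the training split loans_train are assumptions on my part:

library(tidymodels)

lr_recipe <- recipe(Loan_Status ~ ., data = loans_train) %>%  # Loan_Status: assumed outcome
  update_role(Loan_ID, new_role = "ID") %>%             # keep Loan_ID for reference only
  step_impute_knn(all_predictors(), neighbors = 5) %>%  # KNN imputation, 5 neighbors by default
  step_dummy(all_nominal_predictors())                  # dummy-code every nominal predictor

# Printing the recipe summarizes the feature engineering steps
lr_recipe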

8. Finding the right recipe step

We have covered a few steps that we can use to engineer our features and will cover many more in subsequent chapters. But the tidymodels framework provides us with an astonishingly large number of options beyond this course's scope. To explore potential steps to improve your model, you can take advantage of the "search recipe steps" tool in the online documentation.
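You can also list the step functions available in your installed version of recipes from the console, for example:

library(recipes)

# Every function in scope whose name starts with "step_"
apropos("^step_")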

9. Fitting and assessing our model

We are ready to fit our workflow to the training data and assess the model's performance. We can predict loan viability with nearly 80% accuracy and a ROC AUC of 0.73.
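A sketch of those last steps, assuming a workflow named lr_workflow, splits loans_train and loans_test, and that "Y" is the event level of the Loan_Status factor:

library(tidymodels)

# Fit the workflow (recipe + model) on the training data
lr_fit <- fit(lr_workflow, data = loans_train)

# Combine class predictions, class probabilities, and the truth
lr_preds <- predict(lr_fit, loans_test) %>%
  bind_cols(predict(lr_fit, loans_test, type = "prob")) %>%
  bind_cols(loans_test %>% select(Loan_Status))

# Accuracy and ROC AUC on the test set
accuracy(lr_preds, truth = Loan_Status, estimate = .pred_class)
roc_auc(lr_preds, truth = Loan_Status, .pred_Y)  # ".pred_Y" assumes "Y" is the event level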

10. Let's practice!

It is time to give it a shot.
