Selecting based on missing values

1. Selecting based on missing values

Welcome back! Let's dive deeper into feature selection and build on what we've learned about removing features with missing values.

2. Calculate missing values ratio

We've already worked with missing value counts, but a missing value ratio, the missing value count divided by the number of observations, is more helpful for establishing a cutoff for filtering. In R, we start by using nrow() to find n, the total number of observations. We calculate the missing value counts like we did before, using summarize() and across() with sum() and is.na(). Then we re-orient the counts vertically with pivot_longer(), naming the new columns feature and num_missing_values. Next, we use mutate() to create missing_val_ratio by dividing num_missing_values by n. We store the missing value ratios in missing_vals_df.
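The steps just described can be sketched as follows. This is a minimal sketch, not the course's exact code: the small credit_df built here is a hypothetical stand-in with made-up values, and it assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for credit_df; the values are made up for illustration.
credit_df <- tibble(
  age              = c(34, NA, NA, 51, NA),
  annual_income    = c(52000, 61000, 48000, NA, 75000),
  outstanding_debt = c(NA, NA, NA, NA, 1200),
  credit_score     = c("Good", "Poor", "Standard", "Good", "Poor")
)

# Total number of observations
n <- nrow(credit_df)

missing_vals_df <- credit_df %>%
  # Count the missing values in every column
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  # Re-orient vertically: one row per feature
  pivot_longer(
    cols = everything(),
    names_to = "feature",
    values_to = "num_missing_values"
  ) %>%
  # Turn each count into a ratio of the total observations
  mutate(missing_val_ratio = num_missing_values / n)

missing_vals_df
```

With this toy data, each ratio is simply the number of NA values in the column divided by five rows.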

3. Missing values ratio output

Notice how age and outstanding_debt are missing sixty-one point three percent and ninety-four point two percent of their values, respectively.

4. Rules of thumb for missing value ratio threshold

Unfortunately, there are no strict rules for cutoffs because the right threshold depends on feature importance. For example, if we use credit_df to predict the creditworthiness of individuals, which would be more informative: outstanding debt or age? Outstanding debt likely informs creditworthiness more than age. If age has low feature importance, we might remove it regardless of its missing value ratio, while outstanding_debt might be very important, so we may be inclined to keep it despite a high missing value ratio. But at what point is a feature missing too many values to be helpful? Here are some rules of thumb. If less than twenty percent of the values are missing, keep the feature. If twenty to eighty percent of the values are missing, consider the feature importance. If more than eighty percent are missing, discard it.
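To make the rules of thumb concrete, here is one way we might label each feature by its missing value ratio. This is only a sketch: the feature names and ratios are hypothetical, the suggestion column is our own invention, and the labels are a starting point rather than a decision.

```r
library(dplyr)

# Hypothetical features and ratios, made up for illustration.
missing_vals_df <- tibble(
  feature           = c("feature_a", "feature_b", "feature_c"),
  missing_val_ratio = c(0.05, 0.45, 0.90)
)

rule_of_thumb_df <- missing_vals_df %>%
  mutate(
    # case_when() checks the conditions in order and uses the first match
    suggestion = case_when(
      missing_val_ratio < 0.2  ~ "keep",
      missing_val_ratio <= 0.8 ~ "consider feature importance",
      TRUE                     ~ "discard"
    )
  )

rule_of_thumb_df
```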

5. Create the missing values filter

To create a filter with the ratios in missing_vals_df, we pipe them to filter() and, for simplicity, keep features with a missing_val_ratio less than or equal to fifty percent. In reality, we would tailor the threshold to each feature. Then we pull() the feature column, so missing_vals_filter contains the column names of the features we will keep.
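A minimal sketch of that filter might look like this. The ratios for age and outstanding_debt come from the output shown earlier; the ratios for the other features are placeholders we made up so the example runs on its own.

```r
library(dplyr)

# Ratios for age and outstanding_debt are from the earlier output;
# the remaining ratios are hypothetical placeholders.
missing_vals_df <- tibble(
  feature = c("age", "annual_income", "num_of_loan",
              "outstanding_debt", "credit_score"),
  missing_val_ratio = c(0.613, 0.10, 0.00, 0.942, 0.00)
)

missing_vals_filter <- missing_vals_df %>%
  # Keep features missing at most fifty percent of their values
  filter(missing_val_ratio <= 0.5) %>%
  # Extract the feature names as a character vector
  pull(feature)

missing_vals_filter
```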

6. Apply missing values filter

We can apply missing_vals_filter using dplyr's select() function and store the data frame in filtered_credit_df. Here we show the first three rows of filtered_credit_df.
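Applying the filter could look like the sketch below. The credit_df here is a hypothetical three-row stand-in, and missing_vals_filter is written out by hand so the example is self-contained. Wrapping the vector in all_of() is our assumption about how to pass it: it is the tidyselect helper for selecting columns from a character vector of names.

```r
library(dplyr)

# Hypothetical stand-in for credit_df; the values are made up for illustration.
credit_df <- tibble(
  age              = c(34, NA, NA),
  annual_income    = c(52000, 61000, 48000),
  num_of_loan      = c(2, 0, 4),
  outstanding_debt = c(NA, NA, 1200),
  credit_score     = c("Good", "Poor", "Standard")
)

# Column names to keep, written out by hand for this sketch.
missing_vals_filter <- c("annual_income", "num_of_loan", "credit_score")

filtered_credit_df <- credit_df %>%
  # all_of() selects exactly the columns named in the character vector
  select(all_of(missing_vals_filter))

# Show the first three rows
head(filtered_credit_df, 3)
```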

7. The tidymodels approach

But that's a lot of work, isn't it? Fortunately, the recipes package from tidymodels has a recipe step called step_filter_missing(). Let's see how it works. First, we create a recipe object and store it in missing_vals_recipe. In recipe(), we define our model with a formula where credit_score is the target variable and all other variables are predictors. The data parameter tells the recipe to train on credit_df. Then we add step_filter_missing() to the recipe. all_predictors() instructs the recipe to apply the filter only to predictors, not the target variable. The threshold parameter specifies the missing value cutoff: the step removes features whose missing value ratio exceeds fifty percent, so we keep those at or below it. Then we call prep() to train the recipe. To apply the recipe, we call the bake() function, pass it missing_vals_recipe, and set new_data to NULL, which applies the recipe to the same data set the recipe was trained on, that is, credit_df.
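Here is a sketch of the recipe just described, assuming the recipes package is installed in a version that includes step_filter_missing(). As before, credit_df is a hypothetical stand-in with made-up values, not the course's data.

```r
library(dplyr)
library(recipes)

# Hypothetical stand-in for credit_df; the values are made up for illustration.
credit_df <- tibble(
  age              = c(34, NA, NA, 51, NA),
  annual_income    = c(52000, 61000, 48000, NA, 75000),
  num_of_loan      = c(2, 0, 4, 1, 3),
  outstanding_debt = c(NA, NA, NA, NA, 1200),
  credit_score     = c("Good", "Poor", "Standard", "Good", "Poor")
)

missing_vals_recipe <-
  # credit_score is the target; every other column is a predictor
  recipe(credit_score ~ ., data = credit_df) %>%
  # Remove predictors whose share of missing values exceeds the threshold
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  # Train the recipe on credit_df
  prep()

# new_data = NULL returns the processed training data
filtered_credit_df <- bake(missing_vals_recipe, new_data = NULL)

filtered_credit_df
```

In this toy data, age and outstanding_debt are missing more than half of their values, so they are the columns the step drops.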

8. Baked recipe output

Here are the first five rows of filtered_credit_df. We see that the filter step removed all but annual_income, num_of_loan, and credit_score, just like in our manual example.

9. Let's practice!

Now, let's practice!
