
The Importance of Dimensionality Reduction in Data and Model Building

1. The Importance of Dimensionality Reduction in Data and Model Building

Now that we've gained an intuition for how to identify features with little information, let's learn about the curse of dimensionality and the problems it causes related to bias and overfitting. This will further motivate the need for dimensionality reduction.

2. The curse of dimensionality

The curse of dimensionality refers to the fact that even a small increase in dimensionality requires an exponential increase in data volume, that is, an exponential increase in the number of observations. If we don't get an exponential increase in observations, then our data suffers from sparsity, which can lead to bias and overfitting. Let's illustrate this with some examples. Here we have a simple data set containing gender and veteran status.

3. The curse of dimensionality

Each feature has two distinct values. Gender can be female or male. Veteran can be yes or no. To represent all combinations, we need four observations.

4. The curse of dimensionality

If we add blood type, which has four possible values (A, B, AB, and O), we now need sixteen observations.

5. The curse of dimensionality

The product of two values for gender, two values for veteran, and four values for blood type is sixteen combinations. If we continue adding dimensions, we can see how a small increase in dimensionality requires an exponential increase in number of observations.
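For reference, the counts above are just the product of the number of distinct values each feature can take; as a quick sketch in R:

```r
# Required observations = product of the number of distinct values per feature
2 * 2      # gender (2) x veteran (2)                   -> 4 observations
2 * 2 * 4  # gender (2) x veteran (2) x blood type (4)  -> 16 observations
```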

6. Sparsity

Now, let's illustrate sparsity. Here are all sixteen possible combinations of values for gender, veteran, and blood type.

7. Sparsity

If we collected sixteen observations in the real world, we would be unlikely to end up with a sample that represents all sixteen combinations.
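As a rough illustration of that claim (a simulation sketch, not code from the lesson; the column names and the uniform sampling are assumptions), we could draw sixteen random observations and count how many distinct combinations actually show up:

```r
set.seed(1)  # arbitrary seed, only so the illustration is reproducible

# Simulate 16 "real-world" observations by sampling each feature at random
obs <- data.frame(
  gender     = sample(c("female", "male"), 16, replace = TRUE),
  veteran    = sample(c("yes", "no"), 16, replace = TRUE),
  blood_type = sample(c("A", "B", "AB", "O"), 16, replace = TRUE)
)

# Distinct combinations present in the sample: typically well under the
# 16 possible combinations, which is data sparsity
nrow(unique(obs))
```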

8. Sparsity

Some combinations would not be represented in the sample. This is data sparsity. If the data on the right were our training data set, our machine learning model would be biased towards the combinations present in the training set. It would know nothing about the missing combinations. That's the definition of bias, and it's also a form of overfitting: our model would only learn the patterns found in the training set and would not generalize well to other data containing never-before-seen combinations.

9. Sparsity: training and test sets

Furthermore, when we split data for model building, we need all sixteen combinations to appear at least once, but preferably many times, in both the training and testing sets.

10. Sparsity: training and test sets

That means the original data set needs at least thirty-two observations, and that's a minimum.

11. Sparsity: training and test sets

Ideally, for good model training, we'd like each combination to appear many times. Let's assume four times. Then the training and testing sets would each need sixty-four observations.

12. Sparsity: training and test sets

The full data set would need at least one hundred twenty-eight observations. And that's for only three dimensions! What if we had one hundred, or a thousand, dimensions? The curse of dimensionality becomes enormous.
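Putting those numbers together, the arithmetic looks like this (a sketch of the calculation just described):

```r
combinations <- 2 * 2 * 4  # 16 feature-value combinations
repeats      <- 4          # desired appearances of each combination per set
n_sets       <- 2          # a training set and a testing set

combinations * repeats           # 64 observations per set
combinations * repeats * n_sets  # 128 observations in the full data set
```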

13. Calculate minimum number of observations

Now, let's turn our attention to calculating the minimum number of observations given the dimensionality of the data and the number of values each feature can take on. We use expand_grid() to store all sixteen possible combinations into blood_type_df.
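The setup might look something like this; the exact column names and value labels are assumptions, since the lesson only names the features and their possible values.

```r
library(tidyr)

# All possible combinations of the three features
blood_type_df <- expand_grid(
  gender     = c("female", "male"),
  veteran    = c("yes", "no"),
  blood_type = c("A", "B", "AB", "O")
)

nrow(blood_type_df)  # 16
```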

14. Calculate minimum number of observations

We pipe the data frame blood_type_df to summarize() and use the everything() selector to apply the same calculation to every column: unique() removes duplicate values, and length() counts the unique values that remain. This gives us a one-row data frame with the count for each feature, which we pipe to prod() to multiply the counts together and get the number of feature-value combinations. In this case, that's sixteen, the same number of combinations we created with expand_grid() previously.
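One way to write that pipeline (a sketch using across() with the everything() selector; the lesson's exact code may differ slightly):

```r
library(dplyr)

# Count the unique values in each column, then multiply the counts together
blood_type_df %>%
  summarize(across(everything(), ~ length(unique(.x)))) %>%
  prod()
#> [1] 16
```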

15. Multiple representations of each combination

But remember that we ideally want each combination to appear many times in both the training and testing sets. So we'd multiply the number of unique combinations by, say, four.
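Continuing the same sketch, we could store that product and scale it by the desired number of appearances per set (four, matching the example above):

```r
# Reuses blood_type_df and dplyr from the previous sketch
n_combinations <- blood_type_df %>%
  summarize(across(everything(), ~ length(unique(.x)))) %>%
  prod()

n_combinations * 4  # 64: target size for each of the training and testing sets
```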

16. Let's practice!

Let's practice.
