
The curse of dimensionality

1. The curse of dimensionality

As a data scientist, you'd rather have a dataset with a lot of features than one with just a few. There is, however, one downside to high-dimensional datasets that you may not have considered thus far: models tend to overfit badly on high-dimensional data. The solution is of course to reduce dimensionality, but which features should you drop? In this chapter we'll look into detecting low-quality features and how to remove them. But first, we'll build some intuition on why models overfit.

2. From observation to pattern

Let's illustrate this with an example. Say we want to predict the city in which a house is located based on some features of that house. But all we have to train the model is this tiny dataset with the house price in millions of euros for two observations,

3. From observation to pattern

of which we can make a pretty dull distribution plot. Surely the model will overfit: it will just memorize the two training examples instead of deriving a general pattern for houses in each city. If we want the model to generalize, we need to give it more observations of house prices for each city.

4. From observation to pattern

When we expand to 1000 observations per city the price distributions become clear and a model should be able to train on this data without overfitting.
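As a rough sketch of what that expanded dataset and its distribution plot could look like, assuming a hypothetical house_df DataFrame with synthetic prices and made-up city names (not the course's actual data):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical synthetic data: 1000 house prices (in million EUR) per city.
rng = np.random.default_rng(0)
house_df = pd.DataFrame({
    "price": np.concatenate([rng.normal(0.5, 0.10, 1000),
                             rng.normal(0.7, 0.15, 1000)]),
    "city": ["City A"] * 1000 + ["City B"] * 1000,
})

# Overlaid price distributions, one per city.
sns.histplot(data=house_df, x="price", hue="city", kde=True)
plt.xlabel("House price (million EUR)")
plt.show()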

5. Building a city classifier - data split

To test this assumption we'll build a model. But before we do so, we'll split the data into a 70% train and 30% test set using scikit-learn's train_test_split() function.
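A minimal sketch of that split, continuing with the hypothetical house_df from the sketch above:

from sklearn.model_selection import train_test_split

# Single feature (price) and the city labels we want to predict.
X = house_df[["price"]]
y = house_df["city"]

# 70% train / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)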

6. Building a city classifier - model fit

We'll then instantiate a classifier model, in this case a support vector machine classifier, and fit it to the training data.
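A sketch of that step, under the same assumptions as above:

from sklearn.svm import SVC

# Instantiate a support vector machine classifier and fit it to the training data.
svc = SVC()
svc.fit(X_train, y_train)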

7. Building a city classifier - predict

We can then assess the accuracy of the model on the test set, that is, the 30% of the original data that the model didn't see during training. Our model is able to assign 82.6% of unseen houses to the correct city. From our visual exploration this could be expected, since there was quite a bit of overlap in the single feature our model was trained on. If we want to know whether our model overfitted to the dataset, we can have a look at the accuracy on the training set. If this accuracy is much higher than that on the test set, we can conclude that the model didn't generalize well but simply memorized all training examples. Fortunately, that is not the case here.
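A sketch of both accuracy checks, again on the hypothetical synthetic data (the 82.6% figure comes from the course's own dataset, so this sketch won't reproduce it exactly):

# Accuracy on the 30% test set the model never saw during training.
print(svc.score(X_test, y_test))

# Accuracy on the training set; a much higher value here than on the
# test set would indicate that the model memorized the training examples.
print(svc.score(X_train, y_train))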

8. Adding features

If we want to improve the accuracy of our model, we'll have to add features to the dataset, so that when the price of a house doesn't allow us to derive its location, something else will.

9. Adding features

Features like the number of floors, the number of bathrooms, or the surface area of the house could all be helpful. However, with each feature that we add, we should also increase the number of observations of houses in our dataset. If we don't, we'll end up with a lot of unique combinations of features that models can easily memorize and thus overfit to. In fact, to avoid overfitting, the number of observations should increase exponentially with the number of features. Since this becomes really problematic for high-dimensional datasets, this phenomenon is known as the curse of dimensionality. The solution, of course, is to apply dimensionality reduction.
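A quick back-of-the-envelope illustration of that exponential growth, assuming for simplicity that each feature takes one of k distinct values:

# With d features of k distinct values each, the number of possible
# feature combinations (and thus the observations needed to cover them)
# grows as k ** d, i.e. exponentially in d.
k = 10
for d in range(1, 6):
    print(d, "features ->", k ** d, "possible combinations")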

10. Let's practice!

But first, it's your turn to test a model for overfitting.