Exercise

Visitors from outer space

So you have concluded that you must resort to dimensionality reduction because of the very limited computational resources available for crunching your hyper-dimensional dataset.

And for the same reason, you feel that the PCA algorithm is the best choice due to its speed and simplicity.

Good. But did you check your data for outliers? Let's see how they could impact your results.

A 3-dimensional dataset of 1000 samples (X_raw), slightly "contaminated" with 5 outliers (X_new), has been pre-loaded, as shown in Figure 1.

In Figure 2, you can see that the impact of these outliers (in red) is negligible and causes no problem in extracting the actual principal components.

But what happens if they are further away?

Instructions 1/2

  • Add 5 outliers with an outlier_distance of 200.
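To see why distance matters, here is a minimal sketch of the idea rather than the exercise's actual solution. It assumes a synthetic stand-in for X_raw (the real pre-loaded data is not shown), a hypothetical helper add_outliers (the exercise's own function for this step is not given here), and PCA computed by hand with a NumPy SVD instead of a library class. It compares the first principal component before and after adding 5 outliers at an outlier_distance of 200.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a synthetic stand-in for the pre-loaded X_raw.
# 1000 samples in 3-D, with most of the variance along the first axis.
X_raw = rng.normal(size=(1000, 3)) * np.array([10.0, 3.0, 1.0])


def add_outliers(X, n_outliers, outlier_distance, rng):
    """Hypothetical helper: append n_outliers points shifted by
    outlier_distance along the last axis (the low-variance direction)."""
    outliers = rng.normal(size=(n_outliers, X.shape[1]))
    outliers[:, -1] += outlier_distance
    return np.vstack([X, outliers])


def first_component(X):
    """First principal component via SVD of the mean-centered data."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return vt[0]


X_new = add_outliers(X_raw, n_outliers=5, outlier_distance=200, rng=rng)

pc_raw = first_component(X_raw)
pc_new = first_component(X_new)

# Absolute cosine between the two unit vectors:
# close to 1 means the direction is unchanged, close to 0 means it rotated.
alignment = abs(pc_raw @ pc_new)
print("first component without outliers:", pc_raw)
print("first component with outliers:   ", pc_new)
print("alignment:", alignment)
```

In this sketch, only 5 points out of 1005 are enough to pull the first component away from the direction that dominates the clean data, because PCA maximizes variance and squared distances grow quickly. Trying smaller values of outlier_distance shows where the effect stops being negligible for this particular synthetic dataset.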