1. Principal Component selection
In the previous lesson we saw how you can set the number of components that the PCA algorithm should calculate.
An alternative technique is to tell PCA the minimal proportion of variance we want to keep and let the algorithm decide on the number of components it needs to achieve that.
2. Setting an explained variance threshold
We can do this by passing a number between 0 and 1 to the n_components parameter of PCA. When we pass it 0-point-9, it will select enough components to explain 90% of the variance. This turns out to be 5 for the Pokemon data.
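As a quick sketch, this is what that could look like in scikit-learn, assuming the standardized Pokemon features live in a hypothetical array called X_std:

    from sklearn.decomposition import PCA

    # Keep enough components to explain 90% of the variance
    pca = PCA(n_components=0.9)
    pca.fit(X_std)  # X_std: hypothetical array of standardized features

    # The number of components PCA decided to keep
    print(pca.n_components_)  # 5 for the Pokemon data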
One problem is that whether we set the number of components as an integer or as a proportion of variance, we're still just picking these numbers by gut feeling.
The fact is, there is no single right answer to the question "how many components should I keep?", since it depends on how much information you are willing to sacrifice to reduce complexity.
3. An optimal number of components
There is, however, a trick that can help you find a good balance.
When you plot the explained variance ratio of a fitted PCA instance, you'll see that most of the explained variance is concentrated in the first few components. As you go from left to right in this type of plot, you'll often see that the explained variance ratio per component levels off quite abruptly. The location where this shift happens is known as
4. An optimal number of components
the 'elbow' in the plot. And it typically gives you a good starting point for the number of components to keep. Do note that the x-axis shows you the index of the components and not the total number. So since the elbow is at the component with index 1 here, we'd select 2 components.
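As a sketch, assuming the same hypothetical X_std array, such a plot could be produced like this:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Fit PCA without limiting the number of components
    pca = PCA().fit(X_std)  # X_std: hypothetical array of standardized features

    # Explained variance ratio per component index
    plt.plot(pca.explained_variance_ratio_, marker='o')
    plt.xlabel('Component index')
    plt.ylabel('Explained variance ratio')
    plt.show()

    # If the elbow sits at index 1, we keep 2 components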
5. PCA operations
Up until now, we've seen how you can use PCA to go from an input feature dataset X to a NumPy array of principal components pc, either in two operations, by first fitting pca to the data and then transforming that data,
6. PCA operations
or in one go with the .fit_transform() method.
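As a minimal sketch, with X again being a hypothetical feature array:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)

    # Two operations: fit, then transform
    pca.fit(X)
    pc = pca.transform(X)

    # Or both in one go
    pc = pca.fit_transform(X)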
A final trick that I'd like to teach you, is how to go back from the principal components to the original feature space.
7. PCA operations
This can be done by calling the .inverse_transform() method on the principal component array. Because you typically lose information when going from the original features to the principal components, you'll see that the result of going back to the original feature space will have changed somewhat.
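Sketched in code, continuing from the fitted pca above:

    # Map the principal components back to the original feature space
    X_rebuilt = pca.inverse_transform(pc)

    # X_rebuilt has the original number of columns, but some information
    # is lost whenever n_components is smaller than the original dimensionality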
8. Compressing images
An application where this is relevant is image compression. Let's have a look at the "Labeled Faces in the Wild" dataset.
The 15 images you see here are the test set.
9. Compressing images
It's a two-dimensional NumPy array with 15 rows of 2914 elements each. These elements correspond to the grayscale value of a pixel in the 62 by 47 pixel images.
Our training set contains 1333 such images.
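One way to obtain a dataset with these dimensions is scikit-learn's fetch_lfw_people() loader; the parameter choice and the train/test split below are assumptions for illustration, not necessarily what was used here:

    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split

    # Each image is 62 x 47 pixels = 2914 grayscale values per row
    faces = fetch_lfw_people(min_faces_per_person=60)  # assumed parameter
    print(faces.data.shape)

    # Hypothetical split into a training set and a small 15-image test set
    X_train, X_test = train_test_split(faces.data, test_size=15, random_state=0)
    print(X_train.shape, X_test.shape)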
10. Compressing images
We can build a pipeline where we tell PCA to select only 290 components and then fit this pipeline to our training data. If we then use this fitted model to transform the unseen test data, we'll get a 10-fold reduction in the number of features. We could now save our images using 10 times less disk space!
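A sketch of such a pipeline, reusing the hypothetical X_train and X_test arrays from above; the scaling step is an assumption:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Scale the pixel values, then reduce 2914 features to 290 components
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reducer', PCA(n_components=290)),
    ])

    pipe.fit(X_train)
    pc_test = pipe.transform(X_test)
    print(pc_test.shape)  # (15, 290): roughly a 10-fold reduction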
11. Rebuilding images
Finally, we can perform the inverse transform operation to rebuild our pixels from the principal components. We then use a custom-made img_plotter() function to create this output.
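img_plotter() is a custom helper, so its exact behavior isn't shown here; a minimal stand-in with matplotlib could look like this, with all plotting details assumed:

    import matplotlib.pyplot as plt

    # Go back from 290 principal components to 2914 pixel values per image
    X_rebuilt = pipe.inverse_transform(pc_test)

    # Minimal stand-in for the course's custom img_plotter() helper
    def img_plotter(pixel_rows, shape=(62, 47)):
        fig, axes = plt.subplots(1, len(pixel_rows), figsize=(15, 2))
        for ax, pixels in zip(axes, pixel_rows):
            ax.imshow(pixels.reshape(shape), cmap='gray')
            ax.axis('off')
        plt.show()

    img_plotter(X_rebuilt)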
12. Rebuilding images
While there is quality loss, the result is not bad.
13. Let's practice!
Now it's your turn to practice these techniques.