1. Dimensionality reduction: visualization techniques
Welcome back! In this video we're going to go over dimensionality reduction from a slightly different perspective, one that covers techniques for visualization.
2. Why dimensionality reduction?
We didn't get the chance to cover exactly why dimensionality reduction is such a good idea in the last lesson, only the consequences of the curse of dimensionality as the number of features increases. These are the three biggest reasons to perform dimensionality reduction on a dataset with high dimensionality. First, it speeds up the training of machine learning models, because fewer dimensions mean algorithms simply run faster. Second, it helps us visualize the data, since visualizing more than 3 dimensions is troublesome. Third, it improves the accuracy of our trained models, because removing unimportant and collinear information means less noise and redundancy and, in turn, more accurately trained models.
3. Visualization techniques
We already talked about PCA in the last lesson, so here we'll focus on how to use it for visualization. We'll also discuss t-distributed stochastic neighbor embedding, or t-SNE for short. And you'll get to practice some PCA visualizations in the exercises that follow.
4. Visualizing with PCA
Recall that the first principal component is a linear combination of the original features that captures the maximum variance in the dataset and determines the direction of highest variability, shown in this plot as the red vector. This vector minimizes the sum of squared distances between each data point and its projection onto the vector. The second principal component is also a linear combination of the original features; it captures the maximum remaining variance in the dataset and is uncorrelated with the first principal component, as demonstrated by the fact that the green vector is perpendicular to the red one.
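As a minimal sketch of this idea in code, here is what projecting onto the first two principal components and plotting them might look like. This uses scikit-learn's built-in breast cancer dataset as a stand-in for the course data, so the variable names are illustrative rather than the exact code from the course.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 30 numeric features and a binary target
X, y = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, alpha=0.5)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()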
5. Scree plot
Recall that you printed out the variance explained by the principal components in the last lesson using the explained_variance_ratio_ attribute. Plotting the explained variance ratio of each principal component creates what is called a scree plot. This helps to visually determine how many principal components it takes to describe most of the variance, or information, contained in the dataset, which in turn informs the optimal number of principal components to take forward for modeling. We'll discuss modeling with principal components later in the course.
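As a rough sketch (again using a built-in scikit-learn dataset as a stand-in for the course data), a scree plot can be built directly from explained_variance_ratio_:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see how the explained variance accumulates
pca = PCA().fit(X_scaled)

components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()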
6. t-SNE
Unlike PCA, which is a linear algebraic technique, t-distributed stochastic neighbor embedding, which I'll call t-SNE going forward, is a probabilistic one. It takes pairs of data points in the high-dimensional space, computes the probability that they are similar, and then chooses a low-dimensional embedding that produces a similar distribution of pairwise similarities. These embeddings can then be visualized.
7. Visualizing with t-SNE
Since t-SNE takes a lot of computing power, you won't practice it in the exercises, but here is a code snippet that you can use to practice t-SNE locally in a Jupyter notebook if you'd like. This creates a t-SNE model and then a corresponding visualization using the loans dataset we've been working with. For more info, see the t-SNE documentation link at the bottom of the slide.
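The slide's exact snippet isn't reproduced in this transcript, but a minimal sketch along the same lines might look like this. It uses scikit-learn's breast cancer dataset as a binary-target stand-in for the loans data, so the dataset and column choices are assumptions rather than the course's actual code.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Binary-target stand-in for the loans data
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Embed the scaled features into two t-SNE dimensions
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)

# Color the points by the binary target, analogous to loan status
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, alpha=0.5)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.show()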
8. Visualizing with t-SNE
And this is the scatterplot the code on the previous slide creates, with the high-dimensional data embedded into two t-SNE dimensions. The gray corresponds to Fully Paid and blue to Charged Off loan status. Remember, the goal here is to create 2 features with the most predictive power, so the more separated the groupings are, the better. There is some slight separation but also a lot of overlap, indicating that, at least for this dataset, we'll need to take more care if we are to successfully build an accurate predictive model.
9. PCA vs t-SNE digits data
Finally, here is a comparison of PCA and t-SNE using the digits dataset. As you can see, t-SNE is much better at class separation here. This is also a good demonstration of the fact that visualizing more than 3 dimensions in a single plot is not generally recommended: each component is one plot dimension, while the colors serve as the third. Additional dimensions would simply get lost.
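For reference, a comparison along these lines can be sketched with scikit-learn's digits dataset. This is not the exact code behind the slide, just one way to produce side-by-side PCA and t-SNE projections.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Two-dimensional projections from each technique
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

# Plot the projections side by side, colored by digit class
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=8)
ax1.set_title('PCA')
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=8)
ax2.set_title('t-SNE')
plt.show()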
10. Let's practice!
Alright, now it's your turn to practice visualizations with PCA.