1. t-SNE for 2-dimensional maps
In this video, you'll learn about an unsupervised learning method for visualization called "t-SNE".
2. t-SNE for 2-dimensional maps
t-SNE stands for "t-distributed stochastic neighbor embedding". It has a complicated name, but it serves a very simple purpose. It maps samples from their high-dimensional space into a 2- or 3-dimensional space so they can be visualized. While some distortion is inevitable, t-SNE does a great job of approximately representing the distances between the samples. For this reason, t-SNE is an invaluable visual aid for understanding a dataset.
3. t-SNE on the iris dataset
To see what sorts of insights are possible with t-SNE, let's look at how it performs on the iris dataset. The iris samples are in a four-dimensional space, where each dimension corresponds to one of the four iris measurements, such as petal length and petal width. Now, t-SNE was given only the measurements of the iris samples. In particular, it wasn't given any information about the three species of iris. But if we color the species differently on the scatter plot, we see that t-SNE has kept the species separate.
4. Interpreting t-SNE scatter plots
This scatter plot gives us a new insight, however. We learn that there are two iris species, versicolor and virginica, whose samples are close together in space. So it could happen that the iris dataset appears to have two clusters, instead of three. This is compatible with our previous examples using k-means, where we saw that a clustering with 2 clusters also had relatively low inertia, meaning tight clusters.
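The inertia comparison mentioned here can be checked with a short sketch. It assumes scikit-learn's bundled copy of the iris data stands in for the samples used in the video; the variable names, the choice of cluster counts, and the random_state are illustrative and not taken from the course.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the course's samples.
samples = load_iris().data

# Fit k-means for a few cluster counts and record the inertia of each fit.
inertias = {}
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"{k} cluster(s): inertia = {inertia:.1f}")
```

If the drop in inertia from one cluster to two is much larger than the drop from two to three, that is consistent with the t-SNE picture of one well-separated species and two species that sit close together.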
5. t-SNE in sklearn
t-SNE is available in scikit-learn, but it works a little differently to the fit/transform components you've already met. Let's see it in action on the iris dataset. The samples are in a 2-dimensional numpy array, and there is a list giving the species of each sample.
6. t-SNE in sklearn
To start with, import TSNE and create a TSNE object. Apply the fit_transform method to the samples, and then make a scatter plot of the result, coloring the points using the species. There are two aspects that deserve special attention: the fit_transform method, and the learning rate.
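The steps just described might look like the following minimal sketch. It assumes scikit-learn's bundled iris data in place of the samples and species list from the video, and the learning_rate and random_state values are illustrative choices rather than values confirmed by the source.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Assumption: the bundled iris data stands in for the video's samples and species.
iris = load_iris()
samples = iris.data      # 2-dimensional NumPy array, one row per sample
species = iris.target    # integer species label for each sample

# Create the model and compute the 2-dimensional map in one step.
model = TSNE(learning_rate=100, random_state=0)
transformed = model.fit_transform(samples)

# Scatter plot of the two t-SNE coordinates, colored by species.
xs = transformed[:, 0]
ys = transformed[:, 1]
plt.scatter(xs, ys, c=species)
plt.title("t-SNE map of the iris samples")
```

In a script, call plt.show() afterwards to display the figure; in a notebook it is usually rendered automatically.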
7. t-SNE has only fit_transform()
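In scikit-learn, TSNE has no transform method. As you might expect, the fit_transform method simultaneously fits the model and transforms the data, and it is the only way to get a t-SNE map out of the model: you cannot fit on one set of samples and then transform another. (The class does have a fit method, but with no transform method there is nothing to apply to new data afterwards.) This means that you can't extend a t-SNE map to include new samples. Instead, you have to start over each time.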
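A small sketch can make this concrete. It first checks which objects expose a transform method, using StandardScaler as an example of an ordinary fit/transform component, and then shows the "start over" workflow. The split of the iris data into "existing" and "new" samples is hypothetical and only there for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Compare an ordinary fit/transform component with TSNE.
print("StandardScaler has transform:", hasattr(StandardScaler(), "transform"))
print("TSNE has transform:", hasattr(TSNE(), "transform"))

# Hypothetical scenario: 140 samples are available first, 10 more arrive later.
samples = load_iris().data
original_samples = samples[:140]
new_samples = samples[140:]

# Map of the original samples only.
first_map = TSNE(learning_rate=100, random_state=0).fit_transform(original_samples)

# The new samples cannot be added to first_map; stack all samples
# and compute a completely new map instead.
all_samples = np.vstack([original_samples, new_samples])
second_map = TSNE(learning_rate=100, random_state=0).fit_transform(all_samples)

print(first_map.shape, second_map.shape)
```

Note that the coordinates of the original samples in the second map are not guaranteed to match those in the first map, since the whole embedding is recomputed.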
8. t-SNE learning rate
The second thing to notice is the learning rate. The learning rate makes the use of t-SNE more complicated than some other techniques. You may need to try different learning rates for different datasets. It is clear, however, when you've made a bad choice, because all the samples appear bunched together in the scatter plot. Normally it's enough to try a few values between 50 and 200.
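Trying a few learning rates side by side could be sketched as follows. The three values are simply picked from the 50 to 200 range mentioned above, the iris data is scikit-learn's bundled copy, and the fixed random_state is an assumption made so that the panels differ only in learning rate.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
samples, species = iris.data, iris.target

# Illustrative values from the 50-200 range.
learning_rates = [50, 100, 200]

fig, axes = plt.subplots(1, len(learning_rates), figsize=(12, 4))
maps = {}
for ax, rate in zip(axes, learning_rates):
    # Recompute the whole map for each learning rate.
    maps[rate] = TSNE(learning_rate=rate, random_state=0).fit_transform(samples)
    ax.scatter(maps[rate][:, 0], maps[rate][:, 1], c=species)
    ax.set_title(f"learning_rate={rate}")
```

A panel in which all the points are bunched together would indicate a poor choice for this dataset.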
9. Different every time
A final thing to be aware of is that the axes of a t-SNE plot do not have any interpretable meaning. In fact, they are different every time t-SNE is applied, even on the same data. For example, here are three t-SNE plots of the scaled Piedmont wine samples, generated using the same code. Note that while the orientation of the plot is different each time, the three wine varieties, represented here using colors, have the same position relative to one another.
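The run-to-run variation can be reproduced with a sketch like the one below. Several assumptions apply: scikit-learn's bundled wine data is used as a stand-in for the wine samples in the video, two different random_state values simulate two separate runs, and init="random" is set explicitly because recent scikit-learn releases default to a PCA-based initialization, which can make separate runs look more alike.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the video's wine samples.
wine = load_wine()
scaled_samples = StandardScaler().fit_transform(wine.data)

# Two runs of the same model on the same data, differing only in the random seed.
map_a = TSNE(learning_rate=100, init="random", random_state=1).fit_transform(scaled_samples)
map_b = TSNE(learning_rate=100, init="random", random_state=2).fit_transform(scaled_samples)

# The raw coordinates differ between runs, so the axes carry no fixed meaning.
print("Maps identical:", np.allclose(map_a, map_b))
```

Plotting map_a and map_b with the points colored by wine.target is a simple way to compare how the varieties sit relative to one another in each run.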
10. Let's practice!
You are now equipped to use t-SNE to gain insight into some real-world datasets. Let's get some practice!