1. Visualizing hierarchies
A huge part of your work as a data scientist will be the communication of your insights to other people.
2. Visualizations communicate insight
Visualizations are an excellent way to share your findings, particularly with a non-technical audience. In this chapter, you'll learn about two unsupervised learning techniques for visualization: t-SNE and hierarchical clustering. t-SNE, which we'll consider later, creates a 2D map of a dataset, and conveys useful information about the proximity of the samples to one another. First up, however, let's learn about hierarchical clustering.
3. A hierarchy of groups
You've already seen many hierarchical clusterings in the real world. For example, living things can be organized into small narrow groups, like humans, apes, snakes and lizards, or into larger, broader groups like mammals and reptiles, or even broader groups like animals and plants. These groups are contained in one another, and form a hierarchy. Analogously, hierarchical clustering arranges samples into a hierarchy of clusters.
4. Eurovision scoring dataset
Hierarchical clustering can organize any sort of data into a hierarchy, not just samples of plants and animals. Let's consider a new type of dataset, describing how countries scored performances at the Eurovision 2016 song contest. The data is arranged in a rectangular array, where each row shows how many points one country gave to each song. The "samples" in this case are the countries.
5. Hierarchical clustering of voting countries
The result of applying hierarchical clustering to the Eurovision scores can be visualized as a tree-like diagram called a "dendrogram". This single picture reveals a great deal of information about the voting behavior of countries at Eurovision. The dendrogram groups the countries into larger and larger clusters, and many of these clusters are immediately recognizable as containing countries that are close to one another geographically, or that have close cultural or political ties, or that belong to a single language group. So hierarchical clustering can produce great visualizations. But how does it work?
6. Hierarchical clustering
Hierarchical clustering proceeds in steps. In the beginning, every country is its own cluster - so there are as many clusters as there are countries! At each step, the two closest clusters are merged. This decreases the number of clusters by one, and eventually there is only one cluster left, which contains all the countries. This process is actually a particular type of hierarchical clustering called "agglomerative clustering" - there is also "divisive clustering", which works the other way around. We haven't yet defined what it means for two clusters to be close, but we'll revisit that later on.
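The merging loop just described can be sketched in a few lines of Python. This is only an illustration of the idea, not how any library implements it. Since "closeness" of clusters hasn't been defined yet, the sketch assumes one possible choice: the distance between two clusters is the largest distance between any pair of their members. The function name `agglomerate` and the toy points are made up for this example.

```python
import numpy as np

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    points = np.asarray(points, dtype=float)
    # In the beginning, every sample is its own cluster (stored as a list of indices).
    clusters = [[i] for i in range(len(points))]
    history = []

    def cluster_distance(a, b):
        # ASSUMPTION: closeness = largest pairwise distance between members.
        return max(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        history.append((clusters[x], clusters[y], d))
        # Replace the two clusters with their union: one cluster fewer per step.
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return history

# Hypothetical toy samples (not Eurovision data): two tight pairs far apart.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5]]
merge_history = agglomerate(toy_points)
for left, right, dist in merge_history:
    print(left, "+", right, "merged at distance", round(dist, 3))
```

With four samples there are three merges: each tight pair is merged first, and the final step joins the two pairs into a single cluster containing everything.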
7. The dendrogram of a hierarchical clustering
The entire process of the hierarchical clustering is encoded in the dendrogram. At the bottom, each country is in a cluster of its own. The clustering then proceeds from the bottom up. Clusters are represented as vertical lines, and a joining of vertical lines indicates a merging of clusters. To understand better, let's zoom in
8. The dendrogram of a hierarchical clustering
and look at just one part of this dendrogram.
9. Dendrograms, step-by-step
In the beginning, there are six clusters, each containing only one country.
10. Dendrograms, step-by-step
The first merging is here, where the clusters containing Cyprus and Greece are merged together in a single cluster.
11. Dendrograms, step-by-step
Later on, this new cluster is merged with the cluster containing Bulgaria.
12. Dendrograms, step-by-step
Shortly after that, the clusters containing Moldova and Russia are merged,
13. Dendrograms, step-by-step
and the resulting cluster is later merged with the cluster containing Armenia.
14. Dendrograms, step-by-step
Later still, the two big composite clusters are merged together. This process continues
15. Dendrograms, step-by-step
until there is only one cluster left, and it contains all the countries.
16. Hierarchical clustering with SciPy
We'll use functions from SciPy to perform a hierarchical clustering on the array of scores. For the dendrogram, we'll also need a list of country names. First, import the linkage and dendrogram functions from scipy.cluster.hierarchy. Then apply the linkage function to the sample array. It's the linkage function that performs the hierarchical clustering. Notice there is an extra method parameter - we'll cover that in the next video. Now pass the output of linkage to the dendrogram function, specifying the list of country names as the labels parameter.
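As a rough sketch, those steps might look like the following. The `samples` array here is a small made-up stand-in, not the real Eurovision scores, and the variable names `samples`, `country_names` and `mergings` are chosen for illustration. The value `method="complete"` is also an assumption, since the narration only says that a method parameter is passed. The `leaf_rotation` and `leaf_font_size` arguments are optional extras that keep the labels readable.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# HYPOTHETICAL stand-in data: one row per country, one column per song.
# These numbers are invented for the example; they are not real scores.
country_names = ["Cyprus", "Greece", "Bulgaria", "Moldova", "Russia", "Armenia"]
samples = np.array([
    [12, 10, 0, 1],
    [12, 8, 1, 0],
    [10, 6, 3, 2],
    [1, 2, 12, 8],
    [0, 3, 12, 10],
    [2, 5, 8, 12],
], dtype=float)

# linkage performs the hierarchical clustering.
# ASSUMPTION: method="complete" is one valid choice for the method parameter.
mergings = linkage(samples, method="complete")

# dendrogram draws the tree, using the country names as leaf labels.
result = dendrogram(mergings, labels=country_names,
                    leaf_rotation=90, leaf_font_size=8)
plt.tight_layout()
plt.show()
```

The array returned by linkage has one row per merge, so six samples give five rows; the dendrogram function turns that record of merges into the picture.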
In the next video, you'll learn how to extract information from a hierarchical clustering,
17. Let's practice!
But for now, let's see what hierarchical clustering can do with some real-world datasets.