1. Visualize clusters
Hi everyone! Now that you are familiar with hierarchical clustering and how the algorithm works, let us take a step in the direction of visualizing clusters.
2. Why visualize clusters?
Why do we need to visualize clusters? One can quickly make sense of the clusters formed by any algorithm by visually analyzing them rather than just looking at cluster centers.
It can serve as an additional step for validation of clusters formed.
Additionally, you may also spot trends in your data by visually going through it. Let us now look at possible ways of visualizing the clusters that we have formed in our earlier exercise.
3. An introduction to seaborn
Seaborn is a data visualization library in Python that is based on matplotlib.
It provides better default plotting themes, which can be easily and intuitively modified.
It has functions for quick visualizations in the context of data analytics. In this course on clustering, we use pandas DataFrames to store our data, often adding a separate column for cluster labels.
Seaborn's scatterplot function provides an argument that assigns a different color to each cluster label, which makes the clusters easy to tell apart when visualizing them. Let us compare the implementation of the two plotting techniques - matplotlib and seaborn.
4. Visualize clusters with matplotlib
To visualize clusters, we first import the pyplot module from matplotlib.
Let us start with a pandas DataFrame which has the columns x, y and labels, holding the x and y coordinates of each point and its cluster label, A or B.
We will use the c argument of the scatter method to assign a color to each cluster. However, we first need to manually map each cluster to a color.
Therefore, we define a dictionary named colors with the cluster labels as keys, and the color associated with the clusters as its values.
We then pass a list of colors to the c argument, built with a lambda function that returns the color mapped to each point's cluster label.
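The steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the slides: the coordinate values and the red and blue color choices are made up for the example, and the name point_colors is introduced here only for readability.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Illustrative data: the coordinates are made up for this sketch
df = pd.DataFrame({
    "x": [2, 3, 5, 6, 2],
    "y": [1, 1, 5, 5, 2],
    "labels": ["A", "A", "B", "B", "A"],
})

# Manually map each cluster label to a color (assumed color choices)
colors = {"A": "red", "B": "blue"}

# Look up the color for every row's cluster label
point_colors = df["labels"].apply(lambda label: colors[label])

# Pass the resulting list of colors to the c argument of scatter
fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"], c=point_colors.tolist())
```

Calling plt.show() afterwards displays the figure. Note that every new cluster label needs its own entry in the colors dictionary, otherwise the lookup fails.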
5. Visualize clusters with seaborn
The implementation in seaborn is fairly straightforward with the built-in scatterplot function.
We first import the pyplot module and the seaborn library.
We use the same DataFrame as earlier to visualize the clusters.
To color each data point according to its cluster, we use the hue argument of the scatterplot function and pass it the name of the column that holds the cluster labels, which is labels in this example.
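A minimal sketch of the seaborn version might look like this. As before, the coordinate values are made up for illustration; only the column names x, y and labels come from the example described here.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Illustrative data: the coordinates are made up for this sketch
df = pd.DataFrame({
    "x": [2, 3, 5, 6, 2],
    "y": [1, 1, 5, 5, 2],
    "labels": ["A", "A", "B", "B", "A"],
})

# hue takes the name of the column holding the cluster labels;
# seaborn picks the colors and builds the legend on its own
ax = sns.scatterplot(x="x", y="y", hue="labels", data=df)
```

Calling plt.show() afterwards displays the figure. No color dictionary is needed here, since seaborn assigns a color from its palette to each distinct value in the labels column.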
Now that we have written the code for each of them, let us compare the results. Recall from the last lesson that seaborn shows an extra cluster with label 0 if the cluster labels are integers. In this example, we have manually assigned string cluster labels, so this issue will not arise.
6. Comparison of both methods of visualization
Although the results are comparable, there are two reasons why we prefer seaborn. First, the implementation using seaborn is more convenient once you have stored the cluster labels in your DataFrame. Second, you do not need to manually select colors in seaborn, because it uses a default palette no matter how many clusters you have.
7. Next up: Try some visualizations
Now that you know how to visualize data using two libraries, let us try some exercises.