1. Clustering with multiple features
In the final video exercise of the course, let us perform clustering on the FIFA dataset again. However, this time we will consider more than two variables and try to interpret and validate the results of clustering.
2. Basic checks
When clustering with more than three features, it is not possible to visualize and assess all of them at the same time. We will therefore discuss a few techniques to validate your results. This step assumes that you have created the elbow plot, performed the clustering process and generated cluster labels.
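As a reminder of what that assumed starting point might look like, here is a minimal sketch using SciPy's `kmeans`, `vq` and `whiten`. The data is invented for illustration (it is not the FIFA dataset), and the choice of two clusters is an assumption made for the toy data.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Invented toy data: two well-separated groups with four features each.
rng = np.random.default_rng(0)
raw = np.vstack([
    rng.normal(loc=20, scale=2, size=(30, 4)),
    rng.normal(loc=70, scale=2, size=(30, 4)),
])
data = whiten(raw)  # scale each feature to unit standard deviation

# Input for the elbow plot: distortion for each candidate number of clusters
num_clusters = range(1, 5)
distortions = [kmeans(data, k)[1] for k in num_clusters]

# Final run with the chosen number of clusters (2 for this toy data),
# then assign a cluster label to every row
cluster_centers, _ = kmeans(data, 2)
cluster_labels, _ = vq(data, cluster_centers)
```

Plotting `distortions` against `num_clusters` would give the elbow plot; `cluster_labels` is what the checks in the rest of this video work with.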
First, you can check how the cluster centers vary with respect to the overall data. If the cluster centers of a feature do not vary significantly with respect to the overall data, it may be an indication that you can drop that feature in the next run.
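One way to make this check concrete is sketched below with pandas. The data, the `stamina` column and the 0.5 cut-off are all invented for illustration; treat the threshold as a rule of thumb rather than a fixed criterion.

```python
import pandas as pd

# Toy data, invented for illustration (not the FIFA dataset).
df = pd.DataFrame({
    "heading_accuracy": [0.9, 1.0, 1.1, 3.0, 3.2, 3.1],
    "volleys":          [0.8, 1.0, 0.9, 3.1, 2.9, 3.3],
    "stamina":          [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
    "cluster_labels":   [0, 0, 0, 1, 1, 1],
})
features = ["heading_accuracy", "volleys", "stamina"]

# Cluster centers (mean per cluster) next to the overall mean
centers = df.groupby("cluster_labels")[features].mean()
overall = df[features].mean()

# Range of the centers per feature, relative to the overall standard deviation
spread = (centers.max() - centers.min()) / df[features].std()

# Features whose centers barely move are candidates to drop (0.5 is an assumed cut-off)
low_variation = spread[spread < 0.5].index.tolist()
```

Here only `stamina` ends up in `low_variation`, because its center is the same in both clusters.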
Next, you can also look at the sizes of the clusters formed. If one or more clusters are significantly smaller than the rest, you may want to double-check whether their cluster centers are similar to those of other clusters. If the answer is yes, you may want to reduce the number of clusters in subsequent runs.
In this case, you notice that the second cluster is significantly smaller. This is because we have performed clustering on three attacking attributes, for which goalkeepers have very low values, as indicated by the cluster centers. Hence, the smaller cluster is composed primarily of goalkeepers, as we will explore later.
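A quick way to look at cluster sizes is to count the labels. The sketch below uses invented labels, and the rule for flagging a "small" cluster (less than half of an equal share) is an assumption made for illustration.

```python
import pandas as pd

# Invented cluster labels for ten observations
labels = pd.Series([0, 0, 0, 0, 0, 1, 2, 2, 2, 2], name="cluster_labels")

# Number of observations in each cluster, ordered by cluster label
sizes = labels.value_counts().sort_index()

# Share of the data in each cluster
share = sizes / sizes.sum()

# Flag clusters holding less than half of an equal share (assumed rule of thumb)
small = share[share < 0.5 / len(sizes)].index.tolist()
```

For any cluster that is flagged, the next step is to compare its center with the others before deciding whether to reduce the number of clusters.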
3. Visualizations
Even though all variables cannot be visualized across clusters, there are other simpler visualizations that help you understand the results of clustering.
You may either visualize cluster centers or other variables stacked against each other.
In pandas, you can use the plot method after groupby to generate such plots. This example shows a bar chart. You can also create a line chart to see how variables vary across clusters.
In our case, you will notice that all three attributes are significantly higher in one cluster.
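A minimal sketch of such a plot is shown below. The data and the `finishing` column are invented for illustration, and the `Agg` backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Toy data, invented for illustration (not the FIFA dataset).
df = pd.DataFrame({
    "heading_accuracy": [0.9, 1.0, 1.1, 3.0, 3.2, 3.1],
    "volleys":          [0.8, 1.0, 0.9, 3.1, 2.9, 3.3],
    "finishing":        [0.7, 0.9, 1.0, 3.4, 3.0, 3.2],
    "cluster_labels":   [0, 0, 0, 1, 1, 1],
})
features = ["heading_accuracy", "volleys", "finishing"]

# Mean of each feature per cluster, i.e. the cluster centers
centers = df.groupby("cluster_labels")[features].mean()

# One group of bars per cluster, one bar per feature
ax = centers.plot(kind="bar")
ax.set_ylabel("mean value")
```

Replacing `kind="bar"` with `kind="line"` gives the line chart variant.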
4. Top items in clusters
Finally, let us check five players from each cluster.
As expected, the first cluster has top attack-minded players like Ronaldo, Messi and Neymar. As explained earlier, the second cluster has top goalkeepers like Manuel Neuer, De Gea and Buffon, who have very low values for traits like volleys and heading accuracy.
This suggests that our clustering was appropriate.
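Listing the first few players of each cluster could look like the sketch below. The DataFrame is a tiny stand-in built from the names mentioned above, and it assumes that the rows are already sorted so that the top players come first.

```python
import pandas as pd

# Tiny stand-in for the labelled data; rows are assumed to be sorted
# so that the top players come first.
players = pd.DataFrame({
    "name": ["Ronaldo", "Messi", "Neymar", "Manuel Neuer", "De Gea", "Buffon"],
    "cluster_labels": [0, 0, 0, 1, 1, 1],
})

# Up to five names from each cluster
top = {
    cluster: group["name"].head(5).tolist()
    for cluster, group in players.groupby("cluster_labels")
}

for cluster, names in top.items():
    print(cluster, names)
```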
5. Feature reduction
When dealing with a large number of features, certain techniques of feature reduction may be used. Two popular tools to reduce the number of features are factor analysis and multidimensional scaling.
Although these are beyond the scope of this course, you may consider them as a precursor to clustering.
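For orientation only, here is a rough sketch of what factor analysis as a pre-processing step might look like, assuming scikit-learn is available. The data is random and the choice of two factors is arbitrary.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Invented data: 50 observations with 6 features
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 6))

# Reduce the 6 features to 2 factors (the number of factors is an arbitrary choice)
fa = FactorAnalysis(n_components=2, random_state=0)
reduced = fa.fit_transform(features)
```

The `reduced` array, with one row per observation and one column per factor, could then be passed to the clustering step in place of the original features.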
6. Final exercises!
Let us now move on to the final exercises of this course.