1. Review K-means results
I don't know about you, but the results of the last exercise were a little unexpected!
You used three approaches for finding clusters and got three completely different answers.
Which of them is the right one?
2. Three clustering results
If there is one point I want you to remember from this class, it is that the answer is always: it depends.
It depends on the clustering setup, on the question we are trying to answer, and on our understanding of the data we are working with.
To put it another way, clustering methods involve a certain amount of subjectivity. They are the looking glass through which we gain a new perspective on our data, but it is up to us to use that perspective judiciously.
In this case, if you asked for my opinion, I would say that the hierarchical clustering analysis makes the most sense here. Its three distinct clusters of occupations effectively grouped similar slopes of wage growth while separating out the unique trends that appear.
3. Comparing the two clustering methods
Is this always the case? What are the differences between k-means and hierarchical clustering?
Well there are some fundamental differences between the two.
K-means relies exclusively on Euclidean distance, whereas hierarchical clustering can handle virtually any distance metric.
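To make this concrete, here is a minimal sketch of that difference in R; the built-in mtcars dataset and the Manhattan metric are stand-ins chosen purely for illustration.

```r
# hclust() works from any distance matrix that dist() can produce
data <- scale(mtcars)  # mtcars is a stand-in for any numeric data

dist_manhattan <- dist(data, method = "manhattan")
hc <- hclust(dist_manhattan, method = "complete")

# kmeans() has no distance argument; it always minimizes within-cluster
# sums of squared Euclidean distances on the features themselves
km <- kmeans(data, centers = 3)
```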
K-means includes a random initialization step that may yield different results if the process is re-run; this would not occur in hierarchical clustering, which is deterministic.
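A quick sketch of what that means in practice, again with mtcars as a placeholder: fixing the random seed (or raising nstart) makes the k-means result reproducible.

```r
# Re-running kmeans() can give different assignments unless the
# random seed is fixed; hclust() needs no such precaution.
set.seed(42)
km1 <- kmeans(scale(mtcars), centers = 3, nstart = 25)

set.seed(42)
km2 <- kmeans(scale(mtcars), centers = 3, nstart = 25)

identical(km1$cluster, km2$cluster)  # TRUE when the seed is fixed
```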
To estimate the value of k, we can use silhouette analysis and the elbow method for k-means; the same can be said for hierarchical clustering, which has the added benefit of the dendrogram.
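As a refresher, here is a compact sketch of both estimates in the style of the course exercises; mtcars is once more just a placeholder dataset, and the range of k values is an arbitrary choice.

```r
library(purrr)   # map_dbl(); base sapply() works just as well
library(cluster) # pam() and its silhouette information

data <- scale(mtcars)

# Elbow method: total within-cluster sum of squares for k = 1..10
tot_withinss <- map_dbl(1:10, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})

# Silhouette analysis: average silhouette width for k = 2..10
sil_width <- map_dbl(2:10, function(k) {
  pam(data, k = k)$silinfo$avg.width
})
```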
So why would we ever use k-means clustering instead of hierarchical clustering? The main reason is that the k-means algorithm is less computationally expensive and can be run on much larger data within a reasonable time frame. This is why the algorithm remains in such wide use and popularity.
4. What have you learned?
I hope you enjoyed this journey to develop the tools and intuition for working with unsupervised clustering as much as I did.
In chapter one you learned the concept central to all clustering: distance. You also learned how important scale can be when calculating distance.
In chapter two you learned the fundamentals of hierarchical clustering, where you used distance to iteratively build a dendrogram and then cut it into clusters.
In chapter three you worked with the k-means clustering method and learned about its associated tools.
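Put together, the whole workflow fits in a few lines. Here is a hedged recap sketch, with mtcars once again standing in for your own data and k = 3 as an arbitrary choice.

```r
# Chapter 1: scale the features, then compute a distance matrix
data <- scale(mtcars)
d <- dist(data)

# Chapter 2: build the dendrogram, then cut it into clusters
hc <- hclust(d, method = "complete")
plot(hc)                          # inspect the dendrogram
clusters_hc <- cutree(hc, k = 3)

# Chapter 3: k-means on the same scaled data
set.seed(42)
km <- kmeans(data, centers = 3, nstart = 25)
clusters_km <- km$cluster
```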
You learned a lot; you should give yourself a pat on the back.
5. A lot more to learn
Of course this is only the beginning of your journey. These are just some of the tools you may encounter as you delve further into the world of unsupervised clustering.
As a bonus, the pam function you used for silhouette analysis actually uses the k-medoids method. It's very similar to k-means, except that it can accommodate distance matrices with arbitrary distances, just like hierarchical clustering. You should try it out.
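For instance, here is a minimal sketch of pam() clustering directly on a precomputed, non-Euclidean distance matrix; mtcars and the Manhattan metric are illustrative choices only.

```r
library(cluster)

# Unlike kmeans(), pam() happily accepts a dissimilarity object
# built with any metric dist() supports
d <- dist(scale(mtcars), method = "manhattan")
pm <- pam(d, k = 3)
pm$clustering  # cluster assignment for each observation
```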
Two other methods you might be interested in are DBSCAN and OPTICS clustering. Both are very commonly used algorithms that we sadly don't have time for in this course, but I strongly encourage you to take what you've learned here and pursue them.
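As a starting pointer only, since these are not covered in the course: a minimal DBSCAN sketch, assuming the dbscan package is installed (install.packages("dbscan")); the eps and minPts values are illustrative, not tuned.

```r
library(dbscan)

data <- scale(mtcars)
db <- dbscan(data, eps = 2, minPts = 4)  # illustrative parameters
db$cluster  # assignments; 0 marks points classified as noise
```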
6. Congratulations!
If I can leave you with one parting thought that has helped me along my path in data science, it is this: building intuition for your methods is just as important as, if not more important than, learning how to use their associated tools.
Like the explorers of old, we data scientists have a lot of uncharted waters ahead of us; best we understand how our ship works.