1. What is cluster analysis?
Hi, my name is Dima and I am very excited to have you join me in learning all about cluster analysis in R. Cluster analysis is a form of data exploration, and the key to harnessing its power lies in understanding how it works. So, in this course you won't just learn the tools necessary to perform cluster analysis - that's the easy part - I will also work with you to build the intuition behind the underlying methods. But before we get to the how, let's take a moment to discuss what clustering is.
2. What is clustering?
No matter whether you are working with medical data,
3. What is clustering?
retail data,
4. What is clustering?
or sports data, as a data scientist you are often presented with a bunch of data that you need to make sense of.
5. What is clustering?
To understand what clustering is, let's put aside the details of our data and instead focus on the toy example
6. What is clustering?
where the data is represented as a matrix containing entries of card suits.
7. What is clustering?
To look at it another way, this matrix is composed of rows containing our observations and columns that tell us something that we measured across these observations.
We will refer to these columns as the features of our observations.
In cluster analysis we are interested in grouping our observations such that all members of a group are similar to one another and at the same time they are distinctly different from all members outside of this group.
Imagine in this example we performed cluster analysis to find which observations are similar to one another based on what suit appears in each column.
8. What is clustering?
In this case we identified three groups and colored the observations accordingly.
To better see these patterns, let's re-organize our observations into their respective colored clusters.
9. What is clustering?
Here we can start to see clear patterns that emerge.
Fundamentally, this is how cluster analysis works.
10. What is clustering?
Or to put it another way, cluster analysis is a form of exploratory data analysis where observations are divided into meaningful groups whose members share common characteristics with one another.
So what are the steps involved in performing cluster analysis?
11. The flow of cluster analysis
Well, first, you must make sure that your data is ready for clustering, meaning that your data does not have any missing values and that your features are on similar scales.
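To make this preparation step concrete, here is a minimal sketch in base R. The `players` data frame and its column names are hypothetical and made up for illustration. Dropping incomplete rows with `na.omit()` and standardizing with `scale()` is only one reasonable way to prepare data, not the only one.

```r
# Hypothetical toy data: four observations, two features on very different scales
players <- data.frame(
  height_cm = c(180, 165, NA, 172),
  salary    = c(52000, 48000, 61000, 75000)
)

# Drop observations with missing values (imputation would be another option)
complete_players <- na.omit(players)

# Standardize each feature to mean 0 and standard deviation 1
scaled_players <- scale(complete_players)

colMeans(scaled_players)      # approximately 0 for each feature
apply(scaled_players, 2, sd)  # 1 for each feature
```

After scaling, a difference of one unit means "one standard deviation" in every column, so no single feature dominates just because of its units.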
12. The flow of cluster analysis
Next, you must decide on what metric is appropriate to capture the similarity between your observations using the features that you have.
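As a small illustration of what such a metric looks like, the sketch below uses `dist()` from base R to compute two common distances between a pair of observations. The toy data is hypothetical, and which metric is appropriate depends on your features.

```r
# Hypothetical toy data: two observations described by two numeric features
obs <- data.frame(x = c(0, 3), y = c(0, 4))

# Euclidean distance: straight-line distance between the two rows
d_euclidean <- dist(obs, method = "euclidean")

# Manhattan distance: sum of absolute differences across features
d_manhattan <- dist(obs, method = "manhattan")

as.numeric(d_euclidean)  # 5
as.numeric(d_manhattan)  # 7
```

A larger distance means the two observations are less similar, so in practice we often work with dissimilarity rather than similarity.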
13. The flow of cluster analysis
Once you have calculated this, you can use a clustering method to group your observations into clusters based on how similar they are to each other.
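Here is a minimal sketch of this step using hierarchical clustering from base R. The toy data, the choice of complete linkage, and the choice of two clusters are all assumptions made for illustration.

```r
# Hypothetical toy data: two well-separated groups of three observations each
obs <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.3, 7.9),
  y = c(1.0, 0.9, 1.1, 8.0, 7.8, 8.2)
)

# Pairwise Euclidean distances between observations
d <- dist(obs)

# Build the hierarchy using complete linkage (one of several linkage options)
hc <- hclust(d, method = "complete")

# Cut the hierarchy into two clusters; returns one cluster label per observation
clusters <- cutree(hc, k = 2)
clusters
```

The first three observations end up in one cluster and the last three in the other, which matches how the toy data was constructed.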
14. The flow of cluster analysis
But, most importantly, you will need to analyze the output of these clusters to determine whether they provide any meaningful insight into your data. This often requires a deep understanding of the problem and the data that you are working with.
15. The flow of cluster analysis
As you can see in this flow chart, the analysis you perform on these clusters may require you to iterate on the clustering steps until you converge on a meaningful grouping of your data.
16. Structure of this course
The first three chapters of this course will help you unpack this process.
In this chapter you will gain a deeper understanding of what it means for two observations to be similar - or, more specifically, dissimilar. You will also learn why the features of your data need to be comparable to one another.
17. Structure of this course
In chapters two and three you will learn how to use two common clustering methods: hierarchical clustering and k-means clustering.
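As a quick preview of the second of these methods, here is a minimal k-means sketch using `kmeans()` from base R. The toy data, the seed, and the choice of two centers are assumptions made for illustration.

```r
# Hypothetical toy data: two well-separated groups of three observations each
obs <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.3, 7.9),
  y = c(1.0, 0.9, 1.1, 8.0, 7.8, 8.2)
)

# k-means starts from random centers, so fix the seed for reproducibility
set.seed(42)

# Ask for two clusters; nstart = 10 tries several random starts and keeps the best
km <- kmeans(obs, centers = 2, nstart = 10)

km$cluster  # one cluster label per observation
km$centers  # the coordinates of the two cluster centers
```

Unlike hierarchical clustering, k-means needs the number of clusters up front, and the cluster numbers themselves are arbitrary labels.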
At the end of these chapters and in chapter four you will work through two case studies where cluster analysis provides a unique perspective into the underlying data.
18. Let's learn!
So, let's begin!