
Introduction to model-based clustering

1. Introduction to model-based clustering

Hi, I'm Victor Medina. I'm a researcher at SBIF, and I really enjoy using R to extract valuable insights from data. This course will introduce you to mixture models, one of the most interesting and useful statistical frameworks for clustering and extracting patterns from data.

2. What is clustering?

First, let's clarify what we mean when we talk about clustering. Simply put, clustering is the procedure of partitioning a collection of observations into a set of meaningful subclasses, or clusters. By meaningful, we mean that all the observations belonging to a cluster share some similarities but are essentially distinct from the observations in other clusters. This procedure lets us explore the natural structure in a data set.

3. Applications of clustering

Cluster analysis is used across several disciplines. In medicine, for example, it is used to analyze different types of tissues in a medical scan as an aid to the diagnosis of disease. In business, it is used to discover different groups of customers in order to develop targeted marketing programs. And in social sciences, it can be used to identify zones in a city by the type of crimes committed, so that law enforcement resources can be managed more effectively, as we will see later.

4. Clustering methods

There are many approaches to clustering, and the choice will depend on the aim of the analysis. Widely used are the partitioning techniques, the hierarchical techniques, and the model-based methods. The first tries to find the centres of the clusters and assigns each observation exclusively to the closest cluster. An example of this approach is k-means. The second connects the observations based on their similarity to start forming the clusters, which means the number of clusters is related to the number of connections we have made. At the lowest level, for example, when we have no connections between the observations, each observation is its own cluster. At the highest level, all the observations are connected to form a single cluster. The third models the observations as if they were generated from a combination of probability distributions. This approach is what concerns us, and it is suitable when we are interested not only in the clusters themselves but also in measuring the probability of belonging to a cluster.
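The three approaches can be sketched side by side in R. This is a minimal illustration on synthetic two-column data; the `mclust` package named in the last comment is an assumption, as the course's tooling has not yet been introduced.

```r
set.seed(1)
# Synthetic two-group data for illustration only
x <- data.frame(
  a = c(rnorm(50, 0), rnorm(50, 5)),
  b = c(rnorm(50, 0), rnorm(50, 5))
)

# Partitioning: k-means assigns each point to its nearest centre
km <- kmeans(x, centers = 2)

# Hierarchical: link points by similarity, then cut the tree
hc <- cutree(hclust(dist(x)), k = 2)

# Model-based: fit a mixture of probability distributions and
# obtain membership probabilities (e.g. with the mclust package):
# fit <- mclust::Mclust(x, G = 2)
```

Note that `kmeans()` and `hclust()` return only hard labels, while a fitted mixture model also returns a probability of membership for each observation.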

5. Gender dataset

To clarify the differences between these approaches, let's see an example using the Gender dataset. This data comprises ten thousand observations of men and women. Here you can see the first six rows for the variables Height, Weight, and Body Mass Index.

6. Gender dataset: Can you guess the gender?

Using `ggplot2` and the function `geom_point()`, we can visualize a scatterplot of the variables `BMI` versus `Weight`. The idea is to identify two subclasses that can be interpreted as the genders.
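A minimal sketch of that plot, assuming the data is loaded as a data frame called `gender` (simulated below, since the course data is not included here):

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the Gender data; column names assumed
# to match the table above (Weight, BMI)
gender <- data.frame(
  Weight = c(rnorm(100, 60, 8), rnorm(100, 85, 10)),
  BMI    = c(rnorm(100, 22, 2), rnorm(100, 27, 3))
)

# Scatterplot of BMI versus Weight
p <- ggplot(gender, aes(x = Weight, y = BMI)) +
  geom_point(alpha = 0.4)
p
```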

7. Gender dataset: Can you guess the gender?

You might have the intuition that females tend to be in the bottom left of the plot and males in the top right. We can refine this intuition using cluster analysis.

8. Under traditional cluster approaches

With k-means, for example, we find two subpopulations. But it is not realistic to expect what appears to be a straight line to be an appropriate separation between the genders. It would be more suitable to have a measure of the certainty of belonging to a cluster, such as a probability.
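A sketch of that k-means step, again on simulated stand-in data (the real course data is not shown here):

```r
set.seed(1)
# Simulated stand-in for the Gender data; column names assumed
gender <- data.frame(
  Weight = c(rnorm(100, 60, 8), rnorm(100, 85, 10)),
  BMI    = c(rnorm(100, 22, 2), rnorm(100, 27, 3))
)

# k-means yields a hard assignment: every row gets exactly one
# cluster label, with no measure of uncertainty at the boundary
km <- kmeans(gender, centers = 2)
table(km$cluster)
```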

9. Model-based clustering

Adding the density curves to the original scatterplot, we can observe two peaks along each axis, telling us how the points are concentrated or distributed.
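One way to see such a peak in R is a univariate density plot, shown here for a simulated stand-in variable (the course's actual plotting code may differ):

```r
library(ggplot2)

set.seed(1)
# Simulated bimodal Weight variable; two peaks in its density
# suggest two subpopulations along this axis
gender <- data.frame(
  Weight = c(rnorm(100, 60, 8), rnorm(100, 85, 10))
)

p <- ggplot(gender, aes(x = Weight)) +
  geom_density()
p
```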

10. Model-based clustering

In fact, we might get the impression that the two subclasses each come from their own probability distribution. This is exactly what you assume when you fit a mixture model, because these models are built from probability distributions. Instead of assigning each observation to the closest cluster centre, as k-means does, we optimize a probabilistic model.
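A minimal sketch of fitting such a model with the `mclust` package (an assumption: the course's actual tooling may differ), again on simulated stand-in data. The key output is a matrix of membership probabilities, one row per observation.

```r
library(mclust)  # assumed package; fits Gaussian mixture models

set.seed(1)
# Simulated stand-in for the Gender data; column names assumed
gender <- data.frame(
  Weight = c(rnorm(100, 60, 8), rnorm(100, 85, 10)),
  BMI    = c(rnorm(100, 22, 2), rnorm(100, 27, 3))
)

# Fit a two-component Gaussian mixture; fit$z holds each
# observation's probability of belonging to each cluster
fit <- Mclust(gender, G = 2)
head(round(fit$z, 3))
```

Unlike the hard labels from k-means, each row of `fit$z` sums to one, giving a measure of how certain the assignment is.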

11. Let's practice!

Now, it's time for you to practice!