Structure of mixture models

1. Structure of mixture models

By now, you should be familiar with simulating Gaussian mixture models. In this lesson, I will extend the notion of mixture models so that it applies to several kinds of problems.

2. Description of mixture models

To cluster your data with mixture models, you need to answer three questions. First, which distribution is suitable to model your data? There is no easy answer if you are not familiar with the different probability distributions and the problems they solve, which is why we'll see practical examples for different kinds of data. Second, how many clusters explain your data? To answer this, we could either consult experts or fit mixture models for a range of cluster counts and select the one that best satisfies a statistical criterion, as we will see later. Third, once we know the suitable distribution and the number of clusters, what are the parameters and how can we estimate them? To answer this, we'll dive into a useful method called the Expectation-Maximization algorithm at the end of this chapter.
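One common way to write a mixture model makes these three questions concrete. The notation below is an assumption added for illustration and is not taken from the slides: f is the chosen distribution (question one), K is the number of clusters (question two), and the proportions \pi_k together with the distribution parameters \theta_k are what has to be estimated (question three).

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, f(x \mid \theta_k),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```

For a Gaussian mixture, \theta_k holds a mean and a spread; for a Bernoulli mixture, a probability for each dot; for a Poisson mixture, an average count for each column.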

3. Example 1: Gender dataset

To understand how these three questions relate to mixture models, let's discuss three different examples that we'll work on throughout the course. The first is the Gender dataset, where we saw that the BMI and the weight were well explained using the Gaussian distribution.

4. Example 1: Gender dataset results

So, in terms of structure, we can model the data with a two-dimensional Gaussian distribution. We also considered two clusters, because we wanted to identify two sub-populations and relate them to gender. And finally, I have already estimated, but not yet shown you how to calculate, the means, the standard deviations and the proportions of each Gaussian; the proportions are the percentages shown in the picture. Together, these are the parameters of this particular model.
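To make the role of these parameters concrete, here is a minimal Python sketch that simulates data from a two-cluster, two-dimensional Gaussian mixture. Everything in it is an assumption for illustration: the numbers in `PARAMS` are made up and are not the estimates from the course, the names are hypothetical, and the two dimensions are treated as independent within each cluster, which is a simplification.

```python
import random

# Hypothetical parameters, made up for illustration (NOT the course's estimates).
# Each cluster has a proportion, a mean per dimension (weight, BMI) and a
# standard deviation per dimension. Simplifying assumption: within a cluster,
# the two dimensions are independent.
PARAMS = [
    {"proportion": 0.5, "mean": (60.0, 22.0), "sd": (8.0, 3.0)},
    {"proportion": 0.5, "mean": (85.0, 27.0), "sd": (10.0, 3.5)},
]


def sample_gaussian_mixture(params, n, seed=0):
    """Draw n (cluster, point) pairs from the mixture described by params."""
    rng = random.Random(seed)
    weights = [p["proportion"] for p in params]
    samples = []
    for _ in range(n):
        # Step 1: pick a cluster according to the proportions.
        k = rng.choices(range(len(params)), weights=weights)[0]
        # Step 2: draw each dimension from that cluster's Gaussian.
        point = tuple(
            rng.gauss(mean, sd)
            for mean, sd in zip(params[k]["mean"], params[k]["sd"])
        )
        samples.append((k, point))
    return samples


data = sample_gaussian_mixture(PARAMS, n=1000)
share_cluster_0 = sum(1 for k, _ in data if k == 0) / len(data)
print(f"share of points drawn from cluster 0: {share_cluster_0:.2f}")
```

Estimation goes in the opposite direction: starting from the points alone, without the cluster labels, we try to recover the proportions, means and standard deviations.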

5. Example 2: Handwritten digits

The second example involves clustering handwritten digit images using Bernoulli distributions. A black and white image can be thought of as a long line of zeros and ones, where the 1s represent black dots and the 0s white dots. Since the data are binary, we can no longer use Gaussian distributions and we need to pick a more suitable one.

6. Example 2: Handwritten digits results

We will see that by making use of Bernoulli distributions, we can extract the patterns that represent these digits and, surprisingly, recover the form of the digits themselves. The estimated parameters consist of the proportion of each cluster, which in this case of comparing two handwritten digits is 50% and 50%, and the probability of being 1 for every dot, which is represented by the grey tone in the picture. The closer the probability is to one, the closer the tone is to black.
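As a rough sketch of how these parameters are used, the Python snippet below takes a tiny binary image and computes how likely it is to belong to each cluster. The four-dot "images", the probabilities and the function names are all hypothetical and chosen for illustration; the parameters are set by hand rather than estimated.

```python
import math


def bernoulli_log_likelihood(pixels, probs):
    """Log-probability of a binary image, treating the dots as independent."""
    return sum(
        math.log(p) if x == 1 else math.log(1.0 - p)
        for x, p in zip(pixels, probs)
    )


def cluster_posteriors(pixels, proportions, cluster_probs):
    """Probability that the image belongs to each cluster (Bayes' rule)."""
    log_joint = [
        math.log(w) + bernoulli_log_likelihood(pixels, probs)
        for w, probs in zip(proportions, cluster_probs)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint)
    unnormalised = [math.exp(v - m) for v in log_joint]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical hand-set parameters: two clusters with equal proportions and
# one probability of being 1 (black) per dot.
proportions = [0.5, 0.5]
cluster_probs = [
    [0.9, 0.9, 0.1, 0.1],  # cluster 0: first two dots are usually black
    [0.1, 0.1, 0.9, 0.9],  # cluster 1: last two dots are usually black
]

image = [1, 1, 0, 0]
posteriors = cluster_posteriors(image, proportions, cluster_probs)
print(posteriors)
```

Here the image matches the pattern of cluster 0, so almost all of the probability goes to that cluster.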

7. Example 3: Crime types

Our third example involves crime statistics in Chicago. The data are counts, as in the image, where each row is a community area and each column is a type of crime committed. We can use Poisson distributions to cluster the communities and find which areas are more or less dangerous to live in.

8. Example 3: Crime types results

The results are depicted on the map, where red zones are more dangerous than green ones. Unlike the other examples, here we have more than two clusters, with different proportions. For example, nearly 60% of the communities belong to clusters 5 and 6, the safest ones. To select the number of clusters, I picked the mixture model that scores best on a statistical criterion called BIC, which you will cover later. The estimated parameters are the average number of crimes of each type in each cluster and the cluster proportions. Similar data are found in business, where instead of communities you have clients and instead of crime types you have products, so you might find clusters of clients with similar purchasing behaviour.
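To give a feel for how a criterion like BIC compares models, here is a minimal Python sketch on made-up count data. It assumes the common definition BIC = -2 × log-likelihood + (number of parameters) × log(number of rows), where lower is better; some software reports it with the opposite sign. The toy rows, the hand-set rates and the function names are hypothetical, and the parameters are not estimated here, since estimation is covered later.

```python
import math


def poisson_log_pmf(count, rate):
    """Log-probability of observing `count` under a Poisson with mean `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def mixture_log_likelihood(rows, proportions, rates):
    """Log-likelihood of count rows under a Poisson mixture.

    rates[k][j] is the average count of column j in cluster k; columns are
    treated as independent within a cluster.
    """
    total = 0.0
    for row in rows:
        log_terms = [
            math.log(w) + sum(poisson_log_pmf(c, r) for c, r in zip(row, cluster_rates))
            for w, cluster_rates in zip(proportions, rates)
        ]
        # Log-sum-exp over clusters, for numerical stability.
        m = max(log_terms)
        total += m + math.log(sum(math.exp(t - m) for t in log_terms))
    return total


def bic(log_lik, n_clusters, n_columns, n_rows):
    """BIC with the lower-is-better convention (an assumption, see lead-in)."""
    # One rate per cluster and column, plus the free cluster proportions.
    n_params = n_clusters * n_columns + (n_clusters - 1)
    return -2.0 * log_lik + n_params * math.log(n_rows)


# Made-up counts: 4 "communities" (rows) by 2 "crime types" (columns).
rows = [[2, 1], [3, 0], [20, 15], [18, 17]]

# One cluster: rates set by hand to the column averages.
ll_one = mixture_log_likelihood(rows, [1.0], [[10.75, 8.25]])
bic_one = bic(ll_one, n_clusters=1, n_columns=2, n_rows=len(rows))

# Two clusters: rates set by hand to the averages of the low and high rows.
ll_two = mixture_log_likelihood(rows, [0.5, 0.5], [[2.5, 0.5], [19.0, 16.0]])
bic_two = bic(ll_two, n_clusters=2, n_columns=2, n_rows=len(rows))

print(f"BIC with 1 cluster:  {bic_one:.1f}")
print(f"BIC with 2 clusters: {bic_two:.1f}")
```

On this toy data the two-cluster model has the lower BIC: the extra parameters are penalised, but the improvement in fit more than makes up for it.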

9. Let's practice!

Now let's practice with these three datasets!
