1. Univariate Gaussian Mixture Models
So far, you have learned what mixture models are and how they are estimated.
In this chapter, we will cluster the gender dataset by applying those concepts and estimate the parameters using the `flexmix` package.
Gaussian mixture models are built from Gaussian distributions, which can have one or more dimensions.
To understand how the model is interpreted as the dimension of the Gaussian distribution increases, we will start with the univariate case, which paves the way for the bivariate case in the second part of this chapter.
2. Gender dataset
Let's recall what the gender dataset looks like.
We have four columns: the registered gender, the height, the weight and the body mass index for 10,000 observations.
Recall that we want to identify two clusters that may correspond to gender.
In practical applications, though, chances are you will not be provided with the true labels of the observations.
Here, however, we have them just to illustrate how the clustering behaves.
3. Modeling with Mixture Models
Earlier, we saw that using mixture models for clustering requires answering three questions:
which type of distribution we should consider for the data, how many clusters there are, and what the parameter estimates of the model are.
So, let's walk through the answers to these questions.
4. Clustering with one variable
In principle, we could consider as many variables as we want to perform clustering, but the interpretation gets more difficult.
So in order to understand the transition from one variable to more than one, let's start by picking the variable Weight.
Later, we will incorporate another variable into the clustering to see if we can improve the analysis.
5. Exploratory data analysis
The first step before fitting any model is to perform exploratory data analysis on your variables.
The histogram is a good visual tool for getting an idea of a variable's distribution and, in this case, it also helps to gauge the number of subpopulations.
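As a rough illustration of this step, the sketch below draws a histogram in base R. The gender dataset itself is not reproduced here, so `weight` is a simulated stand-in with made-up means and standard deviations, chosen only to show how two subpopulations appear as two modes.

```r
# Simulated stand-in for the Weight column (hypothetical values, not the real data).
set.seed(42)
weight <- c(rnorm(5000, mean = 135, sd = 19),
            rnorm(5000, mean = 187, sd = 20))

# Histogram on the density scale; a finer binning makes the two modes easier to see.
h <- hist(weight, breaks = 50, freq = FALSE,
          main = "Histogram of Weight (simulated)", xlab = "Weight")
```

With the real data, you would pass the Weight column of the dataset to `hist()` instead of the simulated vector.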
6. Which distribution?
Since the variable Weight can take any value within the range considered and is not restricted to integer values, we should think about using continuous probability distributions like the Gaussian.
Even without any statistical test, just by looking at the histogram, we can see that Weight is distributed approximately as a mixture of two Gaussian distributions.
Thus, the distribution to be considered for this example is the Gaussian distribution in one dimension.
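Written out, and anticipating the two components that the histogram suggests, the model can be sketched as follows. The symbols $p_k$, $\mu_k$ and $\sigma_k$ are notation assumed here for the proportion, mean and standard deviation of cluster $k$; they do not come from the course material.

```latex
f(x) = p_1 \, \mathcal{N}(x \mid \mu_1, \sigma_1) + p_2 \, \mathcal{N}(x \mid \mu_2, \sigma_2),
\qquad p_1 + p_2 = 1, \quad p_1, p_2 \ge 0,

\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}
\exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
```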
7. How many clusters?
The second question when fitting a mixture model is how many clusters should be considered.
In this example, we take only two clusters, which matches, at least graphically, what the histogram suggests.
Later, we will apply a statistical criterion called BIC, which will help us pick the number of clusters.
8. Which parameters and how to estimate them?
The third question is: what are the parameters, and how do we estimate them?
Since we are considering two clusters, each following a Gaussian distribution, we need to estimate the mean and the standard deviation of each Gaussian plus the two proportions. That makes six values in total, although only five are free because the proportions must sum to one.
To do so, we will apply the expectation-maximization algorithm. In the last chapter, you implemented a simple version of this algorithm, so you have an intuition for how it works.
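As a reminder of the idea, here is a minimal base R sketch of expectation-maximization for two univariate Gaussians. It is not the implementation from the previous chapter: the data are simulated, and the starting values, the fixed number of iterations and all variable names are assumptions made for illustration.

```r
# Minimal EM for a two-component univariate Gaussian mixture (illustrative sketch).
set.seed(1)
x <- c(rnorm(500, mean = 135, sd = 19), rnorm(500, mean = 187, sd = 20))

# Rough starting values (hypothetical choice: split the data at the median).
mu    <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
sigma <- c(sd(x), sd(x))
prop  <- c(0.5, 0.5)

for (iter in 1:200) {
  # E-step: posterior probability that each observation belongs to each cluster.
  d1 <- prop[1] * dnorm(x, mu[1], sigma[1])
  d2 <- prop[2] * dnorm(x, mu[2], sigma[2])
  post1 <- d1 / (d1 + d2)
  post2 <- 1 - post1

  # M-step: update proportions, means and standard deviations
  # using the posterior probabilities as weights.
  prop  <- c(mean(post1), mean(post2))
  mu    <- c(sum(post1 * x) / sum(post1), sum(post2 * x) / sum(post2))
  sigma <- c(sqrt(sum(post1 * (x - mu[1])^2) / sum(post1)),
             sqrt(sum(post2 * (x - mu[2])^2) / sum(post2)))
}

# Estimated means, standard deviations and proportions, one column per cluster.
round(rbind(mean = mu, sd = sigma, proportion = prop), 2)
```

A fixed iteration count keeps the sketch short; a more careful version would stop once the log-likelihood no longer improves.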
The `flexmix` package simplifies this process, since it already implements the algorithm for many different distributions, as you will see.
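A fit along these lines might look like the sketch below. It assumes the `flexmix` package is installed and uses a simulated data frame named `gender` with a single `Weight` column as a stand-in for the real dataset; the simulated means and standard deviations are made up.

```r
library(flexmix)

# Simulated stand-in for the gender data (hypothetical values, not the real data).
set.seed(1)
gender <- data.frame(
  Weight = c(rnorm(500, mean = 135, sd = 19), rnorm(500, mean = 187, sd = 20))
)

# Fit a univariate Gaussian mixture with two clusters via EM.
fit <- flexmix(Weight ~ 1, data = gender, k = 2, model = FLXMCnorm1())

parameters(fit)        # estimated mean and sd of each component
prior(fit)             # estimated proportions
table(clusters(fit))   # number of observations assigned to each cluster
```

Because `flexmix` starts from a random initialization, the order of the components, and occasionally the solution itself, can vary with the seed.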
9. Let's practice!
But first let's put this into practice.