1. Bivariate Gaussian Mixture Models
2. Gender data
In the last two lessons, we applied a univariate Gaussian mixture model to cluster the gender dataset using the Weight variable. Sometimes, however, it is worth considering that other available variables carry important information that can improve the clustering.
In this lesson, we will delve into clustering with bivariate Gaussian distributions.
3. Exploratory data analysis
You saw that in the histogram for the variable Weight, you can identify a composition of two distributions.
When we add another variable, such as BMI, the density plot becomes three-dimensional, but the behavior is similar: we can still see two subpopulations.
4. Modeling with mixture models
So the answers to the three questions will change a bit.
The suitable distribution, now that we are considering two variables, is the bivariate Gaussian distribution, which I will describe in the following slides.
The second answer is still two clusters.
And for the third question, we now need to estimate the means for each distribution, which are now in two dimensions.
Moreover, the concept of standard deviation extends to incorporate not only the level of dispersion within the variable, as in the univariate case, but also between the variables as we will see.
The estimation will be done in the next lesson. It is performed by the EM algorithm, which is implemented in the `flexmix` library.
5. Bivariate Gaussian distribution
To understand how the model changes when considering two variables, let's start by describing the bivariate Gaussian distribution.
To illustrate this distribution, instead of using three-dimensional plots, it is more convenient to depict it in a two-dimensional scatterplot, where the axes represent the variables and the dots represent the observations.
The area enclosed by the red ellipse contains 95% of the observations, and the red dot is the mean of the distribution.
Observe that now the mean is formed by two values.
For this toy example, the mean is ten for variable one and five for variable two.
The standard deviation, or more precisely the variance, which is the square of the standard deviation, is now a matrix of four values instead of a single value.
The diagonal values of the matrix are the corresponding variances of the distributions shown on the axes.
The cross-terms are always equal to each other and correspond to the covariance, which measures the joint variability between the two variables. In this case, these values are zero.
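Written out, the toy example looks roughly like the block below. The notation is not taken from the slides and is only an assumption for illustration: mu for the mean vector, Sigma for the covariance matrix, sigma_1^2 and sigma_2^2 for the two variances, and sigma_12 for the covariance. The diagonal values are left symbolic because the lesson does not state them. The last line is the usual way to describe a 95% ellipse for a bivariate Gaussian, where 5.99 is approximately the 95% quantile of a chi-square distribution with two degrees of freedom.

```latex
\mu = \begin{pmatrix} 10 \\ 5 \end{pmatrix},
\qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
       = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}

(x - \mu)^\top \Sigma^{-1} (x - \mu) \le 5.99
```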
6. Bivariate Gaussian distribution
To better understand the implication of the cross-term in the covariance matrix, let's consider the same values for the mean and for the diagonal of the matrix.
For example, if we change the cross-terms from zero to twenty, the two variables now tend to change together.
That is, when the value of variable 1 increases, the value of variable 2 is expected to increase as well.
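To see the effect of the cross-term numerically, here is a minimal sketch in plain Python (the course itself works in R; Python is used here only as an illustration). It keeps the toy mean of ten and five and the cross-term of twenty, but the diagonal values of 25 are an assumption, since the lesson does not state them. Any choice works as long as the product of the two variances is at least 400, otherwise the matrix is not a valid covariance matrix. The names `draw`, `sample_cov` and `sample_corr` are made up for this example.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

MEAN = (10.0, 5.0)          # toy mean from the lesson
VAR_1, VAR_2 = 25.0, 25.0   # ASSUMED diagonal values (not given in the lesson)
COV_12 = 20.0               # cross-term from the lesson

# Cholesky factor L of [[VAR_1, COV_12], [COV_12, VAR_2]], so that L L^T = Sigma.
l11 = math.sqrt(VAR_1)
l21 = COV_12 / l11
l22 = math.sqrt(VAR_2 - l21 ** 2)


def draw():
    """Draw one observation from the bivariate Gaussian."""
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    return (MEAN[0] + l11 * z1, MEAN[1] + l21 * z1 + l22 * z2)


n = 20000
sample = [draw() for _ in range(n)]

mean_1 = sum(x for x, _ in sample) / n
mean_2 = sum(y for _, y in sample) / n

# Sample variances and covariance (unbiased, divided by n - 1).
var_1 = sum((x - mean_1) ** 2 for x, _ in sample) / (n - 1)
var_2 = sum((y - mean_2) ** 2 for _, y in sample) / (n - 1)
sample_cov = sum((x - mean_1) * (y - mean_2) for x, y in sample) / (n - 1)
sample_corr = sample_cov / math.sqrt(var_1 * var_2)

print(f"sample covariance:  {sample_cov:.2f}")
print(f"sample correlation: {sample_corr:.2f}")
```

With these assumed variances, the sample covariance should come out close to twenty and the correlation close to 0.8. A positive value says that the two variables tend to increase together, not that they increase by the same amount.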
7. Coming back to the Gender data
Now, coming back to the Gender dataset, we will cluster the data with bivariate Gaussian distributions, considering two clusters.
The parameters are the two proportions, the two means, where each mean is formed by two values, and the two covariance matrices, each of them with four values.
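Put as a formula, the model just described can be sketched as follows. The symbols are assumptions for illustration and do not come from the slides: pi_k for the proportions, mu_k for the means, Sigma_k for the covariance matrices, and N for the bivariate Gaussian density. The second line simply counts the values listed above.

```latex
f(x) = \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(x \mid \mu_2, \Sigma_2),
\qquad \pi_1 + \pi_2 = 1

\underbrace{2}_{\text{proportions}}
+ \underbrace{2 \times 2}_{\text{means}}
+ \underbrace{2 \times 4}_{\text{covariance entries}} = 14
```

Not all fourteen values are free. The proportions sum to one, and the cross-terms in each covariance matrix are equal, so only 1 + 4 + 2 x 3 = 11 values actually have to be estimated.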
Since we have defined the problem, we are now ready to estimate the parameters with `flexmix`.
8. Let's practice!
But first, let's try some examples.