1. Univariate Gaussian Mixture Models with flexmix
2. Gender dataset
In the previous lesson, we analyzed the gender data to cluster the observations by the variable Weight.
3. Modeling with mixture models
And we said that a suitable distribution to cluster this sort of data was the Gaussian.
We also determined that we will consider two clusters.
In this lesson, I'll show you how to fit this model with `flexmix` and how we can find the three parameters for each cluster.
4. flexmix function
The library `flexmix` has a main function called `flexmix()`. This function estimates the parameters of the specified model.
The principal arguments of the `flexmix()` function are, first, the formula, which describes the model to be fitted. For the scope of this course, the right-hand side of the formula will simply be 1, because the cluster parameters don't depend on any other variables.
The second is the data frame.
The third, k, represents the number of clusters, which in our example will be two.
The fourth, model, specifies the distribution to be considered. In our example, this is the univariate normal distribution, represented by `FLXMCnorm1()`.
Later, we'll use other distributions.
And finally, the argument control specifies the maximum number of iterations in the EM algorithm and the tolerance accepted for stopping the algorithm, among other things. A minimal call sketch follows below.
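Putting the arguments together, a call to `flexmix()` has roughly this shape. This is only a sketch; the data frame `df` and the column `y` are placeholder names, not objects from the course.

```r
library(flexmix)

# Minimal sketch of a flexmix() call; `df` and `y` are placeholders
fit <- flexmix(y ~ 1,                          # cluster parameters don't depend on covariates
               data = df,                      # the data frame
               k = 2,                          # number of clusters
               model = FLXMCnorm1(),           # univariate Gaussian components
               control = list(iter.max = 10000,   # EM iteration cap
                              tolerance = 1e-6))  # stopping tolerance (illustrative value)
```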
5. Fit univariate Gaussian mixture model
For our example, we use the variable Weight from the gender dataset to create the two clusters, with a tiny tolerance and a maximum of 10,000 iterations.
Setting the verbose argument to 1, we get the partial results, which show us that the algorithm converged after 3457 iterations.
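The fit just described could look like the sketch below; the data frame name `gender`, the seed, and the exact tolerance value are assumptions on my part rather than values quoted from the lesson.

```r
library(flexmix)

set.seed(2021)  # the EM algorithm is sensitive to its random initialization

fit_mix <- flexmix(Weight ~ 1,
                   data = gender,                    # assumed name of the data frame
                   k = 2,                            # two clusters
                   model = FLXMCnorm1(),             # univariate Gaussian components
                   control = list(tolerance = 1e-15, # a tiny tolerance (assumed value)
                                  iter.max = 10000,  # at most 10,000 EM iterations
                                  verbose = 1))      # report progress every iteration
```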
6. The proportions: prior function
To see which proportions are estimated by flexmix, we can use the function `prior()`.
In this case, the estimated proportions tell us that each of the clusters has approximately the same importance.
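Assuming the fitted object from the previous step is stored in `fit_mix`:

```r
# Estimated mixing proportions, one value per cluster;
# per the lesson, they come out roughly equal for this data
prior(fit_mix)
```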
7. The means and the sds: parameters function
To recover both means and both standard deviations, we use the function `parameters()`.
If we want to extract each cluster's parameters separately, we can use the argument `component`.
For this example, we see that the first cluster has a mean of around 136 and a standard deviation of 19. The second cluster is explained by a Gaussian with a mean of 187 and a standard deviation of 20.
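In code, again with the assumed `fit_mix` object:

```r
# All parameters at once: one column per cluster, with the mean and sd of each
parameters(fit_mix)

# Parameters of a single cluster, selected with the component argument
parameters(fit_mix, component = 1)   # roughly mean 136, sd 19
parameters(fit_mix, component = 2)   # roughly mean 187, sd 20
```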
8. Visualize the resulting distributions
Since we have the parameters estimated, we can plot each distribution along with the density histogram to check how well the clusters fit the data.
To do so, we use `stat_function`, which can plot any density distribution we want.
In this case, I am using a function called `fun_prop` which takes as arguments the mean, the standard deviation and the proportion of each cluster, and draws the corresponding distribution.
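Here is a sketch of that plot, assuming ggplot2 and a `fun_prop()` helper in the spirit of the one described; its exact definition is not shown in the lesson, so this version is hypothetical.

```r
library(ggplot2)

# Hypothetical fun_prop(): a normal density scaled by the cluster's proportion
fun_prop <- function(x, mean, sd, proportion) {
  proportion * dnorm(x, mean = mean, sd = sd)
}

props <- prior(fit_mix)
c1 <- parameters(fit_mix, component = 1)   # named vector with mean and sd
c2 <- parameters(fit_mix, component = 2)

ggplot(gender, aes(x = Weight)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = fun_prop,
                args = list(mean = c1["mean"], sd = c1["sd"],
                            proportion = props[1]), colour = "red") +
  stat_function(fun = fun_prop,
                args = list(mean = c2["mean"], sd = c2["sd"],
                            proportion = props[2]), colour = "blue")
```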
9. Visualize the resulting distributions
The plot looks like this.
You can observe that the estimated distributions nicely fit the data.
10. The probabilities and assignments
Another function that comes with `flexmix` is the function `posterior`, which provides us with the probability of belonging to each cluster.
For example, in the first row, there is a 99.9% probability that the observation belongs to the second cluster.
If we want to assign the observations to one of the two clusters, we can assign each of them to the cluster with the highest probability.
Using the function `clusters()`, this task is straightforward.
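Both functions act directly on the fitted object; with the assumed `fit_mix`:

```r
# Posterior probabilities: one row per observation, one column per cluster
head(posterior(fit_mix))

# Hard assignments: each observation goes to its most probable cluster
gender$cluster <- clusters(fit_mix)
head(gender$cluster)
```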
11. Assignments comparison
Finally, since this was an example where the gender labels were originally provided, we can compare the clustering results with the true values.
Using the function `table()`, we can build a frequency table whose first row tells us that of the 5000 observations labeled as female, 4500 were assigned to cluster 1 and the rest to cluster 2.
For the males, however, only 444 were assigned to cluster 1.
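As a sketch, assuming the true labels live in a column called `Gender`:

```r
# Rows: true labels; columns: cluster assignments from the mixture model
table(gender$Gender, clusters(fit_mix))
```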
12. Let's practice!
Now let's try some examples.