1. Gaussian distribution
In the last lesson, we actually fit a mixture of two Gaussian, or normal, distributions.
2. Mixture model on the Gender dataset
Here are the real gender labels of each observation, together with the ellipses, which depict the most probable zones in which to find each subpopulation according to the fit.
Since mixture models are based on probability distributions, in this lesson we'll start by studying one of the most famous, the Gaussian, before diving back into the structure of a mixture model in Chapter 2.
It is worth saying that this lesson is not an exhaustive treatment of the Gaussian distribution, but rather a reference for mixture models.
3. Packages for fitting mixture models
First, though, let's comment on the packages to fit mixture models in R.
Currently, there are many packages on CRAN that can fit mixture models; the most popular ones are:
`mixtools`, which is a great library, but the Poisson distribution is not implemented so far, and we'll need it later in the course.
`bayesmix`, which uses Bayesian inference, and is outside of the scope of this course.
`EMCluster`, which is really easy to use but only works with Gaussian distributions.
`flexmix`, which is the one you will learn, because not only are plenty of probability distributions implemented, but it also gives you the possibility of going deeper with mixture models if you choose; see the short installation sketch after this list.
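As a quick reference, here is a minimal sketch of how you would install and load `flexmix` from CRAN; the call is standard R, and the installation step only needs to be run once.

```r
# Install flexmix from CRAN (only needed once), then load it for the session
install.packages("flexmix")
library(flexmix)
```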
4. Properties of Gaussian distribution
Gaussian distributions are characterized by two parameters: the mean and the standard deviation.
The mean represents the central point where the values tend to fall.
And the standard deviation is the measure that determines the degree to which the values differ from the mean. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out.
The range formed by four standard deviations centered on the mean, that is, two standard deviations on each side, covers approximately 95 percent of the probable values.
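If you want to check this rule yourself, a small sketch with `pnorm()` does the job; the standard normal used here (mean 0, sd 1) is just an illustrative choice.

```r
# Probability mass inside mean +/- 2 standard deviations for a standard normal
mean_val <- 0
sd_val <- 1
pnorm(mean_val + 2 * sd_val, mean = mean_val, sd = sd_val) -
  pnorm(mean_val - 2 * sd_val, mean = mean_val, sd = sd_val)
# approximately 0.954
```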
5. Sample from a Gaussian distribution
To understand mixture models, we can start by simulating the random variables involved, to picture what we will try to model.
We begin by sampling from a univariate Gaussian distribution using the function `rnorm()`. This function takes the number of samples we want, the mean and sd, which refers to the standard deviation.
In this example, we generate 100 values from a Gaussian distribution with a mean of 10 and a standard deviation of 5.
Using the function `head()`, we can see the first six values from our sample.
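A sketch of that sampling step is shown below; the object name `gaussian_sample` and the call to `set.seed()` are illustrative additions, used only so the example is reproducible.

```r
set.seed(1234)  # arbitrary seed, only for reproducibility of this sketch

# Draw 100 values from a Gaussian with mean 10 and standard deviation 5
gaussian_sample <- rnorm(n = 100, mean = 10, sd = 5)

# Inspect the first six values
head(gaussian_sample)
```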
6. Estimation of the mean
Usually, when we collect data we don't know the mean and the standard deviation in advance, so we need to estimate these parameters.
To estimate the mean, we simply take the sample mean of the observations.
For this data, the estimated mean is 10.36, very close to the true value of 10.
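In code, assuming the illustrative `gaussian_sample` vector from the earlier sketch, this is simply:

```r
# Sample mean as the estimate of the Gaussian mean
mean_estimate <- mean(gaussian_sample)
mean_estimate  # around 10.36 in the lesson; your exact value will differ
```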
7. Estimation of the Standard Deviation (sd)
To estimate the standard deviation, we start by subtracting the estimated mean from each observation; then we square these quantities and take their mean. Finally, we take the square root of this quantity.
Here I show you how to calculate the standard deviation manually, but for the purpose of this course, we can just use the `sd` function.
Observe that the estimated standard deviation is quite similar to the real value of 5.
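Here is a sketch of both routes, again using the illustrative `gaussian_sample` vector; note that the manual formula described above divides by n, while R's `sd()` divides by n - 1, so the two values differ slightly.

```r
# Manual estimate: subtract the mean, square, average, then take the square root
sd_manual <- sqrt(mean((gaussian_sample - mean(gaussian_sample))^2))

# Built-in estimator used for the rest of the course
sd_estimate <- sd(gaussian_sample)

c(manual = sd_manual, sd_function = sd_estimate)
```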
8. Visualizing the estimated Gaussian distribution
To create a histogram in R we use the function `geom_histogram()`, specifying that the y aesthetic equals the density so the bars are on the density scale rather than raw counts. Observe that we put the aesthetic inside `geom_histogram()`.
We also add the estimated curve using `stat_function()`, specifying that the geometry is a line, the function is the density of the Gaussian, and the arguments correspond to the estimated mean and standard deviation.
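A sketch of that plot is below, reusing the illustrative objects from the earlier snippets; note that newer versions of ggplot2 write the density aesthetic as `after_stat(density)` rather than the older `..density..` notation.

```r
library(ggplot2)

ggplot(data.frame(x = gaussian_sample)) +
  # Histogram on the density scale; the aesthetic goes inside geom_histogram()
  geom_histogram(aes(x = x, y = after_stat(density)), bins = 20) +
  # Estimated Gaussian density curve drawn as a line
  stat_function(geom = "line", fun = dnorm,
                args = list(mean = mean_estimate, sd = sd_estimate))
```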
9. Visualizing the sample with estimated Gaussian distribution
See how the estimated distribution fits the observations quite nicely!
10. Let's practice!
Now let's try some examples.