1. Checking normality of multivariate data
Normality is a convenient assumption used to justify and simplify parametric statistical tests, such as the two-sample t-test, which is designed to test the difference between two populations on a single variable. Without univariate normality assumptions on the two populations being compared, we have to rely on non-parametric tests, which are often less powerful and more computationally expensive.
2. Why check normality?
Most classical multivariate techniques depend on multivariate normality. These techniques include multivariate regression, discriminant analysis, model-based clustering, principal component analysis, and multivariate analysis of variance.
Before using any such technique, we should check multivariate normality of the variables. We will start with tests for univariate normality and then discuss techniques for testing multivariate normality.
3. Review: univariate normality tests
The normal Q-Q plot, drawn in R with qqnorm(), is a widely used graphical technique for checking univariate normality. It is a scatterplot of the sample quantiles of the data against the theoretical quantiles of a normal distribution.
The graph shows the univariate normality check for the first column of the iris_raw data. We use the qqnorm() function on the relevant column, followed by the qqline() function for the reference straight line.
If the points lie close to a straight line, the data are consistent with a normal distribution with some mean and variance.
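The two calls just described can be sketched as follows. This is a minimal example that assumes R's built-in iris data frame can stand in for the course's iris_raw data; the plot title and line colour are arbitrary choices.

```r
# Built-in iris stands in for the course's iris_raw data (an assumption).
sepal_length <- iris[, 1]

# qqnorm() plots the sample quantiles (y-axis) against the theoretical
# normal quantiles (x-axis) and invisibly returns both sets of coordinates.
qq <- qqnorm(sepal_length, main = "Normal Q-Q plot: Sepal.Length")

# qqline() adds the reference line through the first and third quartiles.
qqline(sepal_length, col = "red")
```

Points that hug the red line are consistent with normality; systematic bends at either end are the deviations discussed next.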
4. Review: univariate normality tests
In this case, we see that near the two extremes the points deviate from the straight line, indicating that the assumption of normality might not be valid.
Deviation from the straight line might indicate heavier tails, the presence of skewness, outlying observations, or clustering of observations.
5. qqnorm of all variables
Although univariate normality does not imply multivariate normality, if any single variable fails to follow a normal distribution we cannot have joint multivariate normality. The mvn() function with the argument univariatePlot = "qqplot" allows us to check the univariate normality of each variable with a single command. Among the four Q-Q plots, the two upper plots, corresponding to sepal length and sepal width, are the most consistent with univariate normal distributions, as the departure from the reference line is minimal.
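For readers without the MVN library at hand, a similar four-panel display can be approximated in base R. This is a sketch rather than the mvn() output itself, and it again assumes the built-in iris data in place of iris_raw.

```r
# Built-in iris stands in for the course's iris_raw data (an assumption).
X <- iris[, 1:4]

# Arrange the plots in a 2 x 2 grid and remember the old settings.
old_par <- par(mfrow = c(2, 2))

# One normal Q-Q plot with a reference line per variable.
for (v in names(X)) {
  qqnorm(X[[v]], main = v)
  qqline(X[[v]], col = "red")
}

# Restore the previous graphics settings.
par(old_par)
```

The panel order follows the column order of the data, so the sepal measurements appear in the top row and the petal measurements in the bottom row.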
6. MVN library multivariate normality test functions
For multivariate normality, the MVN library contains several analytical and graphical tests.
7. MVN library multivariate normality test functions
In this course, we will focus on Mardia’s test, Henze-Zirkler’s test, and the chi-square Q-Q plot, as the other tests for normality can be implemented in a similar fashion.
8. Using Mardia Test to check multivariate normality
To perform Mardia's test on the iris dataset, we call the mvn() function with the first four columns of the iris_raw dataset and set the option mvnTest = "mardia".
The output provides tests for skewness and kurtosis.
Skewness measures the asymmetry of a distribution, and kurtosis measures how heavy its tails are, that is, the relative proportion of extreme observations. For the data to be consistent with multivariate normality, both p-values should be greater than 0.05.
Here the p-value for skewness is less than 0.05, so we reject multivariate normality for these data.
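To make the two statistics concrete, here is a hand-rolled base R sketch of Mardia's skewness and kurtosis tests using the standard textbook formulas. The function name mardia_sketch is hypothetical, the built-in iris data stand in for iris_raw, and the covariance uses divisor n; this is not the MVN library's code, so small numerical differences from mvn() are possible.

```r
# Hypothetical helper: Mardia's multivariate skewness and kurtosis tests.
mardia_sketch <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  Xc <- scale(X, center = TRUE, scale = FALSE)  # centre each column
  S  <- crossprod(Xc) / n                       # covariance with divisor n
  G  <- Xc %*% solve(S) %*% t(Xc)               # g_ij = (x_i - m)' S^-1 (x_j - m)

  b1 <- sum(G^3) / n^2                          # multivariate skewness
  b2 <- sum(diag(G)^2) / n                      # multivariate kurtosis

  # Under normality: n * b1 / 6 is approximately chi-square,
  # and the standardised kurtosis is approximately standard normal.
  skew_stat <- n * b1 / 6
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_stat <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)

  data.frame(
    test      = c("skewness", "kurtosis"),
    statistic = c(skew_stat, kurt_stat),
    p_value   = c(pchisq(skew_stat, df = skew_df, lower.tail = FALSE),
                  2 * pnorm(abs(kurt_stat), lower.tail = FALSE))
  )
}

# Built-in iris stands in for the course's iris_raw data (an assumption).
res <- mardia_sketch(iris[, 1:4])
print(res)
```

Reading the result follows the rule above: multivariate normality is rejected if either p-value falls below 0.05.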
9. Using qqplot from Mardia Test to check multivariate normality
Additionally, if we want to view the Q-Q plot corresponding to the multivariate normality test, we set multivariatePlot = "qq".
The Q-Q plot agrees with the numeric output: we can see a clear departure from the reference line in the upper tail, which is consistent with the rejection of multivariate normality.
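The idea behind the chi-square Q-Q plot can be sketched in base R: under multivariate normality, the squared Mahalanobis distances of the observations are approximately chi-square distributed with degrees of freedom equal to the number of variables. The code below assumes the built-in iris data in place of iris_raw; the covariance estimate and plotting positions are choices made here and may differ from those used inside mvn().

```r
# Built-in iris stands in for the course's iris_raw data (an assumption).
X <- as.matrix(iris[, 1:4])
n <- nrow(X)
p <- ncol(X)

# Squared Mahalanobis distance of each observation from the sample mean.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Theoretical chi-square quantiles with p degrees of freedom.
theo <- qchisq(ppoints(n), df = p)

# Ordered distances against theoretical quantiles, with a y = x reference line.
plot(theo, sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Squared Mahalanobis distance",
     main = "Chi-square Q-Q plot")
abline(0, 1, col = "red")
```

Points rising above the line at the right-hand end correspond to observations that lie farther from the centre than a multivariate normal distribution would predict.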
10. Using Henze-Zirkler's test to check multivariate normality
The mvn() function with the option mvnTest = "hz" can be used to run Henze-Zirkler's multivariate normality test. Again, the test shows that the data do not follow a multivariate normal distribution.
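The mvn() calls for Mardia's and Henze-Zirkler's tests might look like the sketch below. It uses the argument names given in this course and assumes they match an MVN 5.x release; the version guard is a precaution added here, on the assumption that newer releases may use different argument names. The built-in iris data again stand in for iris_raw.

```r
# Built-in iris stands in for the course's iris_raw data (an assumption).
X <- iris[, 1:4]
results <- NULL

# Run only if an MVN release with the course's argument names is installed
# (assumed here to be any version before 6.0.0).
if (requireNamespace("MVN", quietly = TRUE) &&
    utils::packageVersion("MVN") < "6.0.0") {
  results <- list(
    mardia = MVN::mvn(data = X, mvnTest = "mardia"),
    hz     = MVN::mvn(data = X, mvnTest = "hz")
  )
  # The multivariate test results are stored in $multivariateNormality.
  print(results$mardia$multivariateNormality)
  print(results$hz$multivariateNormality)
} else {
  message("MVN (5.x assumed) is not available; skipping the mvn() calls.")
}
```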
11. Testing multivariate normality by species
Although multivariate normality was rejected for the iris dataset as a whole, we can also test multivariate normality for each species. The Q-Q plot and the numeric test on the setosa species give no evidence against the multivariate normality assumption for this subset.
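A per-species check could be sketched by subsetting the rows before calling mvn(). As before, this assumes the built-in iris data (with its Species column and the label "setosa") in place of iris_raw, and an MVN 5.x release whose argument names match those used in this course.

```r
# Keep only the setosa rows and the four numeric columns.
# Built-in iris stands in for the course's iris_raw data (an assumption).
setosa <- iris[iris$Species == "setosa", 1:4]
setosa_result <- NULL

# Version guard added here as a precaution (MVN 5.x argument names assumed).
if (requireNamespace("MVN", quietly = TRUE) &&
    utils::packageVersion("MVN") < "6.0.0") {
  # Mardia's test plus the chi-square Q-Q plot for the subset.
  setosa_result <- MVN::mvn(data = setosa,
                            mvnTest = "mardia",
                            multivariatePlot = "qq")
  print(setosa_result$multivariateNormality)
} else {
  message("MVN (5.x assumed) is not available; skipping the mvn() call.")
}
```

The same pattern applies to the other two species by changing the label in the subsetting step.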
12. Let's make use of the tests for multivariate normality!
Now, let's practice the graphical and analytical tests for multivariate normality.