1. Mean vector and variance-covariance matrix
Now that we have read in the multivariate data, we will explore the dataset by looking at summary statistics such as means and variances.
2. Mean represents the location of the distribution
The univariate mean identifies the location of a distribution on the number line. For multivariate data, the mean identifies the location of the distribution in a multidimensional space. In the figure on the right, the red dot represents the bivariate mean.
3. Variance-covariance matrix is the spread
The univariate variance specifies how spread out the observations are from the mean, given by the red line segment.
The variance-covariance matrix measures the spread of multivariate data in several directions. The angle and width of the two red line segments on the right specify the spread of the data along the major sources of variation, which is also shown by the gray ellipse.
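One common way to obtain such directions is an eigendecomposition of the variance-covariance matrix. The sketch below is only an illustration under assumptions: it uses two columns of R's built-in iris data frame rather than the data in the figure, and the names petal, centre, S and eig are made up for this example.

```r
# Assumption: built-in iris is used as stand-in data; the figure's data is not available.
petal <- iris[, c("Petal.Length", "Petal.Width")]

centre <- colMeans(petal)   # bivariate mean (the "location")
S <- var(petal)             # 2 x 2 variance-covariance matrix (the "spread")

# Eigenvectors give the directions of the main axes of variation,
# and the square roots of the eigenvalues give the spread along each axis.
eig <- eigen(S)
print(centre)
print(eig$vectors)
print(sqrt(eig$values))
```

The first eigenvector points along the long axis of the ellipse and the second along the short axis.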
4. Calculating the mean
To calculate the mean of all four measurements, disregarding Species, we use the colMeans() function on the first four columns of the iris_raw dataset.
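A minimal sketch of this step follows. It assumes R's built-in iris data frame as a stand-in for iris_raw, which is not reproduced here, so the numbers may differ slightly from those on the slide; iris_num and overall_means are illustrative names.

```r
# Assumption: built-in iris stands in for iris_raw.
iris_num <- iris[, 1:4]              # the four measurement columns, Species dropped

# colMeans() returns one mean per column, i.e. the overall mean vector
overall_means <- colMeans(iris_num)
print(overall_means)
```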
Since these observations come from different species, we should calculate the mean vector for each species. There are two functions, by() and aggregate(), which subset the data and calculate the mean within the subset.
5. Calculating the group mean using by
The first argument of the by() function is the data containing the variables whose means we want to calculate. The second argument, INDICES, is the variable we want to group by, and the third argument, FUN, is the function applied to each group, for example, colMeans(). Here we calculate the means of the first four columns, grouping by species.
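The call described above might look like the following sketch, again assuming the built-in iris data frame in place of iris_raw; group_means_by is an illustrative name.

```r
# Assumption: built-in iris stands in for iris_raw.
group_means_by <- by(
  data    = iris[, 1:4],   # variables whose means we want
  INDICES = iris$Species,  # grouping variable
  FUN     = colMeans       # function applied within each group
)
print(group_means_by)

# The mean vector of a single species can be extracted by name
print(group_means_by[["setosa"]])
```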
6. Calculating the group mean using aggregate
In contrast, the aggregate() function uses a formula interface. The period on the left-hand side of the formula indicates that we want the means of all variables in the data frame, except the variable specified as the one to group by, which is listed immediately after the tilde.
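As a sketch of the formula interface, assuming the built-in iris data frame in place of iris_raw (group_means_agg is an illustrative name):

```r
# Assumption: built-in iris stands in for iris_raw.
# ". ~ Species" means: all other variables, grouped by Species.
group_means_agg <- aggregate(. ~ Species, data = iris, FUN = mean)
print(group_means_agg)
```

Unlike by(), which returns a list-like object, aggregate() returns a data frame with one row per group, which is often easier to use for plotting.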
7. Calculating the variance-covariance and correlation matrices
The variance-covariance matrix of the data frame can be calculated using the var() function. It produces a variance-covariance matrix where the rows and columns have the same names as the dataset columns.
The diagonal elements of this matrix are the individual variances. For example, 0.6857, the first entry, is the variance of sepal length. The off-diagonal elements give the covariance between the corresponding variables. For example, -0.0393 is the covariance between sepal width and sepal length.
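The structure of this matrix can be inspected with a short sketch. It assumes the built-in iris data frame in place of iris_raw, so the values may not match the ones quoted above exactly; vcov_mat is an illustrative name.

```r
# Assumption: built-in iris stands in for iris_raw.
vcov_mat <- var(iris[, 1:4])
print(vcov_mat)

# Diagonal entries: the variance of each variable
print(diag(vcov_mat))

# Off-diagonal entry: covariance between sepal length and sepal width
print(vcov_mat["Sepal.Length", "Sepal.Width"])
```

Because the matrix is symmetric, the entry in row Sepal.Width, column Sepal.Length holds the same value.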
The correlation matrix, which can be calculated using the cor() function, generalizes the concept of correlation between two variables. The off-diagonal elements give the correlation between the ith and jth variables, and the diagonal entries are always 1, since each is a variable's correlation with itself.
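A brief sketch of the correlation matrix, assuming the built-in iris data frame in place of iris_raw (cor_mat and same_as_rescaled are illustrative names). It also checks that rescaling the variance-covariance matrix with cov2cor() gives the same result.

```r
# Assumption: built-in iris stands in for iris_raw.
cor_mat <- cor(iris[, 1:4])
print(cor_mat)

# The diagonal is all ones: each variable correlates perfectly with itself
print(diag(cor_mat))

# The correlation matrix is the variance-covariance matrix rescaled
# by the standard deviations; cov2cor() performs that rescaling.
same_as_rescaled <- all.equal(cor_mat, cov2cor(var(iris[, 1:4])))
print(same_as_rescaled)
```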
8. Visualization of correlation matrix
The corrplot() function from the corrplot package enables us to visualize correlations. Here, we used the argument method = "ellipse". The sign of the correlation is shown by the color and tilt of the ellipse, and the magnitude is shown by how much its shape differs from a circle. For example, the very high positive correlation between petal width and petal length is represented by a thin, right-tilted, dark blue ellipse. In contrast, the circle-like shape representing the correlation between sepal width and sepal length signifies a small correlation between those variables.
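A sketch of this plot is shown below. It assumes the corrplot package is installed and uses the built-in iris data frame in place of iris_raw; the requireNamespace() guard is added here only so the snippet does not fail when the package is missing.

```r
# Assumption: built-in iris stands in for iris_raw.
cor_mat <- cor(iris[, 1:4])

# Draw the plot only if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "ellipse")
}
```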
9. Interpretation of means
Next, we interpret the mean as a location parameter. Plotting the means of petal length and petal width as colored triangles, we see that the means differ among the three species.
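A plot of this kind could be produced in base R roughly as follows. This is a sketch, not the code behind the slide: it assumes the built-in iris data frame in place of iris_raw, and the colors come from R's default palette rather than the ones in the figure. The name species_means is illustrative.

```r
# Assumption: built-in iris stands in for iris_raw.
# Mean petal length and width for each species
species_means <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                           data = iris, FUN = mean)

# Scatter plot of the observations, one color per species
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species),
     xlab = "Petal length", ylab = "Petal width")

# Overlay the species means as large filled triangles (pch = 17);
# rows of species_means follow the factor level order, so colors 1:3 match
points(species_means$Petal.Length, species_means$Petal.Width,
       col = 1:3, pch = 17, cex = 2)
```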
10. Interpretation of variances
Visualizing the variance-covariance matrix using the colored ellipses shows how compact or dispersed the values are within each group. The small red ellipse corresponds to setosa, whose variances are 0.03 and 0.011, whereas the comparatively thinner ellipse for versicolor reflects the stronger correlation in that species.
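The quantities behind these ellipses can be computed per species with a short sketch. It assumes the built-in iris data frame in place of iris_raw; petal_by_species, petal_cov and petal_cor are illustrative names.

```r
# Assumption: built-in iris stands in for iris_raw.
# Split the two petal measurements into one data frame per species
petal_by_species <- split(iris[, c("Petal.Length", "Petal.Width")], iris$Species)

# One 2 x 2 variance-covariance matrix per species
petal_cov <- lapply(petal_by_species, var)
print(petal_cov)

# Variances for setosa (the size of its ellipse)
print(diag(petal_cov$setosa))

# Correlation per species (how thin and tilted each ellipse is)
petal_cor <- sapply(petal_by_species,
                    function(d) cor(d$Petal.Length, d$Petal.Width))
print(petal_cor)
```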
11. Let's practice!
Now it's your turn.