
The Linear Algebra Behind PCA

1. The Linear Algebra Behind PCA

Now that we've seen that some "big data" can be looked at differently, it's time to see how this is done using the linear algebra we've learned in this course.

2. Theory

The matrix A transpose is the matrix made by interchanging the rows and columns of A. Suppose your data set is in a matrix A, the mean of each column has been subtracted from every element of that column, and there are n elements in each column. Then the i-jth element of A transpose times A, divided by n - 1, is the covariance between the variables in the ith and jth columns of the data in the matrix. The covariance of two variables is a measure of how related they are. A negative covariance means the variables are negatively related, a positive covariance means they are positively related, while a covariance near zero means they are roughly unrelated in the linear sense. As a corollary, the ith element of the diagonal of A transpose A divided by n - 1 is the variance of the data in the ith column of A. If A is a matrix of data, it's likely not a square matrix, since it will (hopefully) have more rows - cases - than columns - features. However, A transpose times A is a square matrix, with its number of rows and columns equal to the number of features.
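To make this concrete, here is a minimal sketch in Python with NumPy (the tooling is an assumption, not necessarily what this course uses): it centers a small made-up data matrix, forms A transpose times A divided by n - 1, and checks the result against the covariance matrix NumPy computes directly.

```python
import numpy as np

# Hypothetical data matrix: 6 cases (rows) and 3 features (columns)
A = np.array([[2.0, 1.0, 5.0],
              [4.0, 3.0, 2.0],
              [6.0, 2.0, 7.0],
              [8.0, 6.0, 1.0],
              [3.0, 4.0, 4.0],
              [7.0, 5.0, 3.0]])

n = A.shape[0]                        # number of cases
A_centered = A - A.mean(axis=0)       # subtract each column's mean from that column

# (A^T A) / (n - 1) on the centered data is the covariance matrix
cov = A_centered.T @ A_centered / (n - 1)

print(np.allclose(cov, np.cov(A, rowvar=False)))          # True
print(np.allclose(np.diag(cov), A.var(axis=0, ddof=1)))   # diagonal holds the column variances
```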

3. Theory

To check this out, look at the 5 by 2 matrix A. First, we have to subtract the mean from each of the columns.
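The slide's matrix itself isn't reproduced in this transcript. One 5-by-2 matrix consistent with every number quoted in the next step (variances 2.5 and 10, covariance 5) has a first column of 1 through 5 and a second column twice as large; the sketch below uses that assumed matrix to carry out the centering step.

```python
import numpy as np

# Assumed 5-by-2 data matrix (not shown in the transcript);
# its second column is exactly twice its first
A = np.array([[1.0,  2.0],
              [2.0,  4.0],
              [3.0,  6.0],
              [4.0,  8.0],
              [5.0, 10.0]])

# Subtract each column's mean from that column
A_centered = A - A.mean(axis=0)
print(A_centered)
```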

4. Theory

Then, we can compute the matrix A transpose times A, divided by the number of rows of A minus 1, the degrees of freedom for each of the variables. Notice that we get the promised 2 by 2 matrix. Also, if there's any justice in the world, the covariance between the 1st and 2nd variables should be the same as that between the 2nd and 1st, and indeed both entries are 5. Finally, notice that the diagonal terms, as discussed previously, are the variances of the columns: 2.5 for the first variable and 10 for the second.
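Continuing the sketch with the same assumed matrix, A transpose times A divided by n - 1 reproduces the 2-by-2 matrix just described.

```python
import numpy as np

A = np.array([[1.0,  2.0],
              [2.0,  4.0],
              [3.0,  6.0],
              [4.0,  8.0],
              [5.0, 10.0]])            # assumed data; second column is twice the first
A_centered = A - A.mean(axis=0)

n = A.shape[0]
cov = A_centered.T @ A_centered / (n - 1)
print(cov)
# [[ 2.5  5. ]
#  [ 5.  10. ]]  -> variances 2.5 and 10 on the diagonal, covariance 5 off it
```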

5. PCA

Since this covariance matrix is symmetric, its eigenvalues are real and its eigenvectors point in mutually perpendicular directions. The total variance of the data set is the sum of the eigenvalues of A transpose times A divided by the number of degrees of freedom. These eigenvectors are called the principal components of the data set in the matrix A. The direction in which one of these eigenvectors points explains its eigenvalue's worth of the total variance in the dataset. If that eigenvalue is large in relation to the total variance, dimension reduction can be done: most of the data's variation lies along that single direction.
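A short sketch of these facts, again in NumPy as an assumed toolchain: a symmetric covariance matrix has real eigenvalues, its eigenvectors are orthogonal, and the eigenvalues sum to the total variance (the trace of the matrix).

```python
import numpy as np

# Any symmetric covariance matrix will do; this one is made up for illustration
cov = np.array([[3.0, 1.2],
                [1.2, 2.0]])

# eigh is the symmetric eigensolver; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

total_variance = np.trace(cov)                          # sum of the column variances
print(np.isclose(eigenvalues.sum(), total_variance))    # True
print(eigenvalues / total_variance)                     # share of variance each component explains
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))  # eigenvectors are orthonormal
```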

6. Example

Notice that if we look at the eigenvalues and eigenvectors, we find that one eigenvalue makes up all of the total variance. This makes sense: the second column is simply twice the first, so while this data set has two columns, it really only has one column's worth of information. Notice, too, that the corresponding eigenvector's second element is exactly twice its first, matching that insight.
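Using the assumed matrix from the earlier sketches, the eigen decomposition bears this out: one eigenvalue carries the entire total variance of 12.5, and the corresponding eigenvector's second element is twice its first.

```python
import numpy as np

A = np.array([[1.0,  2.0],
              [2.0,  4.0],
              [3.0,  6.0],
              [4.0,  8.0],
              [5.0, 10.0]])            # assumed data; second column is twice the first
A_centered = A - A.mean(axis=0)
cov = A_centered.T @ A_centered / (A.shape[0] - 1)

eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)         # approximately [0, 12.5]: one component holds all the variance
print(eigenvectors[:, 1])  # approximately [0.447, 0.894] (up to sign): second entry is twice the first
```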

7. Let's practice!

Let's look at a real-world example.