Principal Component Analysis

1. Principal Component Analysis

When working with multivariate data, we are often presented with a large number of strongly correlated variables. This poses computational challenges and the results are hard to interpret. As a result, data scientists often choose to work with a few relevant but uncorrelated variables derived from the original variables in the dataset.

2. Principal Component Analysis (PCA) goals

Principal Component Analysis, or PCA, provides a framework for constructing new variables which are weighted sums of existing variables, with the goal of making them uncorrelated. Further, only a few of these variables might allow us to capture a majority of the variation present in the full dataset.

3. Algorithm

We first need to construct the principal components, or PCs. In this three-dimensional data example, the first PC, shown in orange, explains the maximum possible variation. PC2, in blue, is uncorrelated with PC1 and explains the maximum remaining variation in that direction. PC3, in green, explains the remaining variation and is uncorrelated with both PC1 and PC2. Fortunately, we do not have to code these optimization steps ourselves, since the princomp() function will give us all the PCs.
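The claim that the PCs are mutually uncorrelated can be checked directly. As a small sketch (the choice of three mtcars columns here is illustrative, not from the slides), we can compute the correlation matrix of the PC scores and confirm the off-diagonal entries are essentially zero:

```r
# Illustrative check: PC scores are mutually uncorrelated.
# mtcars is a built-in R dataset; the column choice is arbitrary.
pca <- princomp(mtcars[, c("mpg", "disp", "hp")], cor = TRUE, scores = TRUE)

# Correlations between scores on different PCs are zero
# up to numerical precision.
round(cor(pca$scores), 3)
```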

4. Principal Component Analysis in R

There are several functions that can perform PCA in R. We will use the princomp() function. The princomp() function takes a data frame or matrix as its first argument. By default, the function uses the covariance matrix to construct the PCs. If you want to work with the scaled variables instead of the unscaled variables, you can specify cor equals TRUE to use the correlation matrix instead. The third argument, scores, takes a logical value indicating whether the score on each PC should be calculated. We will discuss these in more detail later.
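The general call pattern looks like this. As a self-contained sketch, we use the built-in USArrests dataset here purely for illustration:

```r
# General princomp() call pattern described above.
# USArrests is a built-in dataset, used here only as a stand-in.
pca <- princomp(USArrests, cor = TRUE, scores = TRUE)

# The returned object bundles the pieces discussed later,
# such as sdev (standard deviations), loadings, and scores.
names(pca)
```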

5. Principal Component Analysis of mtcars dataset

Let's use the princomp() function on a subset of the mtcars dataset. The first five entries of the mtcars dataset show that the am and gear variables are binary. Standard PCA is not designed to work on binary or categorical data, so we need to exclude these variables.

6. Selecting numeric columns from mtcars dataset

First, create a new dataset, mtcars dot sub, excluding the two binary variables. Then use the princomp() function on the new mtcars dot sub dataset to create a PCA object cars dot pca. Be sure to set cor and scores equal to TRUE since we are interested in the scaled variables and also want to calculate the scores. In the next slide, we will explore the contents of the cars dot pca object.
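The steps just described can be written as follows (object names match the narration):

```r
# Exclude the two binary variables, am and gear, from mtcars.
mtcars.sub <- subset(mtcars, select = -c(am, gear))

# Run PCA on the scaled variables and compute the scores.
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
```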

7. princomp function output

Typing cars dot pca provides basic information about how much variation each of the nine components explains. Note that the standard deviations of the components are arranged from largest to smallest, implying that the first few components explain a large portion of the overall variation. The summary function gives the proportion and cumulative proportion of variation explained. Reading the cumulative proportion row, we can see that the first four components explain more than 96 percent of the variation in the data. In the next video, we will use these values to decide how many PCs to retain in the final model.
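The two ways of inspecting the output described above can be sketched as follows (rebuilding the PCA object first so the snippet runs on its own):

```r
# Rebuild the PCA object from the mtcars subset.
mtcars.sub <- subset(mtcars, select = -c(am, gear))
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

# Printing the object shows the standard deviations of the nine PCs.
cars.pca

# summary() adds the proportion and cumulative proportion of
# variance explained by each component.
summary(cars.pca)
```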

8. Let's apply principal component analysis!

Now let's apply principal component analysis on the state dot x77 data, which contains multiple observations about each US state.
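As a starting point for the exercise, the same pattern carries over directly. state dot x77 is a built-in numeric matrix, so it can be passed to princomp() as-is (the object name state.pca below is our own choice):

```r
# state.x77 is a built-in matrix of per-state measurements,
# so no column filtering is needed before PCA.
state.pca <- princomp(state.x77, cor = TRUE, scores = TRUE)

# Inspect how much variation each component explains.
summary(state.pca)
```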