1. Performing PCA in R
Like many things in data science, there's an easy way to perform PCA in R. Now that you understand what PCA is doing, let's get down to doing it in R!
2. NFL Combine Data
Let's take a look at our real-world example: athletic data for prospective NFL players.
3. NFL Combine Data
The command prcomp() will do all of the analysis for you.
One thing to initially notice is that the variability is posed in terms of the standard deviations, which are just the square roots of the variances looked at in the last lesson.
The output also shows the weight of each variable in each principal component. For example, height is given a weight of roughly 0.04 in the first principal component, while vertical jump is given a weight of about -0.06.
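A minimal sketch of the prcomp() call described above. The actual NFL combine data isn't included here, so this example simulates a small combine-style data frame; the column names and values are purely illustrative.

```r
# Simulated stand-in for the NFL combine data (illustrative only)
set.seed(1)
combine <- data.frame(
  height   = rnorm(100, mean = 74, sd = 2),    # inches
  weight   = rnorm(100, mean = 245, sd = 45),  # pounds
  forty    = rnorm(100, mean = 4.8, sd = 0.3), # 40-yard dash, seconds
  vertical = rnorm(100, mean = 33, sd = 4)     # vertical jump, inches
)

# prcomp() performs the PCA in one call
pca <- prcomp(combine)

pca$sdev      # standard deviations: square roots of the component variances
pca$rotation  # weights (loadings) of each variable in each principal component
```

Note that the standard deviations come out in decreasing order, so the first component always captures the most variability.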
4. NFL Combine Data
You can look again at how much each principal component contributes to variability in the data set by using the summary command. The first principal component contributes 96.72 percent of the variability in the data, the second just under two percent, and so on. The summary command is a big one in R.
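The summary() call looks like this. As before, the real combine data isn't available here, so a simulated data frame stands in for it; the exact percentages will differ from the ones quoted above.

```r
# Simulated stand-in for the NFL combine data (illustrative only)
set.seed(1)
combine <- data.frame(
  height = rnorm(100, 74, 2), weight = rnorm(100, 245, 45),
  forty = rnorm(100, 4.8, 0.3), vertical = rnorm(100, 33, 4)
)
pca <- prcomp(combine)

# The "Proportion of Variance" row shows each component's contribution
summary(pca)

# The same proportions by hand: each variance over the total variance
props <- pca$sdev^2 / sum(pca$sdev^2)
props
```

The proportions always sum to one, since together the components account for all of the variability in the data.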
5. NFL Combine Data
To actually apply the principal components to your data, you simply extract these vectors using the $x element, taking as many principal components as you want. For example, the first two principal components look like this.
You can attach these new columns to the original data using the cbind() command (short for "column bind"), to add a name to the data, so to speak.
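The extraction and binding steps above can be sketched as follows, again on a simulated combine-style data frame since the real data isn't included here.

```r
# Simulated stand-in for the NFL combine data (illustrative only)
set.seed(1)
combine <- data.frame(
  height = rnorm(100, 74, 2), weight = rnorm(100, 245, 45),
  forty = rnorm(100, 4.8, 0.3), vertical = rnorm(100, 33, 4)
)
pca <- prcomp(combine)

# pca$x holds the transformed data; take the first two components
first_two <- pca$x[, 1:2]

# cbind() attaches the new PC1 and PC2 columns to the original data
combine_pc <- cbind(combine, first_two)
head(combine_pc)
```

Because prcomp() names its output columns PC1, PC2, and so on, the bound data frame keeps those names automatically.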
6. Things to Do After PCA
Once your data has been through PCA, you can do many things. One of the first things to check is whether your data is setting you up to solve the right problem. For example, in the NFL combine data, the first principal component really just separates players by position, so the later principal components become the important ones once you separate your data by position.
Another thing you can do is visualize your data. Most humans struggle with visualizing data in three or more dimensions. If you can collapse your data into two dimensions, you can possibly understand its structure visually.
PCA'ed data is usually better for clustering analysis than non-PCA'ed data.
PCA is also useful for supervised learning: by construction the components are uncorrelated, so there is no redundancy among them, and PCA can serve as a feature engineering step in such an effort.
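As a sketch of the clustering idea above, you can run an ordinary clustering algorithm such as kmeans() on the PCA scores instead of the raw columns. The data here is simulated, and the choice of two components and three clusters is arbitrary, purely for illustration.

```r
# Simulated stand-in for the NFL combine data (illustrative only)
set.seed(1)
combine <- data.frame(
  height = rnorm(100, 74, 2), weight = rnorm(100, 245, 45),
  forty = rnorm(100, 4.8, 0.3), vertical = rnorm(100, 33, 4)
)
pca <- prcomp(combine)

# Cluster on the first two (uncorrelated) components rather than raw data
clusters <- kmeans(pca$x[, 1:2], centers = 3)
table(clusters$cluster)  # how many players fall in each cluster
```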
7. Example - Data Visualization
Here's one example of some data exploration we can do. Looking at how a player's football position relates to their principal components, it's pretty clear that position is a big factor.
If your job is to make distinctions within a position, you might have to take your analysis a step further.
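A plot like the one described can be sketched in base R by coloring the first two principal components by position. Both the measurements and the position labels are simulated here, so unlike the real combine data, no position pattern will actually appear.

```r
# Simulated stand-in for the NFL combine data and positions (illustrative only)
set.seed(1)
position <- factor(sample(c("QB", "WR", "OL"), 100, replace = TRUE))
combine <- data.frame(
  height = rnorm(100, 74, 2), weight = rnorm(100, 245, 45),
  forty = rnorm(100, 4.8, 0.3), vertical = rnorm(100, 33, 4)
)
pca <- prcomp(combine)

# Plot PC1 against PC2, colored by position
plot(pca$x[, 1], pca$x[, 2], col = position,
     xlab = "PC1", ylab = "PC2", main = "Combine data by position")
legend("topright", legend = levels(position),
       col = seq_along(levels(position)), pch = 1)
```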
8. Let's practice!
Let's finish the course by doing some PCA in R.