1. Principal component analysis
The next unsupervised method for exploring how samples group together is principal component analysis, or PCA.
2. Principal component analysis (PCA): Overview
PCA is a technique used to emphasize the variation present in a dataset. PCA finds the principal components of a dataset, with the first principal component, or PC1, representing the greatest amount of variance in the data.
3. Principal component analysis (PCA): Theory
To understand this a bit better, we can think of a dataset with two samples. We could plot the normalized counts of every gene for one sample on the x-axis and the other sample on the y-axis.
In this example, gene A has a normalized count of 4 for sample 1, plotted on the x-axis, and 5 for sample 2, plotted on the y-axis. We can plot the other genes similarly.
4. Principal component analysis (PCA): Theory
We can draw a line through the dataset in the direction of the most variation, that is, where the spread is largest. In this example, the largest spread lies along the line between genes B and C. This line represents the first principal component.
The second principal component, PC2, captures the second-largest amount of variation. It must be perpendicular to PC1 so that it best describes the variance in the dataset not already included in PC1.
In this example, PC2 is drawn between genes A and D. The spread is much smaller for PC2.
In reality, your dataset will have more samples and many more genes. The number of principal components is equal to the number of samples, n, in the dataset, so finding the direction of largest variation, PC1, means fitting a line through n-dimensional space.
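As a rough sketch of the two-sample example, base R's prcomp() can find PC1 and PC2 for us. Only gene A's counts (4 and 5) come from the example; the values for genes B, C and D are made up for illustration, chosen so that B and C lie far apart and A and D lie close together.

```r
# Hypothetical normalized counts: only gene A's values (4 and 5) come from the
# example; genes B, C and D are invented for illustration.
counts <- matrix(
  c(4, 5,
    1, 1,
    9, 10,
    6, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("sample1", "sample2"))
)

# Genes are the points and samples are the axes, matching the toy plot.
pca <- prcomp(counts, center = TRUE, scale. = FALSE)

# Columns of `rotation` give the directions of PC1 and PC2 in sample space.
pc1 <- pca$rotation[, "PC1"]
pc2 <- pca$rotation[, "PC2"]

# PC2 is perpendicular to PC1, so their dot product is numerically zero.
dot <- sum(pc1 * pc2)

# Share of the total variance captured by each PC (PC1 first, then PC2).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

With two samples there are two axes, so prcomp() returns two principal components, and PC1 always carries the larger share of the variance.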
5. Principal component analysis (PCA): Theory
The most variable genes for a principal component have the most influence on that principal component's direction.
In our example, the most variable genes for PC1, genes B and C, would affect the direction of the line more than genes A and D.
6. Principal component analysis (PCA): Theory
We give quantitative scores to genes based on how much they influence the different PCs.
7. Principal component analysis (PCA): Theory
A 'per sample' PC value is computed by multiplying each gene's influence by its normalized read count in that sample, then summing these products across all genes.
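To make the per-sample calculation concrete, here is a minimal sketch in base R. The expression matrix is simulated, and the names expr, loadings and pc1_manual are hypothetical. Note one detail beyond the description above: prcomp() centers each gene across samples before the products are summed, so the sketch does the same.

```r
set.seed(1)
# Hypothetical transformed expression matrix: 6 genes (rows) x 4 samples (columns).
expr <- matrix(rnorm(24, mean = 8, sd = 2), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# prcomp() expects samples in rows, so transpose the gene-by-sample matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Influence of each gene on each PC: one row per gene, one column per PC.
loadings <- pca$rotation

# Center each gene across samples, as prcomp() does internally.
centered <- scale(t(expr), center = TRUE, scale = FALSE)

# Per-sample PC1 value: multiply each gene's value by its influence on PC1,
# then sum across all genes.
pc1_manual <- sapply(colnames(expr), function(s) {
  sum(centered[s, ] * loadings[, "PC1"])
})

# The manual values match the scores that prcomp() stores in `x`.
print(all.equal(pc1_manual, pca$x[, "PC1"]))
```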
8. Principal component analysis (PCA): Theory
For PCA we generally plot these per sample PC values. Samples that cluster together have more similar gene expression profiles than samples that cluster apart, especially for the most variable genes.
9. Principal component analysis (PCA): Theory
This is a good method to explore the quality of the data, as we hope to see replicates cluster together and conditions separate on PC1. Sample outliers and major sources of variation can also be identified with this method.
10. Principal component analysis (PCA): Theory
We can perform PCA using DESeq2's plotPCA() function to plot the first two PCs. This function takes as input the transformed vsd object, and we can use the intgroup argument to specify what factor in the metadata to use to color the plot.
We can see that the sample groups, normal and fibrosis, separate well on PC1. This means that our condition corresponds to PC1, which represents 88% of the variance in our data, while 4% is explained by PC2. This is great since it seems that a lot of the variation in gene expression in the dataset can likely be explained by the differences between sample groups.
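As a hedged sketch of this workflow, the code below runs plotPCA() on simulated data rather than the course dataset. It assumes the DESeq2 package is installed. makeExampleDESeqDataSet() produces a small count dataset with a condition column whose levels are A and B, standing in for normal and fibrosis, so the percentages it reports will differ from the 88% and 4% quoted above.

```r
library(DESeq2)

# Simulated stand-in for the course data: 1000 genes, 6 samples, with a
# `condition` column (levels A and B) in the sample metadata.
set.seed(42)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)

# Variance-stabilizing transformation, blind to the design for quality control.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Plot PC1 against PC2, colored by the `condition` column of the metadata.
plotPCA(vsd, intgroup = "condition")

# returnData = TRUE returns the plotted coordinates instead of the plot;
# the percentage of variance per PC is stored in the "percentVar" attribute.
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
percent_var <- round(100 * attr(pca_data, "percentVar"))
print(percent_var)
```

The returned data frame can be passed to your own plotting code if you want more control over the figure than plotPCA() offers.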
However, if the samples do not separate by condition on PC1 or PC2, the effect of the condition could be small, or there may be other, larger sources of variation present. You can color the PCA plot by other factors, such as age, sex, or batch, to identify other major sources of variation that could correspond to one of the large principal components. We'll talk later about how we can account for these sources of variation in the model.
Just to note, if you would like to explore PCs other than PC1 or PC2, the base R prcomp() function allows a more thorough analysis of PCs.
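A minimal sketch of that kind of exploration follows, using only base R. The matrix expr is simulated and stands in for the transformed expression values; with real data you would extract them from the vsd object instead.

```r
set.seed(7)
# Hypothetical stand-in for the transformed expression matrix:
# 200 genes (rows) x 8 samples (columns).
expr <- matrix(rnorm(200 * 8, mean = 8, sd = 1), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:8)))

# prcomp() expects samples in rows, so transpose the gene-by-sample matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Percentage of variance explained by every PC, not just the first two.
percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(percent_var, 1))

# Per-sample values for PC3 and PC4, ready to plot or join with metadata.
pc_scores <- as.data.frame(pca$x[, c("PC3", "PC4")])
plot(pc_scores$PC3, pc_scores$PC4, xlab = "PC3", ylab = "PC4")
```

One difference to keep in mind: plotPCA() uses only the 500 most variable genes by default, whereas this sketch passes every gene to prcomp(), so the two will not give identical results on the same data.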
11. Let's practice!
Now let's try doing this ourselves.