1. RNA-Seq DE analysis summary 2
Let's start the DE analysis workflow!
2. DESeq workflow - normalization
We start with normalizing the raw counts to assess sample-level quality control metrics.
Normalization allows comparison of counts within a sample and between the different samples. To compare counts for the same gene between different samples we need to normalize for library size, or total number of reads per sample, while accounting for RNA composition.
First, we calculate the normalization factors or size factors with the estimateSizeFactors() function, then we use the size factors to calculate the normalized counts using the counts() function.
3. Unsupervised clustering analyses: log transformation
The normalized counts are needed to explore the similarity between samples using hierarchical correlation heatmaps and PCA.
When using these visualization methods, it's recommended to log transform the normalized counts to improve the distances of the clusters for visualization.
The vst() function performs this transformation while being blind to sample group information.
4. Unsupervised clustering analyses
Now the clustering analyses are performed to identify the major sources of variation in our dataset and to explore the clustering of the samples and identify outliers or worrisome samples using the correlation heatmap and pca methods.
5. Unsupervised clustering analyses - heatmap
For the heatmap, we extract the transformed normalized counts using the assay() function on the transformed counts object, then compute the correlation values between samples using the cor() function.
Finally, we plot the heatmap using the pheatmap() function on the correlated values and annotating with the relevant column or columns in the metadata.
Remember we are looking for our biological replicates to cluster together and the sample groups to cluster apart. Outlier samples can also be identified, which often tend to have low correlation values with all other samples.
6. Unsupervised clustering analyses - pca
We can also explore sample clustering by PCA. We can plot the first and second principal components using DESeq2's plotPCA() function and giving the transformed counts object.
Again, we are looking for our replicates to cluster together and sample groups to cluster apart. If there is an outlier, we should see it using both clustering techniques before removing.
Also, we should be looking for major sources of variation driving separation on PC1 and PC2. We hope our condition of interest is separating on PC1 and/or PC2.
7. Running the DE analysis
After QC and identification of the major sources of variation in our dataset, we can move on to differential expression analysis.
8. Running the DE analysis
If during the cluster analyses additional sources of variation are identified or samples are removed, then we would need to update the DESeq2 object prior to running the analysis.
Finally, we can run the DESeq() function to perform the dispersion estimation, fitting the counts to the negative binomial model and differential expression testing.
9. DESeq2 workflow - model
To check the fit of our data to the model, we can explore the dispersion estimates of the data using the plotDispEsts() function.
We expect the dispersions to decrease with increasing mean and generally follow the fitted line.
10. DESeq2 workflow - contrasts and LFC shrinkage
Then, we can extract the results and perform log2 foldchange shrinkage.
11. DESeq2 workflow - contrasts and LFC shrinkage
We can use the results() function to extract the results for our comparisons of interest using contrasts and use the lfcShrink() function to generate more accurate log2 foldchange estimates.
12. DESeq2 workflow - LFC shrinkage
To generate the final results table, we add gene annotations and order the genes by p-adjusted values.
Note that for adding the annotations, the annotables organism table should use the same version or build of the genome used for alignment and counting.
13. DESeq2 workflow - results exploration
Finally, we can identify the significant differentially expressed genes.
14. DESeq2 workflow - results exploration
To do this we subset the results using the p-adjusted value less than the alpha threshold.
15. DESeq2 workflow - results visualization
We can visualize our results with a series of plots, including the MA plot using the plotMA() function, volcano plot, and expression heatmap. We could also plot the counts of any genes of interest.
16. Let's practice!
Now let's explore the workflow in bit more depth.