Get startedGet started for free

RNA-Seq DE analysis summary

1. RNA-Seq DE analysis summary 2

Let's start the DE analysis workflow!

2. DESeq workflow - normalization

We start with normalizing the raw counts to assess sample-level quality control metrics. Normalization allows comparison of counts within a sample and between the different samples. To compare counts for the same gene between different samples we need to normalize for library size, or total number of reads per sample, while accounting for RNA composition. First, we calculate the normalization factors or size factors with the estimateSizeFactors() function, then we use the size factors to calculate the normalized counts using the counts() function.

3. Unsupervised clustering analyses: log transformation

The normalized counts are needed to explore the similarity between samples using hierarchical correlation heatmaps and PCA. When using these visualization methods, it's recommended to log transform the normalized counts to improve the distances of the clusters for visualization. The vst() function performs this transformation while being blind to sample group information.

4. Unsupervised clustering analyses

Now the clustering analyses are performed to identify the major sources of variation in our dataset and to explore the clustering of the samples and identify outliers or worrisome samples using the correlation heatmap and pca methods.

5. Unsupervised clustering analyses - heatmap

For the heatmap, we extract the transformed normalized counts using the assay() function on the transformed counts object, then compute the correlation values between samples using the cor() function. Finally, we plot the heatmap using the pheatmap() function on the correlated values and annotating with the relevant column or columns in the metadata. Remember we are looking for our biological replicates to cluster together and the sample groups to cluster apart. Outlier samples can also be identified, which often tend to have low correlation values with all other samples.

6. Unsupervised clustering analyses - pca

We can also explore sample clustering by PCA. We can plot the first and second principal components using DESeq2's plotPCA() function and giving the transformed counts object. Again, we are looking for our replicates to cluster together and sample groups to cluster apart. If there is an outlier, we should see it using both clustering techniques before removing. Also, we should be looking for major sources of variation driving separation on PC1 and PC2. We hope our condition of interest is separating on PC1 and/or PC2.

7. Running the DE analysis

After QC and identification of the major sources of variation in our dataset, we can move on to differential expression analysis.

8. Running the DE analysis

If during the cluster analyses additional sources of variation are identified or samples are removed, then we would need to update the DESeq2 object prior to running the analysis. Finally, we can run the DESeq() function to perform the dispersion estimation, fitting the counts to the negative binomial model and differential expression testing.

9. DESeq2 workflow - model

To check the fit of our data to the model, we can explore the dispersion estimates of the data using the plotDispEsts() function. We expect the dispersions to decrease with increasing mean and generally follow the fitted line.

10. DESeq2 workflow - contrasts and LFC shrinkage

Then, we can extract the results and perform log2 foldchange shrinkage.

11. DESeq2 workflow - contrasts and LFC shrinkage

We can use the results() function to extract the results for our comparisons of interest using contrasts and use the lfcShrink() function to generate more accurate log2 foldchange estimates.

12. DESeq2 workflow - LFC shrinkage

To generate the final results table, we add gene annotations and order the genes by p-adjusted values. Note that for adding the annotations, the annotables organism table should use the same version or build of the genome used for alignment and counting.

13. DESeq2 workflow - results exploration

Finally, we can identify the significant differentially expressed genes.

14. DESeq2 workflow - results exploration

To do this we subset the results using the p-adjusted value less than the alpha threshold.

15. DESeq2 workflow - results visualization

We can visualize our results with a series of plots, including the MA plot using the plotMA() function, volcano plot, and expression heatmap. We could also plot the counts of any genes of interest.

16. Let's practice!

Now let's explore the workflow in bit more depth.