Introduction to differential expression analysis

1. Differential expression analysis

In chapter 1, we learned about the goals of differential expression analysis, the features of RNA-Seq count data, and the types of data required for performing differential expression (DE) analysis. In this chapter, we will explore the workflow using DESeq2 and prepare our fibrosis experimental data for differential expression analysis.

2. Differential expression analysis: tools

While a large number of statistical packages have been developed for DE analysis, DESeq2 and edgeR are two of the most popular tools. Both use the negative binomial model to model the raw count data, employ similar methods, and typically yield similar results. Both are fairly stringent and strike a good balance between sensitivity and specificity. Another option is limma with voom, a pair of tools often used together for DE analysis; while also a good method, it can be a bit less sensitive for smaller sample sizes. We will be using DESeq2 for the DE analysis, which has an extensive vignette available to help with questions.

3. Differential expression analysis: DESeq2 vignette

We can use the `vignette()` function to open the DESeq2 vignette, which contains detailed information about the workflow. This should often be the first place to look when you have questions regarding the tool or workflow. For example, if we want to know how to get help for specific questions regarding DESeq2, we can click on the second bullet point under the Standard workflow.
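As a small sketch of how this looks at the R console, assuming DESeq2 is already installed from Bioconductor: the first call lists the vignettes the package ships with, and the second opens the main one. The `interactive()` guard is an addition here so the snippet also runs in a non-interactive script without trying to launch a viewer.

```r
# List the vignettes that ship with DESeq2 (assumes DESeq2 is installed).
available <- vignette(package = "DESeq2")
print(available$results[, c("Item", "Title"), drop = FALSE])

# Open the main DESeq2 vignette in a browser or PDF viewer.
# Guarded so it only runs in an interactive R session.
if (interactive()) {
  vignette("DESeq2")
}
```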

4. Differential expression analysis: DESeq2 vignette

This link will take us to the help section, where the documentation suggests posting questions to the Bioconductor support site if answers can't be found in the vignette.

5. Differential expression analysis: DESeq2 workflow

We will be using the DESeq2 tool to perform the differential expression analysis, but what are the steps we will need to perform? Displayed in the workflow are the steps in the differential expression analysis, separated into quality control and DE analysis steps. To start, we will take the count matrix containing the raw counts for each sample and perform quality control steps. First, the counts will be normalized to account for differences in library depth. Then, principal component analysis and hierarchical clustering using heatmaps will be performed to identify potential sample outliers and major sources of variation in the data. After QC, the DE analysis will be performed, including the modeling of counts, shrinking the log2 fold changes, and testing for differential expression. We will cover each of these steps in more detail as we progress through the workflow.
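The steps above can be sketched end to end in a few lines. This is only a rough outline under stated assumptions, not the exact code used later in the course: it runs on simulated data from DESeq2's `makeExampleDESeqDataSet()` rather than the fibrosis dataset, the `condition` factor and the `condition_B_vs_A` coefficient come from that simulated data, and the choices of `varianceStabilizingTransformation()`, base R `heatmap()`, and `type = "normal"` shrinkage are illustrative.

```r
library(DESeq2)

# Simulated stand-in data (NOT the fibrosis dataset): 1000 genes, 6 samples,
# with a two-level 'condition' factor (A and B).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Quality control, step 1: normalize counts for differences in library depth.
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized = TRUE)

# Quality control, step 2: transform the counts, then look for outliers and
# major sources of variation with PCA and a clustered correlation heatmap.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")
sample_cor <- cor(assay(vsd))
heatmap(sample_cor)

# DE analysis: model the counts, test for differential expression,
# and shrink the log2 fold changes.
dds <- DESeq(dds)
res <- results(dds, name = "condition_B_vs_A")
res_shrunk <- lfcShrink(dds, coef = "condition_B_vs_A", type = "normal")
summary(res_shrunk)
```

Because the simulated data contain no true expression differences by default, the summary is only useful for seeing the shape of the output, not for interpreting results.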

6. Bringing in data for DESeq2

To prepare our data for the DESeq2 workflow, we need to bring two things into R: the raw counts of the number of reads aligning to each gene, and the associated sample metadata. Let's start with the raw counts. We can use the `read.csv()` function, like we did previously, to bring them in, then take a quick peek at the data frame with the `View()` function. Note that, in addition to output from standard alignment and counting tools, DESeq2 will also take counts from pseudo-alignment tools like Kallisto and Salmon. However, these abundance estimates need to be formatted properly for DESeq2 by using the tximport package, as discussed in the vignette.
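Reading the counts might look like the sketch below. The file contents, gene names, and sample names are hypothetical: a tiny CSV is written to a temporary file so the example is self-contained, and `row.names = 1` assumes the first column of the file holds the gene identifiers.

```r
# Hypothetical stand-in for the raw counts file: write a tiny CSV to a
# temporary location so the example runs on its own.
counts_file <- tempfile(fileext = ".csv")
writeLines(c(
  ",sample1,sample2,sample3,sample4",
  "geneA,10,0,25,31",
  "geneB,523,480,610,702",
  "geneC,0,3,1,0"
), counts_file)

# Read the raw counts; the first column (gene IDs) becomes the row names.
rawcounts <- read.csv(counts_file, row.names = 1)

# Take a quick look at the data frame.
head(rawcounts)
str(rawcounts)

# View() opens a spreadsheet-style viewer, so only call it interactively.
if (interactive()) View(rawcounts)
```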

7. Bringing in data for DESeq2: metadata

We can also load the metadata file we created earlier using the `read.csv()` function. Now we have both of the files needed for performing the differential expression analysis.
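A minimal sketch of loading the metadata follows. The column names (`genotype`, `condition`), their values, and the sample names are hypothetical placeholders, not the contents of the course's metadata file. The final check is an addition here: DESeq2 expects the metadata rows to be in the same order as the count matrix columns, so it is worth confirming before going further.

```r
# Hypothetical stand-in for the metadata file, written to a temporary CSV.
meta_file <- tempfile(fileext = ".csv")
writeLines(c(
  ",genotype,condition",
  "sample1,wt,normal",
  "sample2,wt,fibrosis",
  "sample3,ko,normal",
  "sample4,ko,fibrosis"
), meta_file)

# Read the metadata; sample names become row names, text columns become factors.
metadata <- read.csv(meta_file, row.names = 1, stringsAsFactors = TRUE)
print(metadata)

# Hypothetical raw counts with matching sample names, for the check below.
rawcounts <- data.frame(
  sample1 = c(10, 523),
  sample2 = c(0, 480),
  sample3 = c(25, 610),
  sample4 = c(31, 702),
  row.names = c("geneA", "geneB")
)

# Confirm the metadata rows line up with the count matrix columns.
samples_match <- identical(rownames(metadata), colnames(rawcounts))
print(samples_match)
```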

8. Let's practice!

Let's explore the DESeq2 vignette a bit more on our own and practice bringing in our data.