Get startedGet started for free

DE analysis

1. Overview of the DE analysis

Now that we have explored the quality of our samples, removed any outlier samples, and assessed the major sources of variation present, we can start the differential expression analysis with DESeq2.

2. Review the dataset/question

Remember, by performing the differential expression analysis with DESeq2, we are trying to identify which genes have significant differences in expression between the normal and fibrosis sample groups.

3. Overview of the DE analysis

The differential expression analysis with DESeq2 consists of roughly three steps: fitting the raw counts for each gene to the DESeq2 negative binomial model and testing for differential expression, shrinking the log2 fold changes, and extracting and visualizing the results.

4. DESeq2 workflow: Model

We will start by fitting the raw counts to the DESeq2 model. To do this, DESeq2 will first need to estimate the size factors, if they haven't already been estimated, and estimate the variation in expression across replicates for each gene. After these calculations, the data can be fit to the negative binomial model.

5. DESeq2 workflow: Model

To perform these calculations and fit the negative binomial model requires only two functions. The first function creates the DESeq2 object, which is the same function we used previously during count normalization. We wouldn't need to re-create the object unless we removed samples or found additional sources of variation during QC using PCA and the correlation heatmap. When we created the DESeq2 object, we provided the raw counts, metadata, and design formula. We have explored the raw counts and metadata, but what exactly was specified with the design formula? The design formula is an important part of modeling, telling DESeq2 the known major sources of variation to control for, or regress out, as well as, the condition of interest to use for differential expression testing.

6. DESeq2 workflow: Design formula

For example, if this is your metadata and you know that strain, sex, and treatment are major sources of variation in the data as determined by the PCA and heatmap, then all of these factors should be included in the design formula. If my condition of interest is treatment, then it would come last in the formula with the other factors preceding it in any order. Therefore, the design formula would be: tilde, strain plus sex plus treatment. The tilde tells DESeq2 to model the counts using the following formula, so it should always proceed your factors. Also, the factor names in the design formula need to exactly match the column names in the metadata.

7. DESeq2 workflow: Design formula

DESeq2 also allows for complex designs. For instance, using the same metadata, if we wanted to know the effect of sex on the effect of treatment, we could use an interaction term. In this case, we could regress out the variation due to strain, sex and treatment and test for genes that significantly differ in their treatment effect due to sex using the interaction term, sex colon treatment, as the last term in the formula. For more information about specifying complex designs, I recommend reading through the DESeq2 vignette or the Bioconductor support site.

8. DESeq2 workflow: Running

Once you have the DESeq2 object with the raw counts, metadata, and the appropriate design formula, you can perform model fitting with the function DESeq(). As it runs it will output the completed steps as it fills in the different slots in the DESeq2 object. The final DESeq2 object will contain all of the information needed for performing the differential expression testing between specific sample groups.

9. Let's practice!

Now let's try some examples.