1. DESeq2 results
In this lesson, we will explore our differential expression results.
2. DESeq2 results table
To get descriptions for the columns in the results table, we can use the mcols() function.
The first column is the mean value across all samples, followed by the shrunken log2 foldchanges, standard error of the fold change estimates, the Wald statistics output from the Wald test for differential expression, the Wald test p-value, and the Benjamini-Hochberg adjusted p-value.
3. DESeq2 results table
Now let's look at the values in the results table and identify the differentially expressed genes. To determine significant DE genes, we will be using the p-values adjusted for multiple test correction in the last column.
The reason for this is that for every gene tested with an alpha of 0-point-05, there is a 5% chance that the gene is called as DE when it is not, yielding false positives.
If we were to test the roughly 47,000 genes in the raw counts file, we would have about 5% or over 2,000 genes as false positives. It would be difficult to identify the true positives, or genes that are called DE when they truly are, from the false.
Therefore, multiple test correction is performed by DESeq2 using the Benjamini-Hochberg, or BH-method, to adjust p-values for multiple testing and control the proportion of false positives relative to true.
Using the BH-method and an alpha value of 0-point-05, if we had 1,000 genes identified as DE, we would expect 5% of the DE genes to be false positives, or 50 genes.
To reduce the number of genes tested, DESeq2 automatically filters out genes unlikely to be truly differentially expressed prior to testing, such as genes with zero counts across all samples, genes with low mean values across all samples, and genes with extreme count outliers.
We can see the filtered genes in the results tables represented by an NA in the p-adjusted column.
4. Significant DE genes - summary
DESeq2's summary() function provides the number of differentially expressed genes for our alpha level and information about the number of genes filtered.
Our results give over 10,000 genes as DE, which is the sum of the DE genes with log2 fold changes less than 0 and those with fold changes greater than 0.
This is a lot of genes to sift through. If we wanted to return the genes most likely to be biologically relevant, we could also include a log2 fold change threshold.
Oftentimes, a log2 fold change threshold isn't preferred. However, it can be helpful when dealing with such large numbers of DE genes.
5. Significant DE genes - fold-change threshold
To test for significant genes using both an alpha value threshold and a log2 foldchange threshold different from 0, we need to re-run the results function. Let's use a small 1-point-25-foldchange threshold, which equals 0-point-32 on the log2 scale, by adding the lfcThreshold argument to our results() function.
While using any log2 fold change cut-off increases the risk of losing biologically relevant genes, by using a very small log2 foldchange threshold, we are hoping to reduce the risk that the genes more biologically meaningful.
6. Significant DE genes - summary
Now that we have the results, we need to re-shrink the foldchanges, then run the summary() function again.
Now, we have returned just over 6,000 DE genes.
7. Results - annotate
To better understand which genes the results pertain to, we can use the annotables package to quickly obtain gene names for the Ensembl gene IDs using the table of gene annotations for the Grch38 mouse genome build.
8. Results - extract
To annotate the genes with gene names and descriptions, we need to first turn our results table into a data frame using the data-dot-frame() function.
Then, after changing the row names to a column, we can merge the gene names and descriptions with our results using the left_join() function and merging by Ensembl gene IDs. Now we have our entire results table.
9. Significant DE genes - arrange
To extract the significant DE genes, we'll subset the results table, using the subset() function, for genes with p-adjusted values less than the alpha value of 0-point-05.
We should see all p-adjusted values are less than 0-point-05 and log2 foldchanges are greater than the absolute value of 0-point-32.
We will use the arrange() function to order the genes by p-adjusted values to generate the final table of significant results.
We can explore this table for interesting or expected genes with a high probability of being related to kidney fibrosis.
10. Significant DE genes
If we look up many of these top genes, we will find known roles associated with fibrosis, which is an encouraging and exciting result for us.
11. Let's practice!
Now, it's time to practice.