1. DESeq2 model
Now that we have run the DE analysis, we could explore our results. However, before proceeding, we should explore how well our data fit the model.
2. DESeq2 model
The goal of the differential expression analysis is to determine whether a gene's mean expression differs between sample groups, given the variation within groups. This is determined by testing whether the log2 fold change between groups is significantly different from zero.
The log2 fold change is the log2 of the mean of one sample group, shown here as the treatment group, divided by the mean of the other sample group, shown here as the control group.
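As a rough illustration of that ratio, here is a minimal sketch using made-up normalized counts for a single gene. The values and the variable names `control` and `treatment` are hypothetical, and DESeq2 itself estimates fold changes by fitting a model rather than by this direct ratio.

```r
# Hypothetical normalized counts for one gene (made-up values)
control   <- c(90, 100, 110)   # group mean = 100
treatment <- c(380, 400, 420)  # group mean = 400

# log2 fold change: log2 of the treatment mean divided by the control mean
log2_fc <- log2(mean(treatment) / mean(control))
log2_fc  # 2, i.e. a four-fold increase in the treatment group
```

A value of zero would mean the two group means are equal, which is why the test asks whether the log2 fold change differs from zero.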
Therefore, modeling the counts requires information about the mean and variation in the data. To explore the variation in our data, we will observe the variance in gene expression relative to the mean.
Variance is the square of the standard deviation and represents how far the expression values of the individual samples, shown by the dark red and blue circles, are from the group means, shown in pink and light blue.
3. DESeq2 model - mean-variance relationship
For RNA-Seq data, the variance is generally expected to increase with the gene's mean expression.
To observe this relationship, we can calculate the means and variances for every gene of the normal samples using the apply() function.
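A minimal sketch of that calculation is shown below. It assumes a count matrix with genes in rows and the normal samples in columns; the simulated matrix `counts` and its dimensions are hypothetical stand-ins for the real raw counts.

```r
set.seed(1)

# Hypothetical raw counts: 200 genes (rows) x 3 normal samples (columns),
# simulated from a negative binomial with a different mean for each gene
n_genes <- 200
gene_mu <- rep(c(10, 100, 1000), length.out = n_genes)
counts <- matrix(
  rnbinom(n_genes * 3, mu = gene_mu, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("normal", 1:3))
)

# MARGIN = 1 applies the function across each row, i.e. per gene
mean_counts     <- apply(counts, 1, mean)
variance_counts <- apply(counts, 1, var)

head(data.frame(mean_counts, variance_counts))
```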
4. DESeq2 model - dispersion
Then we can create a data frame for plotting with ggplot2.
We plot the mean and variance values for each gene using log10 scales. Each black dot represents a gene.
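The plotting step could look roughly like the sketch below. The counts are simulated here so the example runs on its own, and the names `df`, `mean_counts`, and `variance_counts` are illustrative assumptions. The plot is only drawn if ggplot2 is installed.

```r
set.seed(1)

# Hypothetical raw counts: 500 genes x 3 samples with gene means spread
# over several orders of magnitude
n_genes <- 500
gene_mu <- 10^runif(n_genes, min = 0.5, max = 4)
counts  <- matrix(rnbinom(n_genes * 3, mu = gene_mu, size = 5), nrow = n_genes)

# One row per gene, holding its mean and variance across samples
df <- data.frame(
  mean_counts     = apply(counts, 1, mean),
  variance_counts = apply(counts, 1, var)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df) +
    geom_point(aes(x = mean_counts, y = variance_counts)) +
    scale_x_log10() +   # log10 scale on both axes
    scale_y_log10() +
    xlab("Mean counts per gene") +
    ylab("Variance per gene")
  print(p)
}
```

Genes whose variance is exactly zero cannot be shown on a log scale, so ggplot2 may warn about them.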
5. DESeq2 model - dispersion
We see the variance in gene expression increases with the mean. This is expected for RNA-Seq data.
Also, note how the range in values for variance is greater for lower mean counts than higher mean counts. This is also expected for RNA-Seq count data.
A measure of the variance for a given mean is described by a metric called dispersion in the DESeq2 model.
The DESeq2 model uses dispersion to assess the variability in expression when modeling the counts.
6. DESeq2 model - dispersion
The DESeq2 model calculates dispersion as being inversely related to the mean and directly related to the variance of the data, using the formula displayed.
So, an increase in variance will increase dispersion, while an increase in mean will decrease dispersion. For any two genes with the same mean expression, the only difference in dispersion will be based on differences in variance.
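The formula itself is not reproduced in this transcript. The sketch below assumes it is the usual negative binomial relation Var = mu + alpha * mu^2, where alpha is the dispersion, rearranged to alpha = (Var - mu) / mu^2. The helper `dispersion_from_moments()` is a hypothetical name used only for illustration; DESeq2 estimates dispersion by fitting its model, not with this direct calculation.

```r
# Rearrangement of the assumed relation Var = mu + alpha * mu^2
dispersion_from_moments <- function(mu, variance) {
  (variance - mu) / mu^2
}

# Same mean, higher variance: dispersion goes up
dispersion_from_moments(mu = 100, variance = 1100)   # 0.1
dispersion_from_moments(mu = 100, variance = 2100)   # 0.2

# Same variance, higher mean: dispersion goes down
dispersion_from_moments(mu = 1000, variance = 2100)  # 0.0011
```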
To check the fit of our data to the DESeq2 model, it can be useful to look at the dispersion estimates.
7. DESeq2 model - dispersion
To plot the dispersions relative to the means for each gene, we can use the plotDispEsts() function on the DESeq2 object.
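A minimal sketch of that call is shown below. It uses a small simulated data set created with DESeq2's `makeExampleDESeqDataSet()` instead of the course data, so the object name `dds` and the data set size are assumptions. The code only runs if DESeq2 is installed.

```r
has_deseq2 <- requireNamespace("DESeq2", quietly = TRUE)

if (has_deseq2) {
  suppressPackageStartupMessages(library(DESeq2))
  set.seed(1)

  # Simulated example data: 1000 genes, 6 samples in two conditions
  dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

  # Run the DE analysis, which includes the dispersion estimation steps
  dds <- DESeq(dds)

  # Plot dispersion estimates against mean normalized counts
  plotDispEsts(dds)
}
```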
Each black dot is a gene with associated mean and dispersion values. We expect dispersion values to decrease with increasing mean, which is what we see.
With only a few replicates per RNA-Seq experiment, gene-wise estimates of dispersion are often inaccurate, so DESeq2 uses information across all genes to determine the most likely dispersion estimate for a given mean expression value, shown with the red line in the figure. Genes with inaccurately small estimates of variation could yield many false positives, that is, genes identified as DE when they really are not.
Therefore, the original gene-wise dispersion estimates, shown as the black dots in the figure, are shrunken towards the curve to yield more accurate estimates of dispersion, shown as blue dots.
The more accurate, shrunken dispersion estimates are used to model the counts for determining the differentially-expressed genes.
Extremely high dispersion values, shown surrounded by blue circles, are not shrunken. Such a gene may genuinely have higher variability than others for biological or technical reasons, and reducing its variation could result in false positives.
The strength of the shrinkage depends on the distance from the curve and on the sample size. Larger numbers of replicates allow the mean and variation to be estimated more accurately, so they yield less shrinkage.
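To look at the numbers behind the plot, one option is to inspect the per-gene columns that DESeq2 stores on the object after `DESeq()` has run. The sketch below again uses simulated example data, not the course data, and only runs if DESeq2 is installed. The variable names `dds` and `disp` are illustrative.

```r
has_deseq2 <- requireNamespace("DESeq2", quietly = TRUE)

if (has_deseq2) {
  suppressPackageStartupMessages(library(DESeq2))
  set.seed(1)

  # Simulated example data: 1000 genes, 6 samples in two conditions
  dds <- DESeq(makeExampleDESeqDataSet(n = 1000, m = 6), quiet = TRUE)

  # Per-gene dispersion columns stored in the row metadata:
  #   dispGeneEst  gene-wise estimate (the black dots)
  #   dispFit      fitted value on the curve (the red line)
  #   dispersion   final estimate used for testing (the blue dots)
  #   dispOutlier  TRUE for genes whose high estimate was not shrunken
  disp <- as.data.frame(
    mcols(dds)[, c("baseMean", "dispGeneEst", "dispFit",
                   "dispersion", "dispOutlier")]
  )

  print(head(disp))
  print(table(disp$dispOutlier, useNA = "ifany"))
}
```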
8. DESeq2 model - dispersion
Worrisome plots include a cloud of points that doesn't follow the curve, or dispersions that don't decrease with increasing mean. These problems can often be explained by sample outliers or contamination. Examples of worrisome dispersion plots are shown in the figures.
9. Let's practice!
Time to explore the smoc2 data.