Count normalization

1. Count normalization

Now that we have our DESeq2 object created with the raw counts and metadata stored inside, we can start the DESeq2 workflow.

2. DESeq workflow - normalization

The first step in the workflow is to normalize the raw counts so that we can assess sample-level quality control metrics.

3. Count normalization

But what does it mean to normalize the raw counts? The raw counts represent the number of reads aligning to each gene and should be proportional to the expression of the RNA in the sample; however, there are factors other than RNA expression that can influence the number of reads aligning to each gene. We can adjust the count data to remove the influence of these factors on the overall counts using normalization methods. The main factors often considered during normalization of count data are library depth, gene length, and RNA composition.

4. Library depth normalization

Differences in library size between samples can lead to many more reads being aligned to genes in one sample than in another. In this example, sample A has nearly twice as many reads, represented as small rectangles, aligning to each gene as sample B, only because nearly twice as many reads were sequenced for sample A. Therefore, we need to adjust the counts assigned to each gene based on the size of the library before doing differential expression analysis.
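To make the library depth effect concrete, here is a small Python sketch. The gene names and counts are made up for illustration, and sample A is sequenced exactly twice as deeply as sample B. The simple total-count scaling shown here is only meant to illustrate the idea of adjusting for depth; it is not the method DESeq2 uses.

```python
# Hypothetical raw counts: sample A was sequenced twice as deeply as sample B.
counts = {
    "sample_A": {"gene1": 20, "gene2": 40, "gene3": 140},
    "sample_B": {"gene1": 10, "gene2": 20, "gene3": 70},
}


def scale_by_depth(sample_counts, per=100):
    """Rescale counts to a common library size (counts per `per` reads)."""
    total = sum(sample_counts.values())
    return {gene: n * per / total for gene, n in sample_counts.items()}


# After scaling, the two samples agree gene by gene.
scaled = {sample: scale_by_depth(c) for sample, c in counts.items()}
for sample, values in scaled.items():
    print(sample, values)
```

Before scaling, every gene in sample A looks twice as abundant as in sample B; after scaling, the difference disappears, which tells us it came from sequencing depth alone.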

5. Gene length normalization

Another normalization factor often adjusted for is gene length. A longer gene generates a longer transcript, which generates more fragments for sequencing. Therefore, a longer gene will often have more counts than a shorter gene with the same level of expression. In this example, gene X is twice as long as gene Y and, due to the difference in length, is assigned twice as many reads. Since DE analysis compares expression levels of the same genes between conditions, we do not need to normalize for gene length. However, if you were to compare the expression levels of different genes, you would need to account for lengths of the genes.
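The following Python sketch illustrates both points with made-up numbers: gene lengths, counts, and the "control" and "treated" labels are all assumptions for the example. Dividing by length makes two different genes comparable, while for the same gene compared between conditions the length cancels out of the ratio.

```python
# Hypothetical values: gene X is twice as long as gene Y,
# and both are expressed at the same level.
gene_length_kb = {"gene_X": 4.0, "gene_Y": 2.0}
raw_counts = {"gene_X": 200, "gene_Y": 100}

# Comparing different genes: divide by length (reads per kilobase).
per_kb = {gene: raw_counts[gene] / gene_length_kb[gene] for gene in raw_counts}
print(per_kb)

# Comparing the same gene between conditions: the length cancels.
control_count, treated_count = 100, 300  # hypothetical counts for gene Y
length = gene_length_kb["gene_Y"]
fold_change_raw = treated_count / control_count
fold_change_per_kb = (treated_count / length) / (control_count / length)
print(fold_change_raw, fold_change_per_kb)
```

The per-kilobase values for gene X and gene Y match, and the fold change for gene Y is the same with or without the length adjustment.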

6. Library composition effect

When adjusting for library size, the composition of the library is also important. A few highly differentially expressed genes can skew many normalization methods that are not resistant to these outliers. In this image, we can see that the green DE gene takes up a large proportion of reads for Sample A. If we just divided our counts by the total number of reads, normalization for the majority of genes would be skewed by the highly expressed DE gene. For this reason, when performing a DE analysis, we need to use a method that is resistant to these outlier genes.
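As a rough illustration of this skew, the Python sketch below uses made-up counts. Both samples have the same total number of reads, and we assume that genes g1 to g4 are truly unchanged while de_gene is strongly up in sample A. Dividing by the total then makes the unchanged genes look different between samples.

```python
# Hypothetical counts with equal library sizes (1100 reads each).
# Assumption: g1..g4 are not differentially expressed; de_gene is.
sample_a = {"g1": 50, "g2": 100, "g3": 150, "g4": 200, "de_gene": 600}
sample_b = {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "de_gene": 100}


def fraction_of_total(sample):
    """Naive normalization: each count divided by the sample's total reads."""
    total = sum(sample.values())
    return {gene: n / total for gene, n in sample.items()}


frac_a = fraction_of_total(sample_a)
frac_b = fraction_of_total(sample_b)

# Ratio of A to B after total-count normalization, gene by gene.
apparent_ratio = {gene: frac_a[gene] / frac_b[gene] for gene in sample_a}
for gene, ratio in apparent_ratio.items():
    print(gene, round(ratio, 2))
```

Under these assumptions, the four unchanged genes all appear to be about half as abundant in sample A, purely because the DE gene takes up so many of sample A's reads.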

7. DESeq2 normalization

DESeq2 uses a 'median of ratios' method of normalization. This method adjusts the raw counts for library size and is resistant to a subset of highly differentially expressed genes, because it relies on the assumption that most genes are not differentially expressed.
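The idea behind median of ratios can be sketched in a few lines of Python. This is a simplified illustration of the calculation on made-up counts, not DESeq2's own code; the function name and the data are assumptions. For each gene we take the geometric mean across samples as a reference, divide each sample's count by that reference, and use the median of those ratios within a sample as its size factor.

```python
import math
from statistics import median

# Hypothetical counts: g1..g4 are assumed unchanged, de_gene is strongly up in A.
counts = {
    "sample_A": {"g1": 50, "g2": 100, "g3": 150, "g4": 200, "de_gene": 600},
    "sample_B": {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "de_gene": 100},
}


def median_of_ratios_size_factors(counts):
    """Simplified median-of-ratios size factors (illustrative sketch only)."""
    samples = list(counts)
    genes = list(counts[samples[0]])
    # Skip genes with a zero count in any sample: their log is undefined.
    usable = [g for g in genes if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples (the reference).
    log_ref = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    # Size factor: median over genes of the sample-to-reference ratio.
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_ref[g] for g in usable))
        for s in samples
    }


size_factors = median_of_ratios_size_factors(counts)
print(size_factors)
```

Because the median ignores the extreme ratio of the single DE gene, the size factors here are driven by the four unchanged genes (roughly 0.71 for sample A and 1.41 for sample B).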

8. Normalized counts: calculation

To calculate the normalized counts with DESeq2, we first use the function estimateSizeFactors() on the DESeq2 object, dds_wt, and re-assign the result to dds_wt so that the size factors are stored inside the object. DESeq2 will use these size factors to normalize the raw counts: the raw counts for each sample are divided by that sample's size factor. To view the size factors used for normalization, we can use the sizeFactors() function.
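The division step itself is simple arithmetic. The Python sketch below uses made-up counts and made-up size factors to show it; in DESeq2 the size factors would come from estimateSizeFactors(), not be typed in by hand.

```python
# Hypothetical raw counts and size factors, for illustration only.
raw_counts = {
    "sample_1": {"gene1": 30, "gene2": 120},
    "sample_2": {"gene1": 8, "gene2": 29},
}
size_factors = {"sample_1": 2.0, "sample_2": 0.5}

# Each sample's raw counts are divided by that sample's size factor.
normalized = {
    sample: {gene: n / size_factors[sample] for gene, n in gene_counts.items()}
    for sample, gene_counts in raw_counts.items()
}
print(normalized)
```

A size factor above 1 scales a deeply sequenced sample down, and a size factor below 1 scales a shallowly sequenced sample up.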

9. Normalized counts: extraction

Once the size factors have been calculated and added to the DESeq2 object, we can extract the normalized counts from it using the counts() function with the argument normalized = TRUE. If we left the default, normalized = FALSE, we would extract the raw counts instead.

10. Let's practice!

Now it's your turn to practice normalizing counts and extracting them from the DESeq2 object.