Differential gene expression overview

1. Differential gene expression overview

Differential gene expression analysis is a powerful technique for determining whether genes are expressed at significantly different levels between two or more sample groups. We will use the DESeq2 package to model the gene counts and identify differentially expressed genes.

2. Differential expression analysis: Goal

In this image, we see a heat map with genes as rows, colored by number of counts. The genes shown are those with large expression differences, or fold changes, between sample groups. To determine which genes are differentially expressed, one might ask, 'Why not just identify the genes with the largest fold changes in expression between sample groups?'

3. Differential expression analysis: Goal

To get at the answer, let's look at the plot of normalized counts for gene A. The points represent the gene A expression levels for five biological replicates in each of the 'untreated' and 'treated' conditions. The mean expression for the 'treated' condition is over twice that of the 'untreated' condition. However, there appears to be greater variation in the 'treated' condition, so the difference in expression may not be significant. We need to account for variation in the data when we determine whether genes are differentially expressed. Therefore, the goal of differential expression analysis is to determine, for each gene, whether the differences in expression between groups are significant given the amount of variation within groups, that is, between the biological replicates.
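
To make the point about variation concrete, here is a minimal Python sketch using made-up normalized counts for a hypothetical gene with five replicates per condition. It computes the fold change and an exact permutation p-value on the difference in group means. The numbers, the function name, and the choice of a permutation test are all illustrative assumptions: DESeq2 does not test genes this way, and the course itself works in R.

```python
from itertools import combinations

# Made-up normalized counts for a hypothetical "gene A" (not real data).
untreated = [100, 120, 110, 90, 105]
treated = [60, 500, 150, 90, 400]


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Every way of relabelling the pooled values into two groups of the
    original sizes is enumerated. The p-value is the fraction of
    relabellings whose mean difference is at least as extreme as the
    observed one.
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def scaled_diff(sum_a):
        # |mean_a - mean_b| multiplied by n_a * n_b, so that integer
        # inputs are compared without floating-point rounding.
        return abs(n_b * sum_a - n_a * (total - sum_a))

    observed = scaled_diff(sum(group_a))
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        if scaled_diff(sum(pooled[i] for i in idx)) >= observed:
            extreme += 1
    return extreme / n_splits


fold_change = (sum(treated) / len(treated)) / (sum(untreated) / len(untreated))
p_value = permutation_p_value(treated, untreated)

print(f"fold change: {fold_change:.2f}")
print(f"permutation p-value: {p_value:.3f}")
```

With these illustrative values the treated mean is more than twice the untreated mean, yet the p-value stays well above 0.05, because two high replicates drive most of the difference.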

4. Introduction to dataset

To explore the workflow, we will use a publicly available RNA-Seq dataset from Gerarduzzi et al., published in the journal JCI Insight. The goal of the RNA-Seq experiment in this paper was to explore why mice over-expressing the Smoc2 gene, that is, producing more Smoc2 mRNA than normal, are more likely to develop kidney fibrosis.

5. Introduction to dataset: Smoc2

Smoc2, or Secreted modular calcium-binding protein 2, has been shown to have increased expression in kidney fibrosis, which is characterized by an excess of extracellular matrix in the space between tubules and capillaries within the kidney. However, it is unknown how Smoc2 functions in the induction and progression of fibrosis.

6. Introduction to dataset: Experimental design

There are four sample groups being tested: normal control mice, referred to as wild type mice, with and without fibrosis, and Smoc2 over-expressing mice with and without fibrosis. There are three biological replicates for each group of normal samples and four replicates for each group of fibrosis samples. Initially, we will explore the effect of fibrosis on gene expression using the 'Wild type' samples during lectures and the 'Smoc2 over-expression' data during exercises.

7. RNA-Seq count distribution

To test whether the expression of genes differs significantly between two or more groups, we need an appropriate statistical model, and the choice of model is determined by the count distribution. When we plot the distribution of counts for a single sample, we can see key features of RNA-Seq count data, including a large proportion of genes with low counts and many genes with zero counts. Also note the long right tail, which arises because there is no upper limit on expression in RNA-Seq data. If there were no expression variation between biological replicates, a frequently used count distribution known as the Poisson distribution would be an appropriate model. However, there is always biological variation, and this additional variation in RNA-Seq data can be modeled well using the negative binomial model, which we will be using as part of DESeq2.
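
The difference between the two models shows up in their mean-variance relationship. Under a Poisson model the variance equals the mean, whereas the negative binomial parameterization used by DESeq2 adds a term that grows with the square of the mean: variance = mean + dispersion * mean^2. The short Python sketch below compares the two for a few arbitrary mean counts; the function names and the dispersion value of 0.1 are illustrative assumptions, not values from the dataset.

```python
def poisson_variance(mean):
    """Variance of a Poisson count with the given mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Arbitrary illustrative values, not estimated from any data.
dispersion = 0.1
rows = []
for mean in (10, 100, 1000):
    rows.append(
        (mean, poisson_variance(mean), negative_binomial_variance(mean, dispersion))
    )

print("mean  poisson_var  negbin_var")
for mean, pois_var, nb_var in rows:
    print(f"{mean:>4}  {pois_var:>11}  {nb_var:>10.0f}")
```

With a dispersion of zero the negative binomial variance reduces to the Poisson case. For any positive dispersion the extra variance becomes more pronounced as the mean count grows.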

8. Preparation for differential expression analysis: DESeq2 object

To start the differential expression analysis, we use the `DESeqDataSetFromMatrix()` function, which takes a raw count matrix as input, along with the metadata and a design formula, to create the DESeq2 object. The design formula should contain the major expected sources of variation to control for, with the condition of interest as the last term in the formula. If the raw count data is instead stored as a SummarizedExperiment, comes from the htseq-count tool, or was generated by pseudo-alignment tools, DESeq2 has other functions for creating the DESeq2 object, as detailed in the vignette.

9. Preparation for differential expression analysis: metadata

In addition to our raw counts, we require sample metadata. At the very least, we need to know which of our samples correspond to each condition. To generate our metadata, we create a vector for each column and combine the vectors into a data frame. The sample names are added as the row names.
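
The structure described above can be sketched in plain Python, purely to show the shape of the table. The course builds this as an R data frame. The sample names and column names here are hypothetical; only the group sizes (three normal and four fibrosis wild type replicates) come from the experimental design described earlier.

```python
# Hypothetical sample names for the wild type samples.
sample_names = [
    "wt_normal1", "wt_normal2", "wt_normal3",
    "wt_fibrosis1", "wt_fibrosis2", "wt_fibrosis3", "wt_fibrosis4",
]

# One vector per metadata column, in the same order as sample_names.
genotype = ["wt"] * 7
condition = ["normal"] * 3 + ["fibrosis"] * 4

# Combine the column vectors into one table keyed by sample name,
# which plays the role of the data frame's row names.
metadata = {
    name: {"genotype": g, "condition": c}
    for name, g, c in zip(sample_names, genotype, condition)
}


def samples_by_condition(table):
    """Group sample names by their 'condition' value."""
    groups = {}
    for name, row in table.items():
        groups.setdefault(row["condition"], []).append(name)
    return groups


groups = samples_by_condition(metadata)
for cond, names in groups.items():
    print(cond, names)
```

Grouping the samples by condition is a quick way to confirm that every sample has been assigned to the group it belongs to before starting the analysis.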

10. Preparation for differential expression analysis: metadata

After we have the counts and metadata files, we can start the differential expression analysis workflow.

11. Let's practice!

Let's practice exploring counts and getting our files ready for analysis.
