Visualizing the results

1. Visualizing the results

After performing the differential expression tests, you will need to visualize the results.

2. Inspecting the results

Recall from chapter 2 that you ended the analysis by counting the number of differentially expressed genes. While that is the end of the test, it is just the start of exploring the results. In this video I will use the results from the breast cancer data, which compare tumors that are positive or negative for the estrogen receptor. The limma function `topTable` returns the top differentially expressed genes (10 by default). The first 3 columns are the feature data from the original ExpressionSet object: the gene symbol, the Entrez database ID, and the chromosomal location. The next columns are the log-fold change in expression between the two groups, the average expression, the moderated t-statistic, and the p-value. The column after that is the p-value adjusted for testing multiple hypotheses, which by default is computed using the Benjamini-Hochberg method to control the false discovery rate, or FDR. Lastly, the column B is the log-odds that the gene is differentially expressed, an alternative metric to the p-value.
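To make the adjusted p-value column more concrete, here is a minimal sketch in base R of the Benjamini-Hochberg adjustment. The five p-values are made up for illustration and do not come from the breast cancer data; `p.adjust` is the base R function for this adjustment.

```r
# Hypothetical raw p-values for five genes (invented for illustration)
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment using base R
adj_p <- p.adjust(p_values, method = "BH")

# The same calculation by hand: scale each p-value by m / rank,
# then take the running minimum starting from the largest p-value
m <- length(p_values)
ord <- order(p_values, decreasing = TRUE)
manual <- numeric(m)
manual[ord] <- pmin(1, cummin(p_values[ord] * m / (m:1)))

# Side-by-side comparison of raw and adjusted values
print(data.frame(p_values, adj_p, manual))
```

An adjusted p-value is never smaller than the raw p-value, which is why fewer genes pass a given cutoff after adjustment.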

3. Obtain results for all genes

Viewing the top DE genes is convenient for a quick assessment, but to investigate the results more thoroughly, you'll want the summary statistics for every gene. To do this, pass the number of rows in the `fit2` object, `nrow(fit2)`, to the `topTable` argument `number`. Furthermore, disable sorting by statistical significance by setting the `sort.by` argument to `"none"`. This keeps the rows of the results in the same order as the original ExpressionSet object, making it easier to compare or combine the results with the input data.
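The call described above might look like the following sketch. Because the breast cancer data is not available here, it fits a model to a small simulated matrix; the matrix `x`, the factor `group`, and the use of `coef = 2` are assumptions for illustration, and the course's `fit2` comes from its own design and contrasts matrices.

```r
library(limma)

set.seed(1)
# Simulated stand-in for an expression matrix: 100 genes by 6 samples
x <- matrix(rnorm(600), nrow = 100,
            dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
group <- factor(rep(c("neg", "pos"), each = 3))
design <- model.matrix(~ group)

# Fit the linear model and compute moderated statistics
fit <- lmFit(x, design)
fit2 <- eBayes(fit)

# Statistics for every gene, in the original row order
results <- topTable(fit2, coef = 2, number = nrow(fit2), sort.by = "none")

head(results)
```

With `sort.by = "none"`, the row names of `results` line up with the rows of the input matrix, so the two can be combined with `cbind` without any reordering.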

4. Histogram of p-values

One useful visualization is a p-value histogram. Under the null hypothesis of no differential expression, you expect the p-values to be uniformly distributed. Here I simulate 10,000 draws from a uniform distribution and plot the histogram using the R function `hist`. On the other hand, if you find lots of differentially expressed genes, you expect to see a right-skewed histogram with a large peak close to zero due to the many statistically significant p-values. Plotting the p-values from the breast cancer study, you see many low p-values. If the histogram matches neither of these two patterns, you should review your code, especially your design and contrasts matrices, to ensure you have tested the correct hypotheses.
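Both patterns can be reproduced with a short simulation in base R. This is only a sketch: the null case uses the 10,000 uniform draws mentioned above, while the second case is a made-up mixture (8,000 uniform p-values plus 2,000 small ones drawn from a beta distribution) standing in for a study with many differentially expressed genes.

```r
set.seed(42)

# Null case: 10,000 p-values drawn from a uniform distribution
p_null <- runif(10000)

# Hypothetical case with signal: mostly null p-values plus many small ones
p_signal <- c(runif(8000), rbeta(2000, shape1 = 0.5, shape2 = 20))

# Use the same 20 bins for both histograms so they are comparable
bins <- seq(0, 1, by = 0.05)
par(mfrow = c(1, 2))
h_null <- hist(p_null, breaks = bins, main = "No differential expression",
               xlab = "p-value")
h_signal <- hist(p_signal, breaks = bins, main = "Many DE genes",
                 xlab = "p-value")
```

The first histogram is roughly flat, and the second has its tallest bar in the bin closest to zero.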

5. Volcano plot

Another common visualization is the volcano plot, which you can create with the limma function `volcanoplot`. The first argument is the fitted model object. Optionally you can label the top DE genes. Here I set the argument `highlight` to 5 to label the 5 most significant genes. I also need to pass a vector of the labels to use for these genes to the argument `names`. The fitted model object `fit2` has a data frame `genes` that contains the feature data, and I get the gene symbols from the column `"symbol"`. On the x-axis is the log-fold change between the ER negative and positive tumors. On the y-axis is the log-odds of differential expression. (Recent versions of limma plot the negative log10 p-value on the y-axis by default; setting `style` to `"B-statistic"` gives the log-odds.) The higher the log-odds, the more likely the gene is differentially expressed, which is why the top 5 labeled genes are at the very top of the plot. Note the typical shape. Volcano plots should always have this shape because genes that have a larger log-fold change are more likely to be differentially expressed. However, always make sure to note the range on the x and y axes because this shape does not guarantee that the genes are differentially expressed.
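As a rough sketch of the call described above, the following code draws a volcano plot from simulated data. The expression matrix, the shift added to the first five genes, and the gene symbols `G1` to `G100` are all invented for illustration. The `style = "B-statistic"` argument is an assumption that your installed limma version supports it; it puts the log-odds on the y-axis, as described in the video.

```r
library(limma)

set.seed(1)
# Simulated stand-in for an expression matrix: 100 genes by 6 samples
x <- matrix(rnorm(600), nrow = 100)
# Give the first five genes a true difference between the groups
x[1:5, 4:6] <- x[1:5, 4:6] + 3

group <- factor(rep(c("neg", "pos"), each = 3))
design <- model.matrix(~ group)

fit <- lmFit(x, design)
# Attach hypothetical feature data with a "symbol" column
fit$genes <- data.frame(symbol = paste0("G", 1:100))
fit2 <- eBayes(fit)

# Label the 5 most significant genes with their symbols
volcanoplot(fit2, coef = 2, style = "B-statistic",
            highlight = 5, names = fit2$genes[, "symbol"])
```

The labeled genes sit at the top of the plot, and the points fan out to the left and right as the log-fold change moves away from zero.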

6. Let's practice!

Now it's your turn to visualize the differential expression results from the leukemia study.