Enrichment testing

1. Enrichment testing

Next you will learn how to test for enrichment of gene sets in differential expression results.

2. Interpreting the results

Continuing with the breast cancer data, there were over 11,000 differentially expressed genes. How can you start to make sense of these results? Examining the genes one-by-one would not only be tedious, but also ineffective because it's impossible for you to know how all these thousands of genes are related.

3. Biological databases

Fortunately, there are biological databases that curate sets of related genes. One is KEGG (Kyoto Encyclopedia of Genes and Genomes), which curates sets of genes that interact in the same biological pathway, for example photosynthesis or protein transport. The second we'll discuss is the Gene Ontology, or GO. Its categories are hierarchical and range from the very broad to the very specific.

4. Enrichment testing

Thus you want to know whether the differentially expressed genes in your experiment appear in any known gene set more often than expected by chance, which is known as enrichment. A straightforward method is to use Fisher's exact test to test for imbalances in a 2-by-2 contingency table of gene set membership against DE status. Imagine a hypothetical experiment with 1000 genes, of which 100 were DE. If 10 of the 100 DE genes are in a particular gene set, but so are 100 out of all 1000 genes, then both rates are 10% and there is no enrichment in this gene set over background. Consistent with this, Fisher's exact test reports an odds ratio of 1 and a p-value of 1.
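As a minimal sketch of this test in base R, the hypothetical counts can be passed to `fisher.test()`. The table itself is not reproduced in this transcript, so the layout used here (DE genes in one column, all 1000 genes as background in the other) is an assumption inferred from the numbers quoted, and the object names are illustrative.

```r
# Rows: gene set membership; columns: DE genes vs. all genes (background).
# Counts are the hypothetical ones from the text; the layout is an assumption.
no_enrichment <- matrix(
  c(10, 90,     # DE genes: 10 in the set, 90 not in the set
    100, 900),  # all genes: 100 in the set, 900 not in the set
  nrow = 2,
  dimnames = list(gene_set = c("in", "out"), group = c("DE", "background"))
)

result_null <- fisher.test(no_enrichment)
result_null$estimate  # conditional estimate of the odds ratio
result_null$p.value   # two-sided p-value
```

Both columns have the same 10% rate of set membership, so the test should find no evidence of imbalance.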

5. Enrichment testing

In contrast, if 30 of the 100 DE genes were in that gene set, this would be an enrichment, since 30% is larger than the background rate of 10%. Consistent with this, Fisher's exact test reports an odds ratio of 3.85 and a very low p-value.
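The enriched case can be sketched the same way. The layout below (DE genes compared against all 1000 genes as background) is an assumption, chosen because it is the one that matches the odds ratio of about 3.85 quoted above. A table that separates DE from non-DE genes (30, 70, 70, 830) would give a larger cross-product ratio of about 5.08.

```r
# Rows: gene set membership; columns: DE genes vs. all genes (background).
# Counts are the hypothetical ones from the text; the layout is an assumption.
enriched <- matrix(
  c(30, 70,     # DE genes: 30 in the set, 70 not in the set
    100, 900),  # all genes: 100 in the set, 900 not in the set
  nrow = 2,
  dimnames = list(gene_set = c("in", "out"), group = c("DE", "background"))
)

result_enriched <- fisher.test(enriched)
result_enriched$estimate  # conditional estimate of the odds ratio
result_enriched$p.value   # two-sided p-value

# Simple cross-product odds ratio for comparison: (30 * 900) / (70 * 100)
cross_product <- (enriched[1, 1] * enriched[2, 2]) /
  (enriched[2, 1] * enriched[1, 2])
```

Note that `fisher.test()` reports a conditional maximum likelihood estimate, so its odds ratio can differ slightly from the simple cross-product ratio.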

6. Testing for KEGG enrichment

limma provides functions for enrichment testing. To connect the genes you tested with those in the database, you need to use a common ID. The limma functions require Entrez gene IDs. The feature data from the ExpressionSet object is stored in the fitted model object as the data frame `genes`, which you can access like a list element using the dollar sign notation. You extract the entrez column and save it in a variable of the same name. The KEGG enrichment is performed with `kegga`. You pass it the fitted model object, the vector of gene IDs, and the species abbreviation, `"Hs"` for _Homo sapiens_. You view the top enriched pathways with `topKEGG`. This displays the ID and name of the pathway, the number of genes in the set, the number that were up- and down-regulated, and separate p-values for enrichment of up- and down-regulated genes. The top result is the cell cycle pathway, which makes sense since this pathway is often dysregulated in cancer cells.
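The steps described above can be sketched as follows. This is an illustration rather than the exact code from the slides: the helper name `test_kegg_enrichment` and its `n_top` argument are hypothetical, and `fit` is assumed to be a fitted model object returned by `eBayes()` whose `genes` data frame contains a column named `entrez`.

```r
library(limma)

# Hypothetical helper. Assumes `fit` is an MArrayLM object from eBayes()
# and that fit$genes has a column named "entrez" with Entrez gene IDs.
test_kegg_enrichment <- function(fit, n_top = 3) {
  # Extract the Entrez IDs from the feature data stored in the fit
  entrez <- fit$genes[, "entrez"]

  # Test each KEGG pathway for enrichment of up- and down-regulated genes
  enrich_kegg <- kegga(fit, geneid = entrez, species = "Hs")

  # Return the n_top most significant pathways
  topKEGG(enrich_kegg, number = n_top)
}
```

With a fitted model object available, for example one named `fit2`, you would call `test_kegg_enrichment(fit2)`. Note that `kegga` retrieves pathway annotation from the KEGG website, so it needs an internet connection.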

7. Testing for GO enrichment

The process for testing enrichment of GO categories is similar. The function is `goana`, and the input arguments are identical. You can view the top results with `topGO`. Here you can specify the type of ontology. You can learn more about these by visiting the GO Consortium website. I personally find Biological Process, or BP, the most informative. The top 3 GO categories are all about the immune system. This redundancy is a downside of the hierarchical nature of GO: the same set of genes can be the underlying signal for many similar categories.
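A corresponding sketch for GO enrichment is shown below. As before, the helper name `test_go_enrichment` and its arguments are hypothetical, and `fit` is assumed to be a fitted model object from `eBayes()` with an `entrez` column in its `genes` data frame.

```r
library(limma)

# Hypothetical helper. Assumes `fit` is an MArrayLM object from eBayes()
# and that fit$genes has a column named "entrez" with Entrez gene IDs.
test_go_enrichment <- function(fit, ontology = "BP", n_top = 3) {
  # Extract the Entrez IDs from the feature data stored in the fit
  entrez <- fit$genes[, "entrez"]

  # Test each GO category for enrichment of up- and down-regulated genes
  enrich_go <- goana(fit, geneid = entrez, species = "Hs")

  # Return the n_top most significant categories for the chosen ontology:
  # "BP" (Biological Process), "CC" (Cellular Component),
  # or "MF" (Molecular Function)
  topGO(enrich_go, ontology = ontology, number = n_top)
}
```

For human data, `goana` relies on the Bioconductor annotation packages `GO.db` and `org.Hs.eg.db`, so these need to be installed for the call to work.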

8. Caveats

Lastly, some caveats. Don't overinterpret enrichment results. Instead, view them as validation of your experiment and as a tool to further investigate genes of interest. Be skeptical of the distinction between up- and down-regulated genes. This assumes more than we likely know about the role of the genes in the pathway. Use only the genes that were tested as the background, which limma does by default. If the background is instead all genes in the genome, this will bias the results. Finally, there are more sophisticated methods you can try.

9. Let's practice!

Now it's your turn to do enrichment testing on the leukemia data.
