1. Interpreting ChIP-seq peaks
In the previous chapter you have learned how to determine differences in peak intensity between groups of samples. Now it is time to think about how to interpret these results. What does it mean that this set of peaks is more pronounced in treatment resistant tumours than primary tumours? How can we use this information to gain a better understanding of the processes that drive the different response to treatment?
2. Interpreting peaks
To better understand the peak calls, it is useful to identify genes that are likely to be regulated by the binding of the transcription factor that produced the ChIP-seq peak.
3. Interpreting peaks
Remember that it is the changes in gene regulation that we are really interested in. So how do we know which gene is affected by a given binding site?
4. Interpreting peaks
Without additional information, it is impossible to know for sure. A commonly used heuristic is to assign peaks to the closest gene.
5. Annotating peaks
The first step in this process is to obtain information about the location of genes in the genome. Then we can proceed to assign peaks to the closest gene. This in turn allows us to create a list of genes with changes in protein binding between the two groups. Let's look at that process in a bit more detail.
6. Transcript Annotations
So how do we know where genes are located in the genome? Fortunately Bioconductor provides annotation packages that make this information readily available in R. The `TxDb` packages provide information about the location of all known transcripts and genes in a given genome. We will be working with the human genome build hg19. This uses Entrez IDs to identify genes. These IDs have the advantage of uniquely identifying genes but are not very human readable.
7. Additional Annotations
Additional annotation packages are available to translate Entrez IDs into other identifiers that may be more useful in communicating results. Gene symbols are shorthand, acronym-like, names commonly used to refer to genes. Here you can see an example of how Entrez IDs can be translated into these gene symbols. Be careful when working with gene symbols. It is not uncommon for the same gene to be associated with multiple symbols, which can lead to confusion. It is usually best to work with unique IDs, like Entrez IDs, for as long as possible.
8. Annotating Peaks
With the gene annotations in hand we are now in a position to annotate peaks with their closest gene. The `annoPeaks()` function from the *ChIPpeakAnno* package can take care of that. It requires two *GRanges* objects. One with peak coordinates, the other with annotations. In this case gene coordinates and symbols. It will then match them up according to the criteria specified by the other parameters. Here we require peaks to be within 5kb of the transcription start site, which I have specified using the `bindingType` and `bindingRegion` arguments.
9. Visualizing similarities and differences
It is always a good idea to plot your data. If you want to take a look at the extent to which the samples share peaks or corresponding genes one commonly used plot is the Venn diagram. A plot like this can be created with the `dba.plotVenn()` function, using The *DBA* object created during the earlier analysis. Here we are looking at the overlap between the two primary tumor samples. That is very convenient but only really useful for small numbers of samples.
10. Using UpSet plots
For larger datasets UpSet plots are likely to be more useful. Here is an example of the same data displayed as an UpSet plot, created with the *UpSetR* package. The `upset()` function expects a data frame that indicates for each peak in which samples it was present. The *DBA* object provides this in the `called` entry.
This type of plot remains readable and informative even for relatively large numbers of samples and is a good alternative to Venn diagrams. Here you can see the overlap between the same two samples we looked at with a Venn diagram before.
11. Let's practice!
Now let's try some examples.