1. Assessing enrichment
Now that you have learned how to import and clean ChIP-seq data it is time to take a closer look at how to properly aggregate these data to prepare for the analysis.
2. Extending reads
Remember that the reads in the dataset are generated by sequencing the ends of DNA fragments that the protein of interest is bound to. Typically, this will result in several reads from both ends of the fragment mapping to similar locations in the genome, clustering either side of the protein binding site. The signal becomes a lot clearer once reads are extended to the length of the full fragment. Reads from both ends now overlap and form a pronounced peak in read coverage over the binding site. This peak is now much easier to distinguish from the background noise of unrelated reads.
3. Extending reads
What does this look like with real data? Here is an example of a peak located on chromosome 20. The two coverage tracks at the top show read coverage for the forward and reverse strand respectively. At the bottom you can see the total coverage after reads have been extended to the mean fragment length.
4. Extending reads
How do you go about extending reads in R? After loading the data, the first step is to determine the average fragment length. You can extract this information from the QC report generated by the *ChIPQC* package using the `fragmentlength()` function. This takes the report object and returns the average fragment length for each sample. Be careful to use the information for the appropriate sample. With that information in hand reads can be extended with the very useful `resize()` function from the *GenomicRanges* package. This takes a *GRanges* object with read coordinates and allows you to specify the desired width of a fragment via the `width` argument. All reads will then be extended accordingly. Once that is done, you can compute the coverage as before.
5. Enrichment
Once you have extended reads to the average fragment length a reasonable question to ask is "How does coverage in peaks compare to coverage in other parts of the genome?" The first step in addressing this question is to partition the genome into short intervals. We'll be using 200 bp long ones. This provides a convenient unit for comparison. We can then assign each bin either to a peak, to a blacklisted region, or background.
6. Coverage for peaks
To do this in R you'll start by creating the 200 bp bins using the `tileGenome` function from the *GenomicRanges* package. You can then find the overlap between the peaks and these bins using the `findOverlaps` function. The `from` function helps us obtain the index of each bin that overlaps with a peak. Finally, you can count the number of fragments in each of the selected bins with the `countOverlaps` function.
7. Binned coverage function
For convenience, we can wrap this code up in a simple function. This accepts three *GRanges* objects, for reads, target regions, and bins respectively, and returns a *GRanges* object containing the coordinates bins overlapping the target regions. Each bin in the output is annotated with the number of reads that fall into this bin.
8. Coverage for blacklisted regions
You can then obtain counts for the blacklisted regions in exactly the same way.
9. Background coverage
One way to measure background coverage is to consider the coverage for all remaining bins after peaks and blacklisted regions have been removed. You can obtain the background bins by using the `subset()` function. The `countOverlaps()` function will then count the number of reads in each of these bins.
10. Coverage comparison
This boxplot shows the coverage distribution in bins assigned to background, blacklisted regions, and peaks respectively. It is a bit hard to see what is going on here because some of the blacklisted bins have enormous coverage.
11. Coverage comparison
If we zoom in a bit, things become clearer.
12. Coverage comparison
Now it is clear that the average coverage for peaks is higher than for either of the other categories. You already know that blacklisted regions can have artificially inflated coverage, which is why we exclude them from analysis. But what is going on with the background? Why do some of the bins have such high coverage? We'll look at that in more detail during the next chapter.
13. Let's practice!
First, let's try some examples.