Cleaning ChIP-seq data

1. Cleaning ChIP-seq data

Now that you know how to import the data into R, it is time to think about the steps necessary to clean up the data. This will help you to remove artefacts and reduce the amount of noise inherent to these data.

2. Common Problems

Several different mechanisms can lead to the apparent accumulation of reads in parts of the genome that don't actually contain binding sites for the protein of interest. For example, there are certain sequences throughout the genome, known as repeats, that occur over and over again. The origin of reads produced by any one of these repeated occurrences is difficult to pinpoint.

3. Common Problems

This is particularly problematic if the reference sequence and the actual genome of a sample differ in the number of repeat occurrences. Any extra copies of the repeat in the sample genome are bound to generate reads that are falsely attributed to one of the copies that is present in the reference.

4. Common Problems

You will see related problems with low-complexity regions of the genome, especially close to the ends of the chromosome arms. Because there is a lot of sequence similarity over extended regions, the origin of reads is difficult to determine. To make things worse, the quality of the reference sequence in these regions is also likely to be low.

5. Blacklisted Regions

Many regions that tend to accumulate incorrectly mapped reads are known and can be systematically excluded from the analysis. Here is an example of one such region close to one end of chromosome 1. As you can see, it contains several very pronounced peaks that are very likely to be artefacts rather than actual protein binding sites.

6. Amplification Bias

Another, unrelated issue arises because of the way DNA extracted from cells is processed prior to sequencing. It is usually necessary to create copies of the extracted fragments in order to obtain enough DNA for sequencing, but some fragments will produce more copies than others. This means that some fragments will produce multiple reads, which can pile up to give the appearance of a peak in coverage.

7. Quality Control Reports

It is useful to obtain summaries of all these potential problems in a systematic way across all samples in a study. Several tools are available that will compute quality metrics and summarise them in a report. One such tool is the R package *ChIPQC*. This produces an HTML report in your working directory with standard quality metrics for all samples in your study, presented as a series of tables and plots.

8. Preparing input files

All that you need to create the report, in addition to the BAM and BED files, is a comma-separated file containing a table with sample information. This table matches BAM and BED files to the sample ID and assigns each sample to a group based on the combination of 'Factor', 'Condition', 'Tissue', and 'Treatment' values provided. These groupings are used by *ChIPQC* to generate summaries for groups of samples.
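To make the layout of such a table concrete, here is a minimal sketch in base R. The sample IDs, file paths, and metadata values are all made up for illustration, and the `bamReads`, `Peaks`, and `Replicate` column names are assumed to follow the sample sheet convention used by *ChIPQC*; check the package documentation before relying on them.

```r
# Hypothetical sample sheet: IDs, paths and metadata values are placeholders.
ids <- c("s1", "s2", "s3", "s4")
samples <- data.frame(
  SampleID  = ids,
  Tissue    = "cell_line",
  Factor    = "TF",
  Condition = c("control", "control", "treated", "treated"),
  Treatment = "none",
  Replicate = c(1, 2, 1, 2),
  bamReads  = file.path("bam", paste0(ids, ".bam")),    # assumed column name
  Peaks     = file.path("peaks", paste0(ids, ".bed")),  # assumed column name
  stringsAsFactors = FALSE
)

# Write the table as a comma-separated file.
sheet_file <- file.path(tempdir(), "sample_sheet.csv")
write.csv(samples, sheet_file, row.names = FALSE)

# Samples sharing the same Factor/Condition/Tissue/Treatment combination
# fall into the same group.
groups <- interaction(samples$Factor, samples$Condition,
                      samples$Tissue, samples$Treatment, drop = TRUE)
table(groups)

# With ChIPQC installed and real BAM/BED files in place, the report could
# then be requested with something along these lines (not run here):
# qc <- ChIPQC::ChIPQC(sheet_file, annotation = "hg19")
# ChIPQC::ChIPQCreport(qc)
```

In this toy table the four samples collapse into two groups, one per condition, because all other grouping columns are constant.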

9. Cleaning the Data

So now that you know about the problems with the data, how should you deal with them? It is common to group all reads that share the same mapping coordinates and retain only one read alignment per group. This guards against amplification bias.
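The grouping idea can be sketched on a small, invented table of alignments. This is a base R illustration only: it assumes that "same mapping coordinates" means identical chromosome, start, end, and strand, and it keeps the first alignment in each group. Dedicated tools may define duplicates differently, for example by using only the 5' end of each read.

```r
# Toy alignments; coordinates are invented for illustration.
alignments <- data.frame(
  chrom  = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  start  = c(100, 100, 100, 250, 100),
  end    = c(150, 150, 150, 300, 150),
  strand = c("+", "+", "-", "+", "+"),
  stringsAsFactors = FALSE
)

# An alignment is a duplicate if an earlier row has identical coordinates.
is_dup <- duplicated(alignments[, c("chrom", "start", "end", "strand")])

# Keep one alignment per group of identical coordinates.
dedup <- alignments[!is_dup, ]

# Fraction of alignments flagged as duplicates.
dup_rate <- mean(is_dup)
```

Only the second row is dropped here: the third row shares its position with the first but lies on the opposite strand, so it counts as a separate group.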

10. Cleaning the Data

Similarly, reads that map to more than one location in the genome or have a low mapping quality, which generally is an indication that their alignment may be incorrect, are typically removed prior to peak calling.
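As a rough sketch of this filter, the example below keeps only alignments whose mapping quality reaches a cutoff. The read names and scores are invented, and the threshold of 20 is an assumption chosen for illustration. How multi-mapping reads are scored depends on the aligner, although many aligners give them a very low mapping quality, so a cutoff like this often removes them as well.

```r
# Toy alignments with mapping quality (MAPQ) scores; values are invented.
scored <- data.frame(
  name = paste0("read", 1:5),
  mapq = c(42, 0, 30, 3, 60),
  stringsAsFactors = FALSE
)

# Assumed cutoff for illustration; a suitable value depends on the aligner.
min_mapq <- 20

# Keep alignments that meet the cutoff.
kept <- scored[scored$mapq >= min_mapq, ]
```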

11. Cleaning the Data

Some peak callers have the capability to ignore reads in blacklisted regions and will not produce any peak calls for these. If this isn't the case, you should remove peaks in known problem regions prior to any further analysis. A suitable set of blacklisted regions is available from the ENCODE project, an international consortium that aims to create a comprehensive catalogue of functional elements in the human genome.
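Removing peaks in problem regions amounts to an interval overlap test. The sketch below does this in base R on invented coordinates. The helper `in_blacklist` is hypothetical, and intervals are treated as closed (both ends included), which is an assumption; BED files use half-open coordinates, so real data may need adjusting.

```r
# Toy peaks and a toy blacklist; all coordinates are invented.
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(500, 5000, 500),
  end   = c(800, 5400, 900),
  stringsAsFactors = FALSE
)
blacklist <- data.frame(
  chrom = "chr1", start = 700, end = 1200,
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each peak that overlaps any blacklisted
# region on the same chromosome (closed intervals assumed).
in_blacklist <- function(peaks, blacklist) {
  vapply(seq_len(nrow(peaks)), function(i) {
    any(blacklist$chrom == peaks$chrom[i] &
        blacklist$start <= peaks$end[i] &
        blacklist$end   >= peaks$start[i])
  }, logical(1))
}

# Keep only peaks outside the blacklist.
clean_peaks <- peaks[!in_blacklist(peaks, blacklist), ]
```

Only the first peak is removed, since the second lies elsewhere on chromosome 1 and the third is on a different chromosome. If the peaks and blacklist are held as *GenomicRanges* objects instead of data frames, the same filter can be written with `overlapsAny`, for example `peaks[!overlapsAny(peaks, blacklist)]`.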

12. Let's practice!

Let's take a look at how we can do this in R.