1. Importing data
Now that you have a general idea of the steps involved in the ChIP-seq workflow it is time to take a closer look at the details.
2. Preparing the data
The first step in processing ChIP-seq data is to map the reads that come off the sequencing machine to the reference genome, followed by the identification of peaks in the resulting coverage profile. Both of these tasks are typically carried out with specialised tools prior to importing data into R. The data you'll be working with were mapped with BWA and peaks were called with MACS but a variety of alternative tools are available.
3. Handling sequence reads
Mapped sequence reads are typically stored in a binary file format. Each record consists of information about the read sequence and how it was mapped to the reference genome. This tells you which part of the reference genome is most similar to this read, how the two differ as well as several indicators of how reliable this alignment between read and genome sequence is.
The file format also supports a number of optional tags that allow additional information to be embedded with each alignment. While it would be possible to work with these alignment records directly, it is much more convenient to use dedicated tools that extract the relevant information.
4. Importing mapped reads into R
Bioconductor contains the *Rsamtools* package for this purpose. Rsamtools provides functions that facilitate indexing, reading, filtering and writing of BAM files. Once BAM files have been indexed reads can be imported into R and processed using several different packages. We will be working with the *GenomicsAlignment* package and use the `readGAlignment()` function. This returns a GAlignments object with alignment information for each read.
5. Importing selected regions
You may not want to load all reads from a BAM file. Instead, it often is a good idea to limit data import only to regions of interest, like peak calls or specific genes. The BamViews function allows you to restrict processing of BAM files to a list of predefined regions. You can process multiple BAM files in this way by passing a vector of file names to BamViews.
6. Importing peak calls
You can combine this approach with peak calls, which are loaded from BED files using the `import.bed()` function. Unsurprisingly, this expects the name of a BED file. Providing a genome identifier, *hg19* in this case, to the `genome` argument allows additional information to be added to the output automatically. You can then pass the resulting *GRanges* object to `BamViews()` to restrict processing to all the peaks in the input file.
7. Let's practice!
Now it's your turn to try this yourself.