
Introduction to differential binding

1. Introduction to differential binding

In the previous chapter you learned about ChIP-seq data, how to import them into R and how to visualise the data and prepare them for analysis. Now it is time to get into the meat of that analysis.

2. Understanding the Difference

Remember that the data we are looking at were collected to understand the difference between prostate cancer cells that respond to treatment and those that have become resistant. Therefore the goal of our analysis is to identify molecular mechanisms that cause this difference in response, which can lead to very different outcomes for the patient.

3. Comparing samples

The first step in this process is to identify any substantial differences between the two groups of samples. In this chapter you will learn how to identify differences between samples at large and small scales. This will allow you to answer questions like, "which of these samples show protein binding patterns that are generally similar to each other", as well as, "how do the protein binding patterns differ between these two groups of samples?"

4. Global differences

One way to approach answering the first of these questions is to use principal component analysis. If you haven't heard of principal component analysis, or PCA, before, all you need to know right now is that it is a method used to uncover some of the underlying structure in a dataset. It identifies the directions, or principal components, with the most variation between data points. To see how this works, take a look at this example. Here we have two groups of observations plotted in three-dimensional space. Depending on how we look at the data, differences between the groups are more or less obvious.

5. PCA

Using the first two principal components we can define a two-dimensional plane that passes through the cloud of data points, minimizing the overall distance between points and the plane as much as possible.

6. PCA - part 2

Now, if we rotate the data such that we are looking straight at this plane we get a view that highlights the main differences between data points.

7. PCA - part 3

By projecting the points onto that plane we can then create a two-dimensional scatter plot. This is called a PCA plot.
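The projection described above can be sketched in base R. This is a minimal illustration only: the data are simulated, and the names `points3d`, `group` and `scores` are made up for this example. It uses `prcomp()` from base R rather than any ChIP-seq specific tooling.

```r
# Simulated data (an assumption for illustration): two groups of 20 points in 3D
set.seed(42)
group <- rep(c("A", "B"), each = 20)
points3d <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 3),  # group A, centred near the origin
  matrix(rnorm(60, mean = 4), ncol = 3)   # group B, shifted along all three axes
)
colnames(points3d) <- c("x", "y", "z")

# prcomp() centres the data and finds the principal components
pca <- prcomp(points3d)

# Coordinates of each point on the plane spanned by the first two components
scores <- pca$x[, 1:2]

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# PCA plot: one point per observation, coloured by group
plot(scores, col = ifelse(group == "A", "steelblue", "orange"), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Because the two simulated groups differ along all three axes, most of the variation should end up on the first component, so the groups separate along the horizontal axis of the plot.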

8. PCA plots for ChIP-seq data

The *ChIPQC* package makes it really easy to create PCA plots for ChIP-seq data. Using the output of the `ChIPQC()` function we can create a principal component plot of the samples in our data set. Note that we have to create a consistent set of peaks across all samples for this to work. The function `dba.count()` from the *DiffBind* package can provide us with a suitable set of consensus peaks. The `summits` argument determines the width of the resulting peaks. Once you have created a consensus peak set and corresponding read counts with `dba.count()`, you can pass the result directly to the `plotPrincomp()` function for plotting.

9. Hierarchical clustering

Another approach to exploring the structure of our dataset is to cluster samples. This clustering is based on the observed read counts for each peak. There are many ways to do this. We will look at hierarchical clustering here. Hierarchical clustering uses the pairwise distances between samples to build a hierarchy, or tree, of samples, known as a dendrogram. You can compute the required distances using the `dist()` function. This computes the distances between the rows of a matrix. If you want the distances between samples, and your samples are in the columns, don't forget to transpose the data first. The `hclust()` function will then produce the hierarchical structure we are interested in.
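As a rough sketch of these two steps, the example below clusters a small simulated count matrix with peaks in rows and samples in columns. The matrix, the sample names and the choice of two clusters are all assumptions made for illustration; real read counts would come from the consensus peak set.

```r
# Simulated read counts (an assumption): 100 peaks in rows, 6 samples in columns
set.seed(1)
counts <- cbind(
  matrix(rpois(300, lambda = 20), ncol = 3),  # three samples from one group
  matrix(rpois(300, lambda = 60), ncol = 3)   # three samples from the other group
)
colnames(counts) <- c("sensitive_1", "sensitive_2", "sensitive_3",
                      "resistant_1", "resistant_2", "resistant_3")

# dist() compares rows, so transpose to get one row per sample
sample_dist <- dist(t(counts))

# Build the tree from the pairwise distances (complete linkage by default)
sample_tree <- hclust(sample_dist)

# Draw the dendrogram
plot(sample_tree)

# Cut the tree into two clusters to get a group label for each sample
clusters <- cutree(sample_tree, k = 2)
```

Here `dist()` uses Euclidean distance by default; other distance measures or linkage methods may suit real count data better.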

10. Heatmaps

A useful way to employ this clustering is to use it to reorder samples and peaks based on similarity. This will group similar samples, and peaks with similar coverage, together, making it easier to spot similarities and differences between samples when the data are plotted. Note the large blocks of yellow and blue. Such a plot is known as a heatmap. The *DiffBind* package provides the function `dba.plotHeatmap()` to facilitate this. It requires a *DBA* object with peak calls for plotting. Here I've also set the `maxSites` argument to the total number of peak calls and `correlations` to `FALSE`. This ensures that all peaks are plotted, and that the plot shows the peaks themselves rather than correlations between samples.
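To make the reordering idea concrete, here is a small base R sketch that does by hand what a heatmap function does internally. It is not the *DiffBind* implementation: the count matrix is simulated, the names are hypothetical, and base R's `heatmap()` stands in for `dba.plotHeatmap()`.

```r
# Simulated read counts (an assumption): 50 peaks x 6 samples,
# with the first 25 peaks more strongly bound in samples 4 to 6
set.seed(7)
counts <- matrix(rpois(300, lambda = 20), nrow = 50)
counts[1:25, 4:6] <- rpois(75, lambda = 80)
colnames(counts) <- paste0("sample_", 1:6)
rownames(counts) <- paste0("peak_", 1:50)

# Shuffle rows and columns so that the block structure is hidden
shuffled <- counts[sample(nrow(counts)), sample(ncol(counts))]

# Cluster peaks (rows) and samples (columns) and extract the leaf order
peak_order <- hclust(dist(shuffled))$order
sample_order <- hclust(dist(t(shuffled)))$order

# Reorder the matrix so that similar peaks and similar samples sit together
reordered <- shuffled[peak_order, sample_order]

# Plot the reordered counts as they are; Rowv = NA and Colv = NA stop
# heatmap() from reordering the matrix a second time
heatmap(reordered, Rowv = NA, Colv = NA, scale = "none")
```

After reordering, the strongly bound peaks and the samples that share them form a contiguous block, which is what makes the pattern visible in the plot.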

11. Let's practice!

Now let's try some examples.