1. Accounting for technical batch effects
After pre-processing the genes, the next step is to check the samples.
2. What are technical batch effects?
The biggest concern for your data is the presence of technical batch effects. These are artifacts that arise because every batch of an experiment is processed slightly differently. This is true of every type of empirical result, not just functional genomics. However, because functional genomics experiments collect data on so many features at once, it is possible to detect batch effects in the data itself.
For this reason, it is critical that you balance your variables of interest across the batches. For example, if you process all of the tumor samples in one batch and all of the non-tumor samples in a separate batch, then tumor status is completely confounded with batch and the data is not worth analyzing. You would likely find a lot of signal, but you could not tell whether it came from the biology or from the batch, so none of it would be trustworthy.
However, if your experiment is properly balanced, you can remove the batch effects, which improves your power to detect the signal of interest.
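To make the difference between a confounded and a balanced design concrete, here is a minimal sketch using made-up sample annotations. The variable names (`status`, `batch_confounded`, `batch_balanced`) and the 8-sample layout are assumptions for illustration only, not data from the course.

```r
# Hypothetical annotations for 8 samples: 4 tumor, 4 normal
status <- rep(c("tumor", "normal"), each = 4)

# Confounded design: all tumor samples in batch b1, all normal samples in b2
batch_confounded <- rep(c("b1", "b2"), each = 4)
tab_confounded <- table(status, batch_confounded)
print(tab_confounded)

# Balanced design: each batch contains two tumor and two normal samples
batch_balanced <- rep(c("b1", "b2"), times = 4)
tab_balanced <- table(status, batch_balanced)
print(tab_balanced)
```

In the confounded table, two of the four cells are empty, so any tumor versus normal difference is indistinguishable from a batch difference. In the balanced table every cell is filled equally.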
3. Diagnosing technical batch effects
To investigate batch effects, you can use dimension reduction techniques like principal components analysis (PCA) and multidimensional scaling (MDS). These techniques reduce the representation of each sample from a vector of thousands of measurements to a vector whose length is at most the number of samples. The dimensions of this reduced vector are ordered by how much variation in the data they capture, starting with the largest source, and each one is orthogonal to (that is, uncorrelated with) the others. You will mostly focus on the first two dimensions because they capture the largest sources of variation and are convenient to visualize. You want to determine whether these sources of variation are correlated with your variables of interest or with batch effects.
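As a rough sketch of this idea, the following uses base R's `prcomp` on simulated data in which half of the samples receive an artificial batch shift. The data, the size of the shift, and all variable names are assumptions made up for illustration.

```r
set.seed(1)
n_genes <- 1000
n_samples <- 8

# Hypothetical batch indicator: samples alternate between two batches
batch <- rep(c(0, 1), times = 4)

# Simulated expression matrix (genes in rows, samples in columns)
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes, ncol = n_samples)

# Add an artificial batch shift to the first 500 genes in batch 1
x[1:500, batch == 1] <- x[1:500, batch == 1] + 3

# prcomp expects observations (samples) in rows, so transpose
pca <- prcomp(t(x))

# Each sample is now described by at most n_samples coordinates
dim(pca$x)

# Scores on different components are uncorrelated with each other
pc_cor <- cor(pca$x[, 1], pca$x[, 2])

# Check whether the first component tracks the simulated batch
batch_cor <- cor(pca$x[, 1], batch)

# Plot the first two components, colored by batch
plot(pca$x[, 1], pca$x[, 2], col = batch + 1, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Because the simulated batch shift is the largest source of variation here, the samples should split by batch along PC1.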
4. plotMDS
limma has the function `plotMDS` for performing multidimensional scaling. It has convenient defaults for genomics. By default it uses only the 500 most variable genes in the data (controlled by the `top` argument), which makes the code run much faster and avoids strange results caused by genes with little variation.
Here I plot the results from one of my own experiments. I pass the ExpressionSet and use the phenotype column "time" to label the points. I also set `gene.selection` to `"common"`, which makes `plotMDS` perform a more traditional principal components analysis.
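A call along these lines might look like the sketch below. It runs on a simulated matrix rather than the experiment described here, so the data and the `time` labels are placeholders. With an ExpressionSet named `eset` (a hypothetical name), the equivalent call would pass `eset` as the first argument and `pData(eset)[, "time"]` as the labels.

```r
library(limma)

set.seed(1)

# Simulated expression matrix standing in for real data (assumption)
x <- matrix(rnorm(1000 * 6), nrow = 1000,
            dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

# Hypothetical phenotype variable used to label the points
time <- rep(c("early", "late"), each = 3)

# gene.selection = "common" uses the same top genes for every pair of
# samples, which makes the plot a principal components analysis
mds <- plotMDS(x, labels = time, gene.selection = "common")

# With an ExpressionSet, the call would look like:
# plotMDS(eset, labels = pData(eset)[, "time"], gene.selection = "common")
```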
I expected the early and late samples to separate along the first PC on the x-axis, but surprisingly, the early and late samples separated along the second PC on the y-axis. This suggested that the largest source of variation in the experiment was due to a technical batch effect.
5. removeBatchEffect
Because I had designed the study so that the samples were balanced across the experimental batches, I was able to remove this unwanted variation using the limma function `removeBatchEffect`. This fits a linear model to each gene and subtracts the estimated batch and covariate effects from the data. Here I passed the discrete variable batch to the argument `batch` and the continuous variable RIN, a measure of RNA quality, to the argument `covariates`.
When I visualize the corrected data, the early and late samples are now separated by PC1 on the x-axis.
There is one caveat to this approach. Using `removeBatchEffect` is ideal for exploratory data analysis and especially for visualizations. However, for the actual statistical analysis, it is better to include the batch variable as a term in your design matrix, so that the model accounts for batch when it tests the effect of interest.
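Here is a minimal sketch of such a call on simulated data. The variables `time`, `batch`, and `rin` are invented stand-ins for the phenotype columns described above, and supplying `design` (so that the time effect is protected during correction) is my own addition rather than something the narration mentions.

```r
library(limma)

set.seed(1)
n_genes <- 500

# Hypothetical sample annotations, balanced across two batches
time  <- factor(rep(c("early", "late"), each = 4))
batch <- factor(rep(c("b1", "b2"), times = 4))
rin   <- runif(8, min = 7, max = 10)  # simulated RNA quality scores

# Simulated expression matrix with an additive shift in batch b2
x <- matrix(rnorm(n_genes * 8), nrow = n_genes)
x[, batch == "b2"] <- x[, batch == "b2"] + 2

# Design describing the effect to preserve (assumption: time is of interest)
design <- model.matrix(~ time)

# Subtract the estimated batch and RIN effects from every gene
x_corrected <- removeBatchEffect(x, batch = batch, covariates = rin,
                                 design = design)

# Difference in mean expression between batches, before and after
diff_before <- mean(x[, batch == "b2"]) - mean(x[, batch == "b1"])
diff_after  <- mean(x_corrected[, batch == "b2"]) -
  mean(x_corrected[, batch == "b1"])
```

The corrected matrix has the same dimensions as the input, so it can be passed straight to `plotMDS` to check whether the batch separation has gone.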
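As a hedged sketch of what including batch in the design matrix could look like, the following fits a limma model on simulated data with both time and batch as terms. The data and variable names are assumptions, and the formula `~ time + batch` is one reasonable choice rather than the only one.

```r
library(limma)

set.seed(1)

# Hypothetical sample annotations, balanced across two batches
time  <- factor(rep(c("early", "late"), each = 4))
batch <- factor(rep(c("b1", "b2"), times = 4))

# Simulated expression matrix standing in for real, uncorrected data
x <- matrix(rnorm(500 * 8), nrow = 500,
            dimnames = list(paste0("gene", 1:500), paste0("s", 1:8)))

# Batch enters the model as its own term alongside the variable of interest
design <- model.matrix(~ time + batch)
colnames(design)

# Fit the model on the uncorrected data and test the time effect
fit <- eBayes(lmFit(x, design))
results <- topTable(fit, coef = "timelate")
```

Note that the model is fit to the original data, not to the output of `removeBatchEffect`, so the batch adjustment is part of the model being tested.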
6. Olfactory stem cells
In the following exercises, you will check for batch effects in a data set of olfactory stem cells that received 7 treatments. The samples were processed in 4 batches, and the experiment was perfectly balanced. In other words, each batch had a sample from each of the 7 groups.
These data were generated by Osmond-McLeod and colleagues and were later used by Oytam and colleagues to develop the Bioconductor package Harman, which provides methods for batch correction.
7. Let's practice!
Now it's your turn to investigate batch effects.