Pre-process the data

1. Pre-process the data

In this final chapter, you will combine everything you have learned to perform an end-to-end differential expression analysis.

2. Mechanism of doxorubicin-induced cardiotoxicity

Doxorubicin is a commonly prescribed cancer drug. Unfortunately, one of its side effects is cardiotoxicity. In their 2012 study, Zhang and colleagues tested the hypothesis that doxorubicin damages heart cells by binding to the protein topoisomerase-II beta, or Top2b. They performed a 2x2 factorial experiment. They used two types of mice: genetically normal wild type mice, and Top2b null mice, which had Top2b deleted specifically in their heart cells. They treated the mice with either doxorubicin or the control solution, PBS. If doxorubicin requires Top2b to exert its cardiotoxic effect, the Top2b null mice should not be affected by doxorubicin treatment. You will analyze the data to test this hypothesis. The data contains measurements for 29,532 genes and 12 mice, with 3 replicates for each combination of the two factors.
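To make the design concrete, here is a minimal sketch of a sample table for a 2x2 factorial experiment with 3 replicates per group. The variable names and level labels (`wt`, `top2b`, `pbs`, `dox`) are assumptions chosen for illustration, not necessarily those used in the course data.

```r
# Hypothetical sample table for a 2x2 factorial design (labels are illustrative)
genotype  <- c("wt", "top2b")   # wild type vs. Top2b null
treatment <- c("pbs", "dox")    # control solution vs. doxorubicin

# One row per mouse: every genotype/treatment combination, 3 replicates each
samples <- expand.grid(replicate = 1:3,
                       treatment = treatment,
                       genotype  = genotype,
                       stringsAsFactors = TRUE)

# Cross-tabulate the two factors to confirm the design is balanced
design_counts <- table(samples$genotype, samples$treatment)
print(design_counts)
```

Cross-tabulating the two factors is a quick way to confirm that each of the 4 groups has the same number of mice.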

3. Inspect the features

First you will need to pre-process the features, or genes. Using `plotDensities`, you can see that this data needs further processing before hypothesis testing. Because most of the data lies near zero with a very long right tail, the data set likely contains measurements for many genes that are not relevant in the mouse heart and should be removed. Note that rather than disabling the legend, I color the lines by the genotype variable and place the legend in the top right of the plot.
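The call described above might look roughly like the following sketch. It uses a simulated matrix of right-skewed intensities rather than the real data, and the object names (`x`, `genotype`) are assumptions. It relies on the `group` and `legend` arguments of limma's `plotDensities`.

```r
library(limma)

set.seed(1)
n_genes  <- 1000
genotype <- rep(c("wt", "top2b"), each = 6)  # hypothetical labels, one per mouse

# Simulated right-skewed intensities: genes in rows, 12 mice in columns
x <- matrix(rexp(n_genes * 12, rate = 1 / 500), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("mouse", 1:12)))

# One density curve per sample, colored by genotype, legend in the top right
plotDensities(x, group = genotype, legend = "topright")
```

On untransformed data like this, the curves pile up near zero with a long right tail, which is the pattern that signals more pre-processing is needed.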

4. Pre-processing steps

Recall the 3-step method for pre-processing the features: log transformation to view the entire distribution, quantile normalization to transform each sample to the same empirical distribution, and filtering to remove irrelevant genes.
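The three steps can be sketched as follows on simulated data. The mixture of low and high intensity genes, the per-sample scale factors, and the `cutoff` value are all assumptions made for illustration. With real data you would choose the cutoff by inspecting the density plot.

```r
library(limma)

set.seed(1)
# Simulated raw intensities: 600 weakly and 400 strongly expressed genes, 12 samples
gene_mean <- rep(c(20, 1000), times = c(600, 400))
raw <- matrix(rexp(1000 * 12, rate = 1 / gene_mean), nrow = 1000)
# Give each sample a different overall scale so normalization has something to fix
raw <- sweep(raw, 2, seq(0.5, 2, length.out = 12), `*`)

# Step 1: log transform to view the entire distribution
x_log <- log(raw)

# Step 2: quantile normalize so every sample has the same empirical distribution
x_norm <- normalizeBetweenArrays(x_log, method = "quantile")

# Step 3: filter out genes with low mean expression (cutoff is hypothetical)
cutoff <- 4.5
keep   <- rowMeans(x_norm) > cutoff
x_filt <- x_norm[keep, ]

dim(x_filt)
```

After step 2, sorting any two columns of `x_norm` gives the same values, which is what "the same empirical distribution" means in practice.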

5. Sanity check: Boxplot of Top2b

After the features have been pre-processed, you'll perform a sanity check. The Top2b null mice have had Top2b deleted from their heart cells, so before continuing you'll confirm that they have lower expression of Top2b using a boxplot. Recall from chapter 1 that you can create a boxplot using the formula notation. In this case, you plot the expression of Top2b as a function of the genotype.
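As a reminder of the formula notation, here is a minimal sketch with made-up expression values for a single gene. The vector names and the simulated values are assumptions. In the real analysis, the expression values would come from the Top2b row of the pre-processed data.

```r
set.seed(1)
# Hypothetical genotype labels for 12 mice
genotype <- factor(rep(c("wt", "top2b"), each = 6), levels = c("wt", "top2b"))

# Simulated log expression of one gene across the 12 mice (values are made up)
top2b_expr <- c(rnorm(6, mean = 8, sd = 0.3),   # wild type
                rnorm(6, mean = 5, sd = 0.3))   # Top2b null

# Formula notation: response ~ grouping variable
boxplot(top2b_expr ~ genotype, main = "Top2b", ylab = "log expression")

# A numeric companion to the plot: mean expression per genotype
group_means <- tapply(top2b_expr, genotype, mean)
print(group_means)
```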

6. Check sources of variation

Lastly, you'll check the sources of variation in the experiment using `plotMDS` to perform principal components analysis. Ideally the samples will cluster by their experimental group. If they don't, you will need to consult with the experimentalists about any potential batch effects. In a 2x2 factorial experiment, there are 4 groups. If the Top2b null mice are protected from the effect of doxorubicin, how many clusters of samples do you anticipate?
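A call along these lines is sketched below. The matrix is pure simulated noise, so no clustering is expected in this plot. The point is only the shape of the call. The group labels are assumptions, and setting `gene.selection = "common"` is what makes limma's `plotMDS` behave like a principal components analysis, since the same set of genes is used for every pairwise comparison.

```r
library(limma)

set.seed(1)
# Hypothetical labels for the 4 experimental groups, 3 mice each
group <- rep(c("wt.pbs", "wt.dox", "top2b.pbs", "top2b.dox"), each = 3)

# Simulated pre-processed expression: 500 genes x 12 samples of random noise
x <- matrix(rnorm(500 * 12), nrow = 500,
            dimnames = list(paste0("gene", 1:500), paste0("mouse", 1:12)))

# Label each sample by its group; use the same genes for all sample pairs
mds <- plotMDS(x, labels = group, gene.selection = "common")
```

Labeling the points by group instead of by sample name makes it easy to see at a glance whether replicates from the same group fall close together.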

7. Let's practice!

Now it's your turn to put this strategy into action.