Get startedGet started for free

The limma package

1. The limma package

Now you'll learn about the limma package for differential expression.

2. Advantages of the limma package

Using the limma package for differential expresssion has multiple clear advantages over writing your own bespoke code. First, testing thousands of genes individually would require writing a loop with boiler plate code, like counting variables, thus obscuring the intent of the analysis. Every limma function performs its action on every gene in the data set, making it easier for you to write the code and easier for others to understand your analysis. But it's not only more convenient for you, it's also better statistics! The loop in the code above treats every gene as a completely independent analysis. But because all the genes were measured in the same experiment, limma uses a statistical technique known as empirical Bayes to share information across the genes, which is especially helpful for studies with small sample sizes. Now empirical Bayes methods would deserve its own separate course, so I won't elaborate more, but know that not only are you getting convenience, but also sophisticated statistics. Lastly, limma has many functions for processing the data before and after performing the differential expression test. You'll learn about some of these in Chapter 3. To get started using the limma package, install it using BiocManager.

3. Specifying a linear model

Recall that the goal of the breast cancer study is to identify differentially expressed genes between ER-negative tumors in which the estrogren receptor protein is absent versus ER-positive tumors in which the estrogren receptor protein is present. Thus for each gene you need to fit the following linear model: The response variable Y is the expression level of the given gene, beta-not is the mean expression level of the gene in the ER-negative tumors, beta-one is the mean difference in expression level of the gene in the ER-positive tumors compared to the ER-negative tumors, the explanatory variable x-one indicates ER status, and epsilon models the random noise.

4. Specifying a linear model in R

You translate this statistical model to R code using the function `model.matrix` and the formula syntax you used to create boxplots. However, this time you only need to specify the explanatory variable. The result is called a design matrix, and limma will then apply this to every gene. Here I create a design matrix for the breast cancer data by including the column er as the explanatory variable. I don't need to surround it in quotes because I specify the source of the data is the phenotype data frame. Each column of the design matrix corresponds to a coefficient in the linear model. If the sample in a given row is modeled by this coefficient, then it has the value one, and zero otherwise. Looking at the first two samples, the intercept is 1 because it is 1 for all samples. The first sample is ER-negative and has zero for beta-one. Conversely, the second sample is ER-positive and has one. Summing the columns of the design matrix is a good sanity check. The number of samples modeled by the er coefficient should equal the number of ER positive samples, 209.

5. Testing with limma

Now that you have specified a design matrix, you can proceed with the standard limma pipeline. First load the package. Second fit the coefficients of the model with `lmFit` by passing it the ExpressionSet object and design matrix. Third calculate the t-statistics by passing it the fitted model object to the `eBayes` function. Lastly, count the number of genes with higher or lower expression in ER-positive tumors compared to ER-negative. There are 5,004 upregulated genes, 6,276 downregulated genes, and 11,003 genes whose expression level is not different between the tumors. Note that in the call to `decideTests`, I specified the coefficient "er" since it wouldn't be meaningful to test the intercept coefficient.

6. Let's practice!

Now it's your turn to build and test a model for the leukemia data.