1. Normalizing and filtering
Now that you are familiar with the limma pipeline, it's time to work with messier data.
Thus far you haven't had to worry about the state of the input data because you've been given analysis-ready data. Most of the time you won't be so lucky.
2. Pre-processing steps
The most thorough pre-processing of a data set will require domain-specific knowledge of the organism being studied and the technique used for measurement. To learn about pre-processing specific data sources, check out the Bioconductor workflow courses on DataCamp. In this lesson, you'll learn a generic strategy for pre-processing genomic data that can be widely applied, and should be sufficient for any first-pass analysis of a new data set. The 3 steps are log transformation, quantile normalization, and filtering.
3. Visualization
But before learning about these steps, you first need to learn how to visualize the distribution of gene expression levels for all your samples. This is critical for diagnosing the current state of the data and determining whether any additional pre-processing steps need to be applied.
To do this, you will use the limma function `plotDensities`. A density plot is similar to a histogram, but the data is smoothed to be continuous. This facilitates visualizing many samples at once, which wouldn't be possible with histograms.
Its first argument is the ExpressionSet object. When you have many samples, you'll want to disable the automatic legend since it may obscure the plot.
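A minimal sketch of that call, assuming your ExpressionSet is stored in an object named `eset` (a hypothetical name used here for illustration):

```r
library(limma)
library(Biobase)  # provides the ExpressionSet class and exprs()

# Plot the density of expression levels for every sample;
# legend = FALSE suppresses the automatic legend, which can obscure the plot
plotDensities(eset, legend = FALSE)
```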
From this plot of the raw Arabidopsis data, you can't see much at all because all the density lies near zero.
This is because while you have measured most of the genes in the genome, only a subset of these genes is relevant to the system being studied. When many of the genes are measured at almost zero and a subset has very high levels, you get a very right-skewed distribution.
4. Log transform
To better visualize the entire distribution, you can log transform the data. This has the practical effect of increasing the distance between the small measurements and decreasing the distance between the large measurements.
As the example here demonstrates, on the log scale the difference between 100 and 1 is identical to the difference between .1 and .001.
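You can verify this quickly in R. Using the natural log that `log` computes by default, both pairs differ by the same factor of 100, so the log-scale differences are equal:

```r
log(100) - log(1)      # 4.60517
log(0.1) - log(0.001)  # 4.60517
```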
The R function is named `log`, and by default it computes the natural logarithm. To apply it, you pass it the expression matrix using `exprs`, and update the ExpressionSet object by re-assigning the result to the expression matrix.
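A sketch of that step, again assuming the ExpressionSet is named `eset`:

```r
# Replace the expression matrix with its (natural) log-transformed values
exprs(eset) <- log(exprs(eset))
```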
Plotting again, you can now see the distributions much better.
5. Quantile normalize
However, the distributions are not the same across the samples. As you learned in the first lesson of this course, genomic techniques measure relative abundance. Thus these large-scale differences are not meaningful, but arise from technical artifacts inherent to all genomic techniques. To remedy this, you can perform quantile normalization with the limma function `normalizeBetweenArrays`. This converts each sample to have the same distribution, based on quantiles computed empirically as the average of the quantiles across all the samples.
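A sketch of the normalization step, assuming the same hypothetical `eset` object and passing `method = "quantile"` explicitly:

```r
# Quantile normalize so that every sample has the same empirical distribution
exprs(eset) <- normalizeBetweenArrays(exprs(eset), method = "quantile")

# Check the result: the per-sample densities should now overlap
plotDensities(eset, legend = FALSE)
```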
6. Filter genes
Lastly, those lowly expressed genes are still creating a right-skew in the data. One option to remove them is to choose a cutoff that excludes the peak of genes with low expression levels, in this case estimated to be around 5.
You can determine the genes to remove by calculating the mean of each row of the expression matrix with `rowMeans` and checking if it is greater than 5. Here this is saved as the logical vector named `keep`. Subsetting the rows of the ExpressionSet and re-visualizing reveals the nicely distributed data ready for analysis.
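A sketch of the filtering step, using the cutoff of 5 estimated from the plot and the same hypothetical `eset`:

```r
# Identify genes whose mean log expression exceeds the cutoff
keep <- rowMeans(exprs(eset)) > 5
sum(keep)  # number of genes retained

# Subset the ExpressionSet to the retained genes and re-visualize
eset <- eset[keep, ]
plotDensities(eset, legend = FALSE)
```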
7. Let's practice!
Now it's your turn to pre-process the raw data from the Populus study.