1. Unsupervised clustering analyses
Now that we have our normalized counts, we can continue with the differential expression analysis workflow.
2. Unsupervised clustering analyses
With our counts normalized for library size, we can now compare the counts between the different samples. To assess the quality of our experiment, we can explore how similar the samples are to each other with regard to gene expression. To do this, we use visualization methods for unsupervised clustering analyses, including hierarchical clustering heatmaps and principal component analysis (PCA).
We perform these QC methods to get an idea of how similar the biological replicates are to each other and to identify outlier samples and major sources of variation present in the dataset.
3. Unsupervised clustering analyses: log transformation
When using these visualization methods, we should first log transform the normalized counts to improve the visualization of the clustering. For RNA-Seq data, DESeq2 provides a variance stabilizing transformation (VST), which is a log-like transformation that moderates the variance across the mean.
We can transform the normalized counts by using the DESeq2 vst() function on the DESeq2 object. The blind=TRUE argument specifies that the transformation should be blind to the sample information given in the design formula; this argument should be specified when performing quality assessment.
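As a minimal sketch of this step, the code below applies vst() with blind = TRUE. It assumes the DESeq2 package is installed, and the object name dds and the simulated dataset from makeExampleDESeqDataSet() are stand-ins for illustration, not the course data.

```r
# Sketch only: `dds` is a simulated stand-in for the course's DESeq2 object.
library(DESeq2)

set.seed(1)
# Simulated dataset: 2000 genes x 6 samples
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# Variance stabilizing transformation, blind to the design formula (QC setting)
vsd <- vst(dds, blind = TRUE)

# The result holds one transformed value per gene and sample
class(vsd)
dim(vsd)
```

The transformed object keeps the same genes and samples as the input, so it can be passed straight to the clustering steps.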
4. Hierarchical clustering with correlation heatmaps
Hierarchical clustering with heatmaps is used to assess the similarity in gene expression between the different samples in a dataset. This technique is used to explore how similar replicates are to each other and whether the samples belonging to different sample groups cluster separately.
The heatmap is created by using the gene expression correlation values for all pairwise combinations of samples in the dataset, with the value 1 being perfect correlation.
The hierarchical tree shows which samples are more similar to each other and the colors in the heatmap depict the correlation values. We expect the biological replicates to cluster together and sample conditions to cluster apart.
Since the majority of genes should not be differentially expressed, samples should generally have high correlations with each other. Samples with correlation values below 0.8 may require further investigation to determine whether these samples are outliers or have contamination.
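To make the threshold check concrete, here is a small sketch in base R. The sample names and correlation values are hypothetical and chosen only for illustration; counting low-correlation pairs per sample is one possible heuristic, not a DESeq2 feature.

```r
# Hypothetical pairwise correlations for four samples (illustrative values only)
samples <- c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
cor_mat <- matrix(
  c(1.00, 0.98, 0.93, 0.75,
    0.98, 1.00, 0.94, 0.76,
    0.93, 0.94, 1.00, 0.78,
    0.75, 0.76, 0.78, 1.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(samples, samples)
)

# Ignore the diagonal, which is always 1 (each sample against itself)
off_diag <- cor_mat
diag(off_diag) <- NA

# Number of other samples each sample correlates with below the threshold
threshold <- 0.8
low_counts <- rowSums(off_diag < threshold, na.rm = TRUE)
low_counts

# Samples that fall below the threshold against every other sample
flagged <- names(low_counts)[low_counts == ncol(cor_mat) - 1]
flagged  # "trt_2"
```

A sample that is low against every other sample, like trt_2 here, is the kind of candidate that deserves a closer look.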
5. Hierarchical clustering with correlation heatmaps
To create a hierarchical heatmap, we are going to extract the VST-transformed normalized counts as a matrix from the vsd object using the assay() function.
Then, we can compute the pairwise correlation values between each pair of samples using the cor() function. Using View() we can view the correlation values between each of the sample pairs.
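The correlation step can be sketched in base R as follows. The matrix vsd_mat is simulated here as a stand-in for the output of assay(vsd), and the sample names are hypothetical; in the real workflow the matrix would come from the transformed DESeq2 object.

```r
set.seed(42)

# Stand-in for assay(vsd): 500 genes x 4 samples of transformed expression values
gene_level <- rnorm(500, mean = 8, sd = 2)  # shared per-gene expression level
vsd_mat <- sapply(1:4, function(i) gene_level + rnorm(500, sd = 0.3))
colnames(vsd_mat) <- c("ctrl_1", "ctrl_2", "trt_1", "trt_2")

# cor() correlates the columns of a matrix, so this gives sample-by-sample values
vsd_cor <- cor(vsd_mat)

# Square, symmetric matrix with 1 on the diagonal
round(vsd_cor, 3)
```

Because genes are in rows and samples in columns, no transposition is needed before calling cor().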
6. Hierarchical clustering with correlation heatmaps
After generating the correlation values, we can use the pheatmap package to create the heatmap.
The annotation argument selects which factors in the metadata to include as annotation bars. We use the select() function from the dplyr package to select the condition column in the wildtype metadata.
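A minimal sketch of the plotting call, assuming the pheatmap and dplyr packages are installed. The correlation values, the sample names, the batch column, and the object names vsd_cor and metadata are hypothetical placeholders for the course objects.

```r
library(pheatmap)
library(dplyr)

# Hypothetical correlation matrix (illustrative values only)
samples <- c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
vsd_cor <- matrix(
  c(1.00, 0.99, 0.95, 0.94,
    0.99, 1.00, 0.95, 0.95,
    0.95, 0.95, 1.00, 0.98,
    0.94, 0.95, 0.98, 1.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(samples, samples)
)

# Hypothetical metadata; row names must match the sample names in the matrix
metadata <- data.frame(
  condition = c("ctrl", "ctrl", "trt", "trt"),
  batch = c("b1", "b2", "b1", "b2"),
  row.names = samples
)

# Keep only the condition column for the annotation bar
annotation_df <- select(metadata, condition)

# Clustered heatmap of the correlations with a condition annotation bar
p <- pheatmap(vsd_cor, annotation = annotation_df)
```

Passing the whole metadata data frame instead of a single column would draw one annotation bar per factor, which helps when looking for other sources of variation.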
7. Hierarchical clustering with correlation heatmaps
The output from the heatmap shows that the biological replicates cluster together and the conditions separate. This is encouraging, since the genes that are differentially expressed between our conditions are likely to be driving this separation.
Also, all correlation values are high, as expected, with no outlier samples.
If our replicates did not cluster as expected, we could plot the heatmap with all of the metadata and see whether any other factor corresponds to the separation of the samples. If so, we would want to see whether we get similar results with the principal component analysis, which we will be covering in the next lesson. If both methods identify the factor, we can account for it in the DESeq2 model.
If we identify an outlier, we would also want to check the PCA. If the outlier appears with both methods, we could investigate that sample further and decide whether to remove it from the analysis.
8. Let's practice!
Now it's your turn to explore these quality metrics.