1. Sequence quality
You have learned that short reads can be stored in different formats. The most used is the fastq format, which contains, both the sequenced bases and their quality values. Now you are going to learn new functions from ShortRead to assess the quality of your sequenced reads.
2. Quality scores - Phred table
FASTQ files, store quality scores to register error probabilities, in other words, the accuracy of the sequencing, as shown in this Phread qualities table.
Here you can see that scores equal to or above 30 are 99-point-9% accurate, hence are considered of good quality! Because only one base in 1000 might be wrong.
3. Encoding - Phred +33
Let's now have a detailed look at each encoding and the score that corresponds to it.
Phred scores are represented efficiently with encoding, as a single ASCII character.
Originally, they were used as Sanger FASTQ standard, and now are the de facto standard for representing sequencing quality in different platforms.
The encoding translated to scores usually ranges from 2 to 40. However, you can see higher scores.
You should also keep in mind that other encodings existed before this standard and varied depending on the sequencing platform.
4. fastq quality
fastq files encode quality scores on a class FastqQuality as a BStringSet using the function quality() you can obtain the Quality of a sequence.
For this example, you have Illumina sequences using Phred+33 standard.
5. Exploring quality encoding and scores
**Every** sequence analysis should start with a quality assessment.
Let's check the first read in the sequence using sread()
Then we can see the quality of each letter with quality() which will give us the encoding values
To transform the encoding into scores we can use the function PhredQuality() given the quality of the reads, then convert the pred quality pq into an Integer list to get numeric scores.
Remember, a score of 30 is considered of good quality as it means the accuracy of base call is 99-point-9%
From this example, with the exception of 3 base calls (with scores 29, 27, and 18), all of the other bases are of very good quality!
6. Quality assessment
The qa() function from ShortRead is super nifty because it gives you lots of summary assessments about your sequence file or files. qa() not only works with the files that you have already read, but it can also be used with files to be read given a directory path and a pattern.
The Shortreadqqa class gives an overview of read-counts, base calls, read quality scores, base quality, most frequent sequences, and their distribution. You can call each of these by name using a list syntax call with double square brackets, to get a summary of each evaluation.
Or for a bigger picture, use the function report() with browseURL)() to see a collection of graphs and tables in a browser.
7. Alphabet by cycle
Using accessors for ShortRead classes we can explore in detail the sequences. Let's have a look at the nucleotide frequency per cycle and some transformations for further analysis.
First, take a look at the alphabet of the short read object and remember the DNA Biostrings alphabet. Then use the function alphabetbyCycle() and you will get all letter frequencies per cycle.
You will subset only the first four rows as those are the nucleotides A, C, G, T, and then transpose them.
Optionally convert this into a tibble and add cycle numbers, and voila! You have a ready to visualize tibble, of nucleotides by cycle.
8. Are you excited?
Are you excited to use these new functions? Go ahead and explore sequence quality with coding and graphs! Plus don't forget to have lots of fun!