
Sequence files

1. Introducing ShortRead

Hey, welcome back! In this chapter we will talk about two Bioconductor packages for exploring sequence data quality. The examples used in this chapter are plant sequence files.

2. Plant genomes

Plant genomes are usually big datasets, so today we are going to explore a small model genome, Arabidopsis thaliana. It was the first plant species to be completely sequenced, and it has a genome size of about 135 megabase pairs.

3. Sequencing companies

We are living in a time when large-scale DNA sequencing is used to answer biological questions about gene expression, mutations, hereditary conditions, and more. Since the cost of sequencing is steadily decreasing, the volume of data is steadily increasing. Various sequencing companies develop different sequencing technologies. In the next examples, we will work with Illumina sequences, as Illumina continues to cover about 50% of sequencing projects worldwide.

4. fastq vs fasta

How do we store sequences? We do so using text. There are two main text formats, fastq and fasta. The main difference is that fastq files include a quality encoding for each sequenced letter. Both formats store DNA or protein sequences together with sequence names. In detail, fastq files are the standard for storing large-scale sequencing data, also called high-throughput sequencing data. Each sequence read in a fastq file is described in four lines. Line 1 starts with an '@' sign followed by a sequence identifier or description. Line 2 is the raw sequence string. Line 3 starts with a '+' sign, optionally followed by the sequence identifier again. Finally, line 4 encodes the quality values of the sequence, with one encoded value per sequenced letter. Common file extensions are fastq or just fq. In its simplest form, a fasta file contains two lines per sequence read. The first line starts with a greater-than sign ('>') followed by a unique sequence identifier, and the second line is the raw sequence string. Common file extensions are fasta, fa, or seq.
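To make the four-line layout concrete, here is a minimal base R sketch that takes one fastq record apart and rewrites it as a fasta record. The read name, bases, and quality string are made up for illustration, and the quality decoding assumes the Phred+33 encoding, which is only one of the encodings you may meet in practice.

```r
# A made-up fastq record: identifier, bases, and qualities are illustrative only.
fastq_lines <- c(
  "@read1 hypothetical description",  # line 1: '@' plus identifier/description
  "ACGTAC",                           # line 2: raw sequence
  "+",                                # line 3: '+' (identifier may be repeated)
  "IIIH#5"                            # line 4: one quality character per base
)

# Pull the record apart by line position.
record <- list(
  id       = sub("^@", "", fastq_lines[1]),
  sequence = fastq_lines[2],
  quality  = fastq_lines[4]
)

# The quality string must be as long as the sequence.
stopifnot(nchar(record$sequence) == nchar(record$quality))

# ASSUMPTION: Phred+33 encoding, so score = ASCII code of the character - 33.
phred <- utf8ToInt(record$quality) - 33L
print(phred)

# The same read as a two-line fasta record: the quality line is simply dropped.
fasta_lines <- c(paste0(">", record$id), record$sequence)
print(fasta_lines)
```

Going from fastq to fasta loses the quality information, which is why the conversion only works in that direction.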

5. fasta

ShortRead provides us with readFasta(), which reads all FASTA-formatted files in a directory, dirPath, whose names match a pattern. It can read compressed or uncompressed files. It returns a single object of class ShortRead. This class stores and manipulates uniform-length short read sequences and their identifiers. Call methods() with class ShortRead to get a list of accessors. Lastly, writeFasta() writes an object to a single file given a file name. It can also compress on the fly.
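A short sketch of that round trip might look like the following. It assumes the ShortRead package is installed, and the two-record fasta file is a made-up stand-in written to a temporary directory so the example does not depend on any real data.

```r
library(ShortRead)  # Bioconductor package; assumed to be installed

# Hypothetical input: a tiny two-record fasta file in a temporary directory.
fasta_dir <- tempfile("fasta_demo_")
dir.create(fasta_dir)
writeLines(
  c(">seq1", "ACGTACGT", ">seq2", "TTGCAAGC"),
  file.path(fasta_dir, "example.fasta")
)

# Read every file in fasta_dir whose name matches the pattern.
fa <- readFasta(dirPath = fasta_dir, pattern = "\\.fasta$")

class(fa)           # the object is of class ShortRead
length(fa)          # number of reads
sread(fa)           # the sequences
ShortRead::id(fa)   # the identifiers

# List the methods, including accessors, defined for this class.
methods(class = "ShortRead")

# Write the object to a single new file.
out_file <- tempfile(fileext = ".fasta")
writeFasta(fa, file = out_file)
```

The accessor is written as ShortRead::id() here only to avoid clashes with other packages that also define a function called id().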

6. fastq

Similarly, readFastq() reads all FASTQ-formatted files in a directory, dirPath, whose names match a pattern. It works like readFasta(), with two additional arguments, qualityType and filter. Fastq files include the sequence quality on the fourth line of each read. The encoding of the quality depends on the technology and the version used. Again, use methods() to see the available accessors of this class. writeFastq() writes an object of class ShortReadQ to a single file. Additionally, it can append new sequences to an existing file and save a compressed version with the extension dot-gz. Remember to add the extension to the file name yourself.
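The fastq workflow can be sketched the same way. This again assumes ShortRead is installed and uses a made-up two-read file in a temporary directory. Setting qualityType = "FastqQuality" is a choice made for this toy data, not a general recommendation, since the right encoding depends on where your reads come from.

```r
library(ShortRead)  # Bioconductor package; assumed to be installed

# Hypothetical input: two made-up reads in a temporary directory.
fastq_dir <- tempfile("fastq_demo_")
dir.create(fastq_dir)
writeLines(
  c("@read1", "ACGTACGT", "+", "IIIIHH#5",
    "@read2", "TTGCAAGC", "+", "IIHHGG5#"),
  file.path(fastq_dir, "example.fastq")
)

# Read all matching files; the quality encoding is stated explicitly here.
fq <- readFastq(dirPath = fastq_dir, pattern = "\\.fastq$",
                qualityType = "FastqQuality")

class(fq)     # the object is of class ShortReadQ
sread(fq)     # the sequences
quality(fq)   # the per-base quality strings

# Write to a new uncompressed file, then append the same reads to it.
plain_file <- tempfile(fileext = ".fastq")
writeFastq(fq, file = plain_file, mode = "w", compress = FALSE)
writeFastq(fq, file = plain_file, mode = "a", compress = FALSE)
appended <- readFastq(plain_file)
length(appended)

# Write a compressed copy; the .gz extension is part of the name we supply.
gz_file <- tempfile(fileext = ".fastq.gz")
writeFastq(fq, file = gz_file, compress = TRUE)
```

Both output paths come from tempfile(), so the files do not exist before the first write with mode = "w".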

7. fastq sample

Sequence files can hold anything from one sequence to millions of sequences! Often, you will want to work with a subset of these sequences. When sampling, it is usually a good idea to set the seed so that you collect the same sample during re-runs. The function FastqSampler() draws a subset of a given size from a fastq file, in this example 500 reads. Then, yield() is the function that extracts the sample from the stored file. The time sampling takes might differ depending on the size of the file. After this, you can also explore other parameters and similar functions on your own, for example, length.
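As a rough illustration of the sampling steps, the sketch below builds a small made-up fastq file and draws a sample from it. It assumes ShortRead is installed. The file holds only 20 reads, so the sample size is 5 rather than the 500 used in the example on the slide, and the seed value is arbitrary.

```r
library(ShortRead)  # Bioconductor package; assumed to be installed

# Hypothetical input: 20 made-up reads written to a temporary fastq file.
fq_file <- tempfile(fileext = ".fastq")
n_reads <- 20
records <- unlist(lapply(seq_len(n_reads), function(i) {
  c(paste0("@read", i), "ACGTACGTAC", "+", "IIIIIIII#I")
}))
writeLines(records, fq_file)

# Fix the seed so that re-running the script draws the same sample.
set.seed(1234)

# Set up a sampler that will draw 5 reads from the file.
sampler <- FastqSampler(fq_file, n = 5)

# yield() performs the draw and returns the sampled reads.
reads <- yield(sampler)
close(sampler)

length(reads)            # number of sampled reads
ShortRead::id(reads)     # which reads were drawn
```

On a real file with millions of reads, the call to yield() is the step that takes time, because the file has to be scanned to draw the sample.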

8. You are ready!

Time to put this into practice. You are ready to work with real sequence files!