Introduction to Biostrings

1. Introduction to Biostrings

Bioconductor is all about handling biological datasets in the most efficient way. As you get more familiar with your biological projects and experiments, you'll begin to notice how big datasets can be.

2. Biostrings

The Biostrings package implements algorithms for fast manipulation of large biological sequences. It is so important that more than 200 Bioconductor packages depend on it, which puts Biostrings in the top 5% of Bioconductor downloads. It can be installed with the same install code we've seen previously.
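As a reminder, installation might look like the sketch below. It assumes the current BiocManager workflow; the install code referred to above is not shown in this transcript and may differ (older Bioconductor releases used a different installer).

```r
# Assumption: BiocManager is the installer in use; adjust if your setup differs.
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("Biostrings")
}

# Load the package for the current session
library(Biostrings)
```

The `requireNamespace()` checks make the snippet safe to rerun, since nothing is reinstalled when the package is already present.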

3. Biological string containers

Biological datasets are represented by characters, and these sequences can be extremely large. Biostrings is a useful package because it implements memory-efficient containers, especially for subsetting and matching. Also, these containers can have subclasses. For example, BString, short for Big String, can store one long sequence of arbitrary characters.
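A small sketch of what subsetting and matching look like on these containers. The example strings and variable names are made up for illustration; `BString()`, `DNAString()`, `subseq()`, `matchPattern()` and `countPattern()` are Biostrings functions.

```r
library(Biostrings)

# A BString accepts any characters, not only a biological alphabet
big <- BString("Any characters are allowed in a BString.")

# Subsetting: extract letters 5 to 14 without copying the whole sequence
piece <- subseq(big, start = 5, end = 14)
as.character(piece)  # "characters"

# Matching: find where a short pattern occurs in a DNA sequence
dna_seq <- DNAString("ATGATCTCGTAA")  # hypothetical example sequence
hits <- matchPattern("GAT", dna_seq)
start(hits)                    # 3
countPattern("GAT", dna_seq)   # 1
```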

4. Strings vs. Sets

The Biostrings package implements two generic containers, also known as virtual classes: XString and XStringSet, from which other subclasses inherit. Any XString or one of its subclasses will hold one single sequence of a predefined alphabet. When you want to store and handle multiple sequences or collections, even if they have varying lengths, you can use an XStringSet.
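To make the difference concrete, here is a minimal sketch using the DNA subclasses. The sequences and names are invented for illustration.

```r
library(Biostrings)

# One single sequence: a DNAString is an XString subclass
single_seq <- DNAString("ATGATCTCGTAA")

# Several sequences of varying lengths: a DNAStringSet is an XStringSet subclass
seq_set <- DNAStringSet(c(first = "ATGATC", second = "TCGTAAGG", third = "ACG"))

length(single_seq)  # 12, the number of letters in the sequence
length(seq_set)     # 3, the number of sequences in the set
width(seq_set)      # 6 8 3, the length of each sequence

is(single_seq, "XString")   # TRUE
is(seq_set, "XStringSet")   # TRUE
```

Note that `length()` means different things for the two containers: letters for a single string, sequences for a set.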

5. showClass()

To learn more about these classes and how they connect to one another, use the showClass() function, passing the class name as a string.
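The slide example is not part of this transcript, so the following is only a sketch of how such a call might look. Which classes to inspect is my choice, not necessarily the course's.

```r
library(Biostrings)

# Print the slots, superclasses and known subclasses of each class
showClass("XString")
showClass("XStringSet")
showClass("BString")

# The two generic containers are virtual, so they cannot be instantiated directly
isVirtualClass("XString")      # TRUE
isVirtualClass("XStringSet")   # TRUE

# Concrete subclasses inherit from them
extends("DNAString", "XString")        # TRUE
extends("DNAStringSet", "XStringSet")  # TRUE
```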

6. Biostring alphabets

Biostrings containers are biology-oriented and use a predefined alphabet for storing a DNA sequence, an RNA sequence, or a sequence of amino acids. The DNA_BASES alphabet has the four bases (A, C, G and T). RNA_BASES replaces the T with U, and AA_STANDARD contains the 20 amino acid letters, each of which is encoded by three consecutive RNA bases. In addition, Biostrings alphabets are based on two code representations, IUPAC_CODE_MAP and AMINO_ACID_CODE, which contain the bases plus extra characters and symbols. It is important to know these code representations so we know which kind of string we're using or need to use depending on the sequence alphabet.
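These alphabets are plain objects exported by the package, so they can be printed and inspected directly. A short sketch:

```r
library(Biostrings)

# The basic alphabets
DNA_BASES     # "A" "C" "G" "T"
RNA_BASES     # "A" "C" "G" "U"
AA_STANDARD   # the 20 standard amino acid letters

# The extended alphabets add ambiguity codes and special symbols
DNA_ALPHABET
RNA_ALPHABET
AA_ALPHABET

# The code maps are named character vectors
IUPAC_CODE_MAP["N"]    # "ACGT": N stands for any of the four bases
AMINO_ACID_CODE["M"]   # "Met": one-letter to three-letter amino acid code
```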

7. Transcription and translation

Now that we know the alphabets, let's talk about the two processes that convert sequences from one alphabet to the other. First, a double-stranded DNA gets split. This single strand gets transcribed into RNA, complementing each base, but remember, RNA uses U instead of T. Every three RNA bases then translate to a new amino acid. This translation follows the genetic code table to produce new molecules.

8. Transcription DNA to RNA

Using Biostrings, we start with a short DNA sequence saved in a DNAString object. Transcription is the process in which a particular segment of DNA is copied into RNA. Passing dna_seq to RNAString() changes all of the T's in dna_seq to U's in rna_seq, keeping the same sequence length. We could also begin with a Set if we wanted to transcribe multiple sequences at the same time.
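A sketch of this step follows. The transcript does not show the actual sequence, so "ATGATCTCGTAA" is an assumption, chosen because it is consistent with the MIS* result mentioned in the translation step. The names dna_set and rna_set are likewise illustrative.

```r
library(Biostrings)

# Assumed example sequence (not shown in the transcript)
dna_seq <- DNAString("ATGATCTCGTAA")

# Transcription: every T becomes a U, the length is unchanged
rna_seq <- RNAString(dna_seq)
as.character(rna_seq)                # "AUGAUCUCGUAA"
length(dna_seq) == length(rna_seq)   # TRUE

# Starting from a Set transcribes several sequences at once
dna_set <- DNAStringSet(c("ATGATC", "TCGTAA"))
rna_set <- RNAStringSet(dna_set)
as.character(rna_set)                # "AUGAUC" "UCGUAA"
```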

9. Translation RNA to amino acids

To translate RNA sequences into amino acid sequences, we need to apply the translate() function to the three-base codons stored in rna_seq. In the example, rna_seq is translated into MIS*, where every three RNA bases return one amino acid and the * marks a stop codon. Hence, translation always returns a shorter sequence.
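A minimal sketch of the translation step, assuming rna_seq holds "AUGAUCUCGUAA" (the sequence itself is not shown in the transcript, but it is consistent with the MIS* result):

```r
library(Biostrings)

# Assumed example sequence: four codons AUG, AUC, UCG, UAA
rna_seq <- RNAString("AUGAUCUCGUAA")

# Translation reads the sequence three bases at a time
aa_seq <- translate(rna_seq)
as.character(aa_seq)   # "MIS*"

# Twelve bases give four amino acid letters
length(rna_seq)        # 12
length(aa_seq)         # 4
```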

10. Shortcut translate DNA to amino acids

Transcription and translation are two separate processes in real life. But in code, there is a shortcut. The translate() function also accepts DNA strings: it automatically transcribes them to RNA before translating the sequence to amino acids, giving the same result.
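The shortcut can be checked against the two-step route. As before, the sequence is an assumed example, and aa_direct and aa_two_step are illustrative names.

```r
library(Biostrings)

# Assumed example sequence (not shown in the transcript)
dna_seq <- DNAString("ATGATCTCGTAA")

# Shortcut: translate the DNA directly
aa_direct <- translate(dna_seq)

# Two-step route: transcribe first, then translate
aa_two_step <- translate(RNAString(dna_seq))

as.character(aa_direct)                                # "MIS*"
as.character(aa_direct) == as.character(aa_two_step)   # TRUE
```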

11. The Zika virus

For this chapter, you will use the Zika virus genome to interact with the Biostrings package. The Zika virus genome is very small, containing about 10 thousand base pairs. A virus needs a host to live in, and the Zika virus is common in tropical areas around the world, spreading through mosquitoes or blood.

12. Let's practice with the Zika virus!

So, let's now analyze the virus genome sequence with the help of the Biostrings package. Have fun!