
Sequence handling

1. Sequence handling

As you'll begin to discover, Biostrings has many sophisticated string handling utilities for sequence analysis. In this video, we'll go on a little tour of these functionalities and continue using the Zika virus genome sequence.

2. Single vs. Set

In the previous exercises, you've been using DNAStrings and DNAStringSets. As a recap, any XString holds one single sequence from a predefined alphabet. However, when we want to store and handle multiple sequences or collections, we'll use an XStringSet. Remember that sets can hold sequences of varying lengths.
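As a quick illustration of the difference, here is a minimal sketch using made-up toy sequences (single_seq and seq_set are hypothetical names, not objects from the course):

```r
library(Biostrings)

# A DNAString holds exactly one sequence over the DNA alphabet
single_seq <- DNAString("ACGTTGCA")

# A DNAStringSet holds several sequences, which may differ in length
seq_set <- DNAStringSet(c(a = "ACGT", b = "TTGACCA", c = "GG"))

length(single_seq)  # number of letters in the single sequence
length(seq_set)     # number of sequences in the set
width(seq_set)      # number of letters in each sequence of the set
```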

3. Create a StringSet and collate it

How do we go from a set to a string, and vice versa? First, we read a sequence file with the function readDNAStringSet(). Notice that the resulting object is like a list with a length of 1, because it contains only one sequence, and its width is the total number of letters, or bases, in that sequence. Then, to convert a StringSet into a single string, use the unlist() function to collate the elements. In the example, the resulting DNAString has a length of 10794 characters, but DNAStrings have no width.
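The round trip from file to set to single string can be sketched as follows. Since the Zika genome file isn't available here, this sketch writes a tiny made-up FASTA file to a temporary path; fasta_path, toy_set and toy_seq are hypothetical names used only for illustration.

```r
library(Biostrings)

# Write a tiny FASTA file standing in for the real genome file (made-up content)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_seq", "ACGTACGTAC", "GGTTAACC"), fasta_path)

# Reading a FASTA file always gives a set, even for a single record
toy_set <- readDNAStringSet(fasta_path)
length(toy_set)  # number of sequences in the set
width(toy_set)   # number of letters in each sequence

# Collate the set into one DNAString
toy_seq <- unlist(toy_set)
length(toy_seq)  # for a DNAString, length is the number of letters
```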

4. From a single sequence to a set

If you want to construct a set from a single sequence, use the function DNAStringSet() and specify the sequence, here zikaVirus_seq, along with either the start and end or the start and width of each subsequence, given as numeric vectors. In the example, zikaSet has three subsequences, each with a width of 100 letters, specified by start and end.
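Here is a minimal sketch of both ways of specifying the subsequences, using a short made-up sequence instead of zikaVirus_seq (toy_seq, by_end and by_width are hypothetical names):

```r
library(Biostrings)

# A 16-letter made-up sequence standing in for the genome
toy_seq <- DNAString("AAAACCCCGGGGTTTT")

# Three subsequences defined by start and end positions
by_end <- DNAStringSet(toy_seq, start = c(1, 5, 9), end = c(4, 8, 12))

# The same three subsequences defined by start and width
by_width <- DNAStringSet(toy_seq, start = c(1, 5, 9), width = c(4, 4, 4))

by_end
width(by_end)  # each subsequence is 4 letters wide
```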

5. Complement sequence

Did you notice that whenever we have been using DNAStrings or sets, they hold one single strand instead of a double-stranded DNA sequence? This is because we can computationally derive the complement of a sequence: A is always paired with T, and G is always paired with C, so we don't need to store the sequence of both strands. We can generate the complement sequence when needed using the complement() function.
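A small sketch of the pairing rule on a made-up five-letter sequence (toy_seq is a hypothetical name):

```r
library(Biostrings)

toy_seq <- DNAString("AAGCT")

# Each base is swapped for its partner: A <-> T and G <-> C
toy_comp <- complement(toy_seq)
toy_comp  # TTCGA
```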

6. Rev a sequence

rev() is a function from base R, but it is frequently used with Biostrings. For demonstration, this example uses zikaShortSet, which has only 2 sequences (seq1 and seq2), each having 18 letters. Calling rev() on this set reverses the order of the sequences, so the last one moves to the top. We can use this function with any string. Reordering all of our sequences at once like this is useful mainly when building a genome reference.
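To see the reordering, here is a minimal sketch with a made-up two-sequence set (toy_short_set is a hypothetical stand-in for zikaShortSet, with shorter sequences):

```r
library(Biostrings)

toy_short_set <- DNAStringSet(c(seq1 = "AAACCC", seq2 = "GGGTTT"))

# rev() flips the order of the elements; the letters inside each one are untouched
toy_rev <- rev(toy_short_set)
names(toy_rev)  # seq2 now comes before seq1
toy_rev
```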

7. Reverse a sequence

The reverse() function from IRanges reverses each sequence in the set from right to left. Combined with the complement, this is how we can generate the opposite strand of a sequence.
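In contrast to rev(), reverse() keeps the sequences in the same order and flips the letters inside each one. A minimal sketch on a made-up set (toy_short_set is a hypothetical name):

```r
library(Biostrings)

toy_short_set <- DNAStringSet(c(seq1 = "AAACCC", seq2 = "GGGTTT"))

# reverse() reads every sequence from right to left; element order is unchanged
toy_reversed <- reverse(toy_short_set)
toy_reversed  # seq1 becomes CCCAAA, seq2 becomes TTTGGG
```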

8. Reverse complement

Connecting what we have learned so far, we also have the reverseComplement() function, which is equivalent to reverse and complement in one step. This function is useful for both DNA and RNA strings. The advantage of using the reverseComplement() function is that it's faster and more memory efficient, which really matters when dealing with large sets and sequences.
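The equivalence can be checked on small made-up inputs. This is only a sketch (toy_short_set, one_step, two_steps and toy_rna are hypothetical names), and it does not measure speed or memory use:

```r
library(Biostrings)

toy_short_set <- DNAStringSet(c(seq1 = "AAACCC", seq2 = "GGGTTT"))

# One step versus two steps: both give the same sequences
one_step  <- reverseComplement(toy_short_set)
two_steps <- reverse(complement(toy_short_set))
all(as.character(one_step) == as.character(two_steps))

# It also works on RNA, where A pairs with U
toy_rna <- RNAString("AUGC")
reverseComplement(toy_rna)  # GCAU
```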

9. Recap

This table summarizes the functions we've discussed and shows which functions are specific to each string container. unlist() is used with sets, to collate the elements into a single sequence. The meaning of length() depends on the container: it is the number of letters for a single string and the number of sequences for a set. width() is only used for sets and gives us the number of characters per sequence. complement() returns the paired strand of a given sequence. rev() will act as reverse on a single sequence or will reorder a set from bottom to top. reverse() flips the letters of a sequence, or of each sequence in a set, from right to left. Finally, reverseComplement() is an efficient function that combines reverse and complement.

10. Let's practice sequence handling!

Now it's your turn to practice sequence handling with Biostrings using all these functions!