1. Why are we interested in patterns?
Patterns are usually peculiar, interesting, and fun. Think about a sunflower head, the stripes on a zebra, or fingerprints. Patterns in biology are remarkable, and we can learn more about them using sequencing!
2. Sequence code
Sequence patterns in DNA help us find interesting things, such as sequence repeats, proteins and codons, poly-A tails, conserved sequences, binding sites, and more. Our goal in analyzing sequence patterns is to discover their occurrence frequency, periodicity, and length.
3. What can we find with patterns?
Sequence pattern matching helps answer common questions such as: where does a gene start, where does a protein end, which regions make a gene expressed or silent, which regions are conserved between organisms, and what is the overall genetic variation?
4. Pattern matching
The Biostrings package has a few search functions, which find all the occurrences of a pattern in a subject sequence. The pattern tends to be a short sequence and the subject, a longer sequence. Occurrences may allow the presence of mismatches, so-called fuzzy matching, or can be very stringent, where exact matches are required.
The matchPattern() function searches for one pattern in one single subject sequence.
On the other hand, vmatchPattern() searches for one pattern in multiple subject sequences, for example, when using sets. Each of these functions will return a different kind of object as a result, but the matches they report will be the same.
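To make the idea of exact and fuzzy matching concrete, here is a minimal plain-Python sketch. It is not the Biostrings implementation: the helper names `match_pattern` and `vmatch_pattern` and the `max_mismatch` argument are hypothetical stand-ins chosen for this illustration, and the sequences are made up.

```python
def match_pattern(pattern, subject, max_mismatch=0):
    """Return 1-based (start, end) positions where pattern occurs in subject.

    max_mismatch=0 means exact matching; larger values allow fuzzy matches.
    """
    hits = []
    width = len(pattern)
    for start in range(len(subject) - width + 1):
        window = subject[start:start + width]
        # Count the positions where the pattern and the window disagree.
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= max_mismatch:
            hits.append((start + 1, start + width))
    return hits


def vmatch_pattern(pattern, subjects, max_mismatch=0):
    """Search one pattern in several named subject sequences."""
    return {
        name: match_pattern(pattern, seq, max_mismatch)
        for name, seq in subjects.items()
    }


# Made-up example sequences, for illustration only.
subject = "TTACGTACCA"
exact = match_pattern("ACG", subject)                  # exact matches only
fuzzy = match_pattern("ACG", subject, max_mismatch=1)  # up to one mismatch
per_sequence = vmatch_pattern("ACG", {"seq1": subject, "seq2": "GGGG"})
```

The single-subject search returns one list of positions, while the multi-subject search returns one list per sequence. The hits for any given sequence are the same either way.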
5. Palindromes
Palindromes are sequences that read the same backwards as forwards. For example, "never odd or even" also reads "never odd or even" backwards.
In biology, palindromes occur at sites where important reactions take place, such as binding sites and sites cut by restriction enzymes.
Biostrings comes with a handy function called findPalindromes() to help find palindromes in single sequences, and identify these important sites.
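As a rough illustration of palindrome searching, here is a small plain-Python sketch. It is not how findPalindromes() is implemented, and the function names are hypothetical. It also assumes the usual convention for double-stranded DNA, where a palindrome is a sequence equal to its own reverse complement, such as the EcoRI restriction site GAATTC.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def is_text_palindrome(text):
    """Check an ordinary text palindrome, ignoring spaces and case."""
    letters = [c.lower() for c in text if c.isalnum()]
    return letters == letters[::-1]


def find_dna_palindromes(dna, min_length=4):
    """Return 1-based (start, end, sequence) for every substring of at
    least min_length bases that equals its own reverse complement."""
    hits = []
    for start in range(len(dna)):
        for end in range(start + min_length, len(dna) + 1):
            piece = dna[start:end]
            if piece == reverse_complement(piece):
                hits.append((start + 1, end, piece))
    return hits


# Made-up sequence containing the EcoRI site GAATTC.
text_check = is_text_palindrome("never odd or even")
sites = find_dna_palindromes("AGGAATTCGT", min_length=6)
```

This brute-force scan checks every substring, which is fine for short sequences but would be slow on a whole genome.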
6. Not new biology
This video discusses some biological concepts that may be new to you, which we won't cover in detail in this course. But you can always learn more with targeted reading.
The first is the genetic code, which is a table that describes which three RNA letters translate to one amino acid. The genetic code was first described by Nirenberg in 1963.
The second is the reading frame: how translation might differ according to the reading frame was first described by Streisinger in 1966.
The abstract introduces you to new terms and how a different sequence is translated depending on the start point.
7. Translation has six possibilities
This is a real example of how translation varies according to the start of the sequence, and how we can make sure to translate all possibilities.
From a single DNA string, there are 6 possible reading frames.
Three are on the positive strand and three on the negative strand. The negative strand is the reverse complement of the positive strand.
Because translation needs three bases for an amino acid, you get a completely different amino acid sequence, depending on where you start.
That is why, for translation, we move one base at a time. That is called a single-base sliding window.
As we can see, each DNA reading frame translates to a different amino acid sequence.
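The six reading frames can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Biostrings approach: it uses the standard genetic code written with DNA letters (T in place of U), the helper names are hypothetical, and the input sequence is made up.

```python
from itertools import product

# Standard genetic code, with codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the negative strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate all six reading frames by sliding the start one base
    at a time along the positive strand and along its reverse complement."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


# Made-up sequence, for illustration only.
frames = six_frames("ATGGCCTGA")
```

Each key in `frames` names one reading frame, so the six translations can be compared side by side.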
8. Conserved regions in the Zika virus
Now the super exciting part! Coming back to our Zika virus example, you'll be in charge of finding a very conserved sequence in the family of Flaviviruses, of which the Zika virus is a part. First, some facts: the Zika virus has a positive-strand genome. It can live in different host cells, for example, those of humans, monkeys, and mosquitoes. The members of the Flavivirus family share a common structure, which means their sequences are very similar. The virus structure has only 11 proteins. In the last exercise, you'll be in search of one of these proteins using what you have learned so far!
9. Let's practice finding patterns!
It's your turn to try finding patterns. Have fun!