
Introducing biology of genomic datasets

1. Introducing biology of genomic datasets

Let's introduce a bit of cell biology, talking about organisms, genomes, and the yeast genome, in particular.

2. Organisms

An organism is a complex structure of interconnected elements that together sustain the functioning of the whole being. Organism diversity is immense: unicellular or multicellular, with or without a nucleus, with different membranes and systems, different life cycles, and more. In bioinformatics, organisms are studied in detail by sequencing their genomes and dissecting the elements of those genomes to find interesting functions.

3. What is DNA, what is a genome?

All organisms have a genome that makes up what they are, and it dictates responses to external influences. A genome is the complete genetic material of an organism, stored mostly in the chromosomes. It is known as the blueprint of the living. A genome is made of long sequences of DNA, written in a four-letter alphabet: T, A, G, and C.

4. Genome elements

We are interested in locating and describing specific locations in a genome because this allows us to learn about diversity, evolution, hereditary changes, and more. To understand this better, we subdivide a genome. The information in a genome is written in the DNA alphabet. Think of a genome as a set of books, where each book is a chromosome. The number of chromosomes varies widely from genome to genome. Usually, chromosomes come in pairs, but multiple sets are very common too. Each chromosome holds ordered genetic sequences; think of them as the chapters in a book. To find specific genetic instructions, we look at genes. These are like the pages in a book, containing a recipe to make proteins. Some genes produce proteins and some don't; these are called coding and non-coding genes, respectively. Coding genes are expressed through proteins that are responsible for specific functions. Proteins are made in a two-step process: the DNA-to-RNA step is known as transcription, and the RNA-to-protein step is called translation.
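
To make the two-step process concrete, here is a minimal sketch in base R. The DNA sequence is made up for illustration, and the codon table is only a four-entry excerpt of the standard genetic code, so treat this as a toy model rather than a real translation tool.

```r
# Hypothetical coding-strand DNA sequence, made up for this example.
dna <- "ATGGCTTGGTAA"

# Transcription (simplified): the RNA carries the same letters as the
# coding strand, with uracil (U) in place of thymine (T).
rna <- chartr("T", "U", dna)

# Split the RNA into codons, i.e. consecutive groups of three bases.
starts <- seq(1, nchar(rna) - 2, by = 3)
codons <- substring(rna, starts, starts + 2)

# A deliberately tiny excerpt of the standard genetic code.
# "*" marks a stop codon.
codon_table <- c(AUG = "M", GCU = "A", UGG = "W", UAA = "*")

# Translation: look up the amino acid letter for each codon.
protein <- paste(codon_table[codons], collapse = "")

print(rna)      # "AUGGCUUGGUAA"
print(protein)  # "MAW*"
```

A real analysis would use the full 64-codon table and would also handle the strand and reading frame of the gene, which this sketch ignores.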

5. Yeast

As an example, we are going to study the genome of yeast, a single-celled microorganism and the fungus that people love. Yeast is used for fermentation and for the production of beer, bread, kefir, kombucha, and other foods, as well as for bioremediation. Yeast is a very well-studied organism: because it develops quickly, many experiments use it as a model.

6. Yeast genome

The yeast genome is a dataset available from UCSC, which we can load through the BSgenome package. We have picked this genome because it is relatively small. In the following exercises, you will find out more about it. For example, we can load a specific genome version and assign it to an object. The BSgenome package gives us access to many genome datasets. To get a list of the available BSgenome datasets, call the available.genomes() function.
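
As a sketch of what this could look like in code: the example below assumes the sacCer3 assembly, provided by the BSgenome.Scerevisiae.UCSC.sacCer3 data package, and the object name yeast. Both are choices made for this illustration, and both packages are assumed to be installed from Bioconductor already.

```r
# Assumes BSgenome and the yeast data package are installed from Bioconductor.
library(BSgenome)

# List the genome data packages that can be installed.
# This queries the Bioconductor repositories, so it needs internet access.
genomes <- available.genomes()
head(genomes)

# Load one specific version of the yeast genome (sacCer3 is an assumption
# made for this example) and assign it to an object.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeast <- BSgenome.Scerevisiae.UCSC.sacCer3

# Printing the object shows a short summary of the genome.
yeast
```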

7. Accessor functions

Then, using common accessor functions, you can learn more about the genome: for example, the number of chromosomes using length(), the names of the chromosomes using names(), and the length of each chromosome in DNA base pairs using seqlengths().
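
A short sketch of these accessors in use, again assuming the sacCer3 data package is installed and using the hypothetical object name yeast:

```r
# Assumes the yeast data package is installed from Bioconductor.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeast <- BSgenome.Scerevisiae.UCSC.sacCer3

# Number of chromosomes (sequences) in the genome object.
n_chr <- length(yeast)

# Names of the chromosomes.
chr_names <- names(yeast)

# Length of each chromosome in base pairs, as a named vector.
chr_lengths <- seqlengths(yeast)

print(n_chr)
print(chr_names)
print(chr_lengths)
```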

8. Get sequences

Specific genes or regions are interesting because of their functions. We can retrieve sections of a genome with the getSeq() function. Passing only the yeast genome to getSeq() returns all of the sequences in the genome. We can also pass further arguments to select sequences from specific chromosomes, such as chromosome M, and we can specify the locations of the subsequences to extract using the start, end, and width arguments. For example, setting end = 10 selects the first 10 base pairs of each chromosome in the genome.
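
The three calls described above might look like this. The sketch assumes the sacCer3 data package, in which chromosome M is named "chrM", and the variable names are chosen for this example only.

```r
# Assumes the yeast data package is installed from Bioconductor.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeast <- BSgenome.Scerevisiae.UCSC.sacCer3

# All sequences in the genome, one per chromosome.
all_seqs <- getSeq(yeast)

# Only the sequence of chromosome M.
chr_m <- getSeq(yeast, "chrM")

# The first 10 base pairs of every chromosome.
first_10 <- getSeq(yeast, end = 10)

print(first_10)
```

With end = 10 and no start given, extraction begins at the first base, so each returned sequence is 10 base pairs wide.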

9. Let's practice!

Now it's your turn to explore the yeast genome using functions from the BSgenome package.