Manipulating collections of GRanges

1. Manipulating collections of GRanges

Let's now explore GenomicRanges functions, some of which return a collection of GRanges objects called a GRangesList.

2. GRangesList

The GRangesList class is a container that is particularly efficient for storing a large collection of GRanges objects. To construct one of these special lists, you can use the function as() and give it a list to be converted into a GRangesList. You can also create a GRangesList by passing multiple GRanges objects to GRangesList(). Conversely, to convert back to a single GRanges, use the function unlist(). Finally, to find useful accessors, call the methods() function with class = "GRangesList", with the class name in quotes.
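As a minimal sketch of the construction and conversion steps just described, the example below builds a GRangesList both ways and flattens it again. The object names and coordinates are made up for illustration and do not come from the lesson's slides.

```r
library(GenomicRanges)

# Two small, made-up GRanges objects (coordinates are illustrative only)
gr1 <- GRanges(seqnames = "chr1",
               ranges = IRanges(start = c(1, 50), end = c(20, 80)))
gr2 <- GRanges(seqnames = "chr2",
               ranges = IRanges(start = 100, end = 150))

# Option 1: coerce an ordinary list of GRanges objects
grl_from_list <- as(list(geneA = gr1, geneB = gr2), "GRangesList")

# Option 2: pass the GRanges objects directly to the constructor
grl <- GRangesList(geneA = gr1, geneB = gr2)

length(grl)        # number of list elements
elementNROWS(grl)  # number of ranges inside each element

# Flatten back to a single GRanges object
flat <- unlist(grl)

# List the methods available for the class
methods(class = "GRangesList")
```

Naming the list elements is optional, but it makes it easier to pull out one element later with grl[["geneA"]].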

3. When to use lists?

You might ask yourself, why would we use these lists? A GRangesList serves to store compound features of a larger object, on which you can then perform operations. Some examples of GRangesLists are: transcripts by gene, exons by transcript, read alignments, and sliding windows.

4. Break a region into smaller regions

Sliding windows are useful for splitting a GRanges object into sub-elements. The slidingWindows() function takes width and step parameters. Width is the number of bases in each new range, and step is the distance between the starts of consecutive ranges. The function returns a GRangesList. In the example, each gene has been split into new ranges of width 20,000 bases whose starts are 10,000 bases apart because of the step. Consecutive ranges overlap by 10,000 bases, which is width minus step. In most cases, the last range will be shorter.
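To make the width and step arithmetic concrete, here is a small sketch using slidingWindows() from GenomicRanges. The 45,000-base "gene" is a made-up range chosen so that the last window comes out shorter. It is not the gene shown on the slide.

```r
library(GenomicRanges)

# A made-up 45,000-base "gene" (coordinates are illustrative only)
gene <- GRanges(seqnames = "chrX",
                ranges = IRanges(start = 1, end = 45000))

# Windows of 20,000 bases whose starts are 10,000 bases apart
windows <- slidingWindows(gene, width = 20000, step = 10000)

# The result has one list element per input range
first_gene_windows <- windows[[1]]

start(first_gene_windows)  # window starts, spaced by `step`
width(first_gene_windows)  # the last window can be shorter than `width`
```

Because the windows are clipped at the end of the original range, none of them extends past base 45,000.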

5. Genomic features and TxDb

Genes, transcripts, and exons are genomic features. The GenomicFeatures package retrieves and manages this information from providers like UCSC and BioMart. These annotated features are useful for ChIP-seq, RNA-seq, and annotation analyses. GenomicFeatures uses transcript database (TxDb) objects to store metadata and to manage genomic locations and the relationships between features and their identifiers. Bioconductor provides ready-made packages for the most used transcript databases. For the example, we will use the TxDb for known human genes on genome build hg38. Here is a trimmed output displaying the most important information of a TxDb object.

6. Genes, transcripts, exons

Let's now learn how to extract genomic features from a TxDb object. First, load the TxDb library and store the object. If you are interested only in a subset of chromosomes, it's recommended to pre-filter using seqlevels(). Note that this is not the only way of filtering. Here we will show two extracting functions: transcripts() and exons(). There are three others: genes(), cds(), and promoters(). All of them receive a TxDb object and the optional parameters columns and filter. Columns selects which metadata columns to return, and filter applies a condition on a column. Columns receives a character vector of column names, filter receives a named list of vectors, and the valid names are listed at the end of the slide.
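The extraction steps described here could look like the sketch below. It assumes the TxDb.Hsapiens.UCSC.hg38.knownGene annotation package is installed. The choice of chrX, the + strand, and the listed column names are illustrative assumptions, not the values used on the slide.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Store the TxDb object under a shorter name
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Pre-filter: keep only chrX active before extracting anything
seqlevels(txdb) <- "chrX"

# Transcripts on the + strand, returning only selected metadata columns
tx_plus <- transcripts(txdb,
                       columns = c("tx_id", "tx_name", "gene_id"),
                       filter = list(tx_strand = "+"))

# All exons on the active chromosome
chrX_exons <- exons(txdb, columns = c("exon_id", "tx_id"))

tx_plus
chrX_exons
```

Both calls return GRanges objects, so the usual accessors such as seqnames(), start(), and width() apply to the results.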

7. Exons by transcripts

Each gene has one or more transcripts, and each transcript has a set of exons. To find the exons in this transcript, retrieve all the exons grouped by transcript using the function exonsBy() with by = "tx", where tx is short for transcript. Then select the transcript with id 179-161. The figure shows the exons on this transcript. Each purple region is an exon, and in between exons we see introns. This transcript has 10 exons, and we see their widths as a numeric vector. Pretty neat!
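A hedged sketch of the exons-by-transcript lookup follows. It again assumes the TxDb.Hsapiens.UCSC.hg38.knownGene package is installed and restricts to chrX purely to keep the example quick. The transcript picked here is simply the first one in the list, a placeholder for whichever id you are interested in.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
seqlevels(txdb) <- "chrX"  # assumption: one chromosome is enough for the demo

# GRangesList with one element per transcript, named by transcript id
exons_by_tx <- exonsBy(txdb, by = "tx")

# Placeholder: take the first transcript id; substitute the id you care about
tx_id <- names(exons_by_tx)[1]
tx_exons <- exons_by_tx[[tx_id]]

length(tx_exons)  # number of exons in this transcript
width(tx_exons)   # exon widths as an integer vector
```

Selecting with double brackets returns a plain GRanges for that transcript, which is why width() gives one number per exon.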

8. Overlaps

To find genes of interest in a larger interval or a collection of intervals, you will use overlaps. Counting, finding, and subsetting overlaps between objects containing genomic ranges is useful and fundamental to annotating genomic features. The following functions have been optimized for iterations. countOverlaps(), findOverlaps(), and subsetByOverlaps() each need two objects to compare, a query and a subject, which can be either GRanges or GRangesList objects. An overlap is complete when the query lies entirely within the subject, and partial when only part of the query falls inside it.
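The three overlap functions can be compared side by side on a toy example. The "genes" and the region of interest below are made-up ranges, and the use of type = "within" to ask for complete overlaps only is one possible way to separate complete from partial matches.

```r
library(GenomicRanges)

# Made-up query ranges standing in for genes (coordinates are illustrative)
genes <- GRanges(seqnames = c("chr1", "chr1", "chr1", "chr2"),
                 ranges = IRanges(start = c(100, 300, 500, 100),
                                  end   = c(200, 400, 600, 200)))
names(genes) <- c("g1", "g2", "g3", "g4")

# Made-up subject: a single region of interest on chr1
region <- GRanges(seqnames = "chr1",
                  ranges = IRanges(start = 150, end = 550))

# How many subject ranges does each gene overlap (partially or completely)?
any_overlap <- countOverlaps(genes, region)

# Which query/subject pairs overlap? Returns a Hits object.
hits <- findOverlaps(genes, region)

# Keep only the genes that overlap the region
genes_in_region <- subsetByOverlaps(genes, region)

# Complete overlaps only: the gene must lie entirely within the region
complete_overlap <- countOverlaps(genes, region, type = "within")
```

Here g1 and g3 straddle the region boundaries, g2 sits entirely inside it, and g4 is on another chromosome, so only g2 counts as a complete overlap.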

9. It's your turn to put this into practice!

Time to put this into practice!