Gene of interest

1. Gene of interest

Let's start a scientific search for a gene of interest while learning more about genomic ranges.

2. Examples of genomic intervals

When we work with genome data, we mostly compare sequence intervals to a reference. A genome is represented as a linear sequence split over multiple chromosomes. So, instead of having only one sequence as with IRanges, we can have sets of sequences using genomic ranges. Additionally, biologically relevant features are included as metadata in GRanges. Examples of genomic intervals are reads aligned to a reference, genes of interest, exonic regions, SNPs, and regions of transcription or binding sites, like those studied using RNA-seq or ChIP-seq.

3. Genomic Ranges

The GenomicRanges package provides the class GRanges, a container used to store genomic intervals per chromosome. The bare minimum arguments are the chromosome name and the start and end of the interval. You can define them using a single character string. The basic difference between IRanges and GRanges is that each range is associated with a chromosome and a strand. In addition to metadata per range, like score and GC percentage, GRanges also holds sequence-level information: the sequence names, available through seqnames(), plus sequence lengths and genome, which are stored in seqinfo.
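As a minimal sketch of the two ways to build a GRanges object described here, the example below uses made-up chromosome names, coordinates, and scores. It assumes the GenomicRanges package from Bioconductor is installed.

```r
library(GenomicRanges)

# Shorthand: one character string in the form "chromosome:start-end"
gr1 <- GRanges("chr1:200-300")

# Explicit form: chromosome names, an IRanges holding start/end, and a strand.
# All values here are illustrative, not real annotations.
gr2 <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(200, 500), end = c(300, 650)),
  strand   = c("+", "-"),
  score    = c(10, 25)  # extra named arguments become per-range metadata
)

gr1
gr2
```

Printing either object shows the chromosome, range, and strand for each interval, followed by any metadata columns.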

4. From data to GRanges

You will most likely have sequence intervals stored in a data frame, though it can also be a tibble or another table-like structure. The first three columns are the minimum requirements to construct a GRanges object: chromosome or seqname, start, and end. You can also add strand (positive, negative, or unspecified) and associated metadata like scores and GC content. As in the example, you will transform a data frame using the function as(), providing as input a data frame object and the class GRanges in quotes.
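The conversion just described might look like the sketch below. The data frame, its column values, and the object names df and gr are hypothetical, chosen only to illustrate the call to as().

```r
library(GenomicRanges)

# Hypothetical table of intervals: the first three columns are the required ones
df <- data.frame(
  seqnames = c("chrX", "chrX", "chrY"),
  start    = c(100, 400, 250),
  end      = c(180, 520, 300),
  strand   = c("+", "-", "*"),  # "*" marks an unspecified strand
  score    = c(0.8, 0.5, 0.9)   # extra columns are kept as metadata
)

# Coerce the data frame to a GRanges object
gr <- as(df, "GRanges")
gr
```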

5. Genomic Ranges accessors

When using GenomicRanges, you can get, add, and update extra information using its accessors. To see a list of the available accessors, use the function methods() and specify the class GRanges. Out of the many accessors listed by methods(), a few very useful ones are: seqnames, used for chromosome names; ranges, which returns an IRanges object; mcols, which displays additional metadata per range; seqinfo, which stores a summary of the sequence information; and genome, which stores the genome name. It is important to notice that most accessors are both setter and getter functions. Another fact to highlight is that some accessors are reused between classes, thanks to inheritance within S4 definitions.
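Here is a short sketch of these accessors used as getters and setters. The object gr and its values are invented for illustration, and "hg38" is set by hand only to show the setter form.

```r
library(GenomicRanges)

# Small illustrative object with made-up coordinates and scores
gr <- GRanges(
  seqnames = c("chrX", "chrX"),
  ranges   = IRanges(start = c(100, 400), end = c(180, 520)),
  strand   = c("+", "-"),
  score    = c(0.8, 0.5)
)

# List the methods defined for the class (long output, so left commented out)
# methods(class = "GRanges")

# Getters
seqnames(gr)  # chromosome names
ranges(gr)    # an IRanges object with start, end, and width
mcols(gr)     # metadata columns per range
seqinfo(gr)   # summary of the sequence information

# Setters: the same accessor names, used on the left-hand side
genome(gr) <- "hg38"
mcols(gr)$score <- c(0.9, 0.4)

genome(gr)
```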

6. Gene of interest: ABCD1

Let's now talk about our gene of interest, a gene with an easy name to remember: ABCD1. ABCD1 is located at the end of the long arm of chromosome X. It encodes a protein relevant to the proper functioning of brain and lung cells in mammals. Chromosome X is about 156 million base pairs long, and our gene sits in a small interval around the 153 million base pair mark.

7. Chromosome X GRanges

Let's now prepare our data to explore. We will use a human reference, version hg38, from the transcripts database provided by UCSC and accessed through GenomicFeatures. We saved the data set into an object called hg. The human reference is about 3 billion bases long, and since our gene of interest is located on chromosome X, we can extract just that part of the reference using the genes() function. We add a filter argument set to a list with an element named tx_chrom equal to chrX in quotes. This returns a GRanges object with 983 genes to explore by gene_id. There are other filters, which you will use in the next exercises.
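A sketch of this step is shown below. It assumes the transcripts database is the Bioconductor annotation package TxDb.Hsapiens.UCSC.hg38.knownGene, which the narration does not name explicitly. The result name hg_chrXg is hypothetical, and the exact number of genes returned can differ between package versions.

```r
library(GenomicFeatures)
# Assumption: the hg38 transcripts database comes from this annotation package
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Save the transcripts database into an object called hg
hg <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Extract the genes whose transcripts lie on chromosome X
hg_chrXg <- genes(hg, filter = list(tx_chrom = c("chrX")))

# A GRanges object with one range per gene and a gene_id metadata column
hg_chrXg
length(hg_chrXg)
```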

8. Let's practice looking for a gene of interest in the human genome!

Now let's try some examples using GRanges and learning more about the gene of interest, ABCD1.
