1
What is Bioconductor?
In this chapter, you will get hands-on with Bioconductor, the specialized repository for bioinformatics software, developed and maintained by the R community. You will learn how to install and use Bioconductor packages. You'll also be introduced to S4 objects and functions, because most Bioconductor packages are built on the S4 class system. Finally, you will use a real genomic dataset of a fungus to explore the BSgenome package.
2
Biostrings and When to Use Them?
The Biostrings package provides memory-efficient string containers, along with matching algorithms and other utilities for fast manipulation of large biological sequences or sets of sequences. How efficient can you become by using the right containers for your sequences? You will learn about alphabets and sequence manipulation by using the tiny genome of a virus.
3
IRanges and GenomicRanges
The IRanges and GenomicRanges packages provide containers for storing and manipulating genomic intervals and variables defined along a genome. Many other Bioconductor packages rely on them as infrastructure. You will learn how to use these containers and their associated metadata to manipulate your sequences. The dataset you will be looking at is a special gene of interest in the human genome.
4
Introducing ShortRead
ShortRead is the package for input, manipulation, and quality assessment of fasta and fastq files. You can subset, trim, and filter the sequences of interest, and even generate a quality report. As a bonus, the last exercises give you the tools for parallel quality assessment with the Rqc package. To make it more exciting, you will use plant genome sequences!

Exploring a fastq file

Fastq files usually contain thousands or millions of reads and can become very large! For this exercise, you will use a small fastq subsample of 500 reads, which fits easily into memory and can be read in its entirety with the function readFastq().
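
As a minimal sketch of how readFastq() is typically called, the example below writes a tiny fastq file to a temporary location and reads it back. The two reads, their quality strings, and the names fq_lines, fq_path, and fq are made up for illustration; they are not the course data.

```r
library(ShortRead)

# Hypothetical two-read fastq file, written so the example is self-contained.
# Each record has four lines: @id, sequence, +, quality string.
fq_lines <- c(
  "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
  "@read2", "TTGGCCAATT", "+", "IIIIIHHHHH"
)
fq_path <- tempfile(fileext = ".fastq")
writeLines(fq_lines, fq_path)

# readFastq() loads every record of the file into memory at once
# and returns a ShortReadQ object.
fq <- readFastq(fq_path)

# Number of reads in the object.
length(fq)
```

Because readFastq() reads the whole file, this approach suits small files such as the 500-read subsample; very large files generally call for sampling or streaming instead.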

The original sequence file comes from Arabidopsis thaliana and was provided by the UC Davis Genome Center. It was downloaded from the Sequence Read Archive (SRA) under accession number SRR1971253. It contains DNA from leaf tissues, pooled and sequenced on an Illumina HiSeq 2000. The sequences are single-read and 50 base pairs (bp) long.

fqsample is a ShortReadQ object that contains the reads, their quality scores, and their IDs. It's your turn to explore it!

Load the ShortRead package and print fqsample to view it.
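
One possible way to approach this is sketched below. In the course environment fqsample is already loaded; here a stand-in object with the same name is built from three made-up reads so the code runs on its own. The accessors sread(), quality(), and id() come from ShortRead, and width() reports the read lengths.

```r
library(ShortRead)

# Stand-in for the course's fqsample (hypothetical data: three made-up reads).
fq_path <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
  "@read2", "TTGGCCAATT", "+", "IIIIIHHHHH",
  "@read3", "GGGGCCCCAA", "+", "HHHHHIIIII"
), fq_path)
fqsample <- readFastq(fq_path)

# Printing the object gives a summary: its class, number of reads, and widths.
fqsample

# Confirm the class of the object.
class(fqsample)

# Pull out the individual components.
sread(fqsample)     # the reads, as a DNAStringSet
quality(fqsample)   # the encoded per-base quality scores
id(fqsample)        # the read identifiers
width(fqsample)     # length of each read, in bases
```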