Removing duplicates

It is good practice to check that your sequence reads don't contain too many duplicates.

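# Load ShortRead, which provides sread() and srduplicated()
library(ShortRead)

# Sample with duplicates, of class ShortReadQ
dfqsample

# Get the reads from dfqsample
mydReads <- sread(dfqsample)

# Count duplicates: TRUE marks a read whose sequence already appeared earlier
table(srduplicated(mydReads))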

How would you go about removing duplicated reads in a file? Pay attention to what the condition should be in this filter.
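One way to think about the condition is sketched below. This is a minimal illustration, not the exercise's official solution: `toyReads` and its five sequences are hypothetical stand-in data built in memory, and the names `isDup`, `uniqueReads`, `dupFilter` and `filteredReads` are made up for this example. It assumes the ShortRead package is installed. The key point is that `srduplicated()` returns `TRUE` for the second and later copies of a sequence, so the reads to keep are those where it is `FALSE`.

```r
library(ShortRead)  # loading ShortRead also attaches Biostrings

# Hypothetical toy data: five reads, three of which share the same sequence
toyReads <- ShortReadQ(
  sread   = DNAStringSet(c("ACGT", "ACGT", "TTGA", "ACGT", "GGCC")),
  quality = FastqQuality(c("IIII", "IIII", "IIII", "IIII", "IIII")),
  id      = BStringSet(paste0("read", 1:5))
)

# Flag duplicates: the first copy is FALSE, later copies are TRUE
isDup <- srduplicated(sread(toyReads))
table(isDup)

# Keep only the reads that are NOT flagged as duplicates
uniqueReads <- toyReads[!isDup]

# The same condition wrapped in a reusable filter object
dupFilter <- srFilter(function(x) !srduplicated(sread(x)),
                      name = "noDuplicates")
filteredReads <- toyReads[dupFilter(toyReads)]
```

Applied to the exercise data, the same negated condition would be used on `dfqsample` in place of `toyReads`. Note that the negation matters: subsetting with `srduplicated(...)` directly would keep only the duplicates.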

This exercise is part of the course Introduction to Bioconductor in R.

In this chapter, you will get hands-on with Bioconductor. Bioconductor is the specialized repository for bioinformatics software, developed and maintained by the R community. You will learn how to install and use Bioconductor packages. You'll be introduced to S4 objects and functions, because most Bioconductor packages are built on the S4 system. Additionally, you will use a real genomic dataset of a fungus to explore the BSgenome package.

Exercise 1: Introduction to the Bioconductor Project
Exercise 2: Bioconductor version
Exercise 3: BiocManager to install packages
Exercise 4: The role of S4 in Bioconductor
Exercise 5: S4 class definition
Exercise 6: Interaction with classes
Exercise 7: Introducing biology of genomic datasets
Exercise 8: Discovering the yeast genome
Exercise 9: Partitioning the yeast genome
Exercise 10: Available genomes

Biostrings provides memory-efficient string containers, along with matching algorithms and other utilities for fast manipulation of large biological sequences or sets of sequences. How efficient can you become by using the right containers for your sequences? You will learn about alphabets and sequence manipulation using the tiny genome of a virus.

Exercise 1: Introduction to Biostrings
Exercise 2: Exploring the Zika virus sequence
Exercise 3: Biostrings containers
Exercise 4: Manipulating Biostrings
Exercise 5: Sequence handling
Exercise 6: From a set to a single sequence
Exercise 7: Subsetting a set
Exercise 8: Common sequence manipulation functions
Exercise 9: Why are we interested in patterns?
Exercise 10: Searching for a pattern
Exercise 11: Finding Palindromes
Exercise 12: Finding a conserved region within six frames
Exercise 13: Looking for a match

The IRanges and GenomicRanges packages also provide containers, in this case for storing and manipulating genomic intervals and variables defined along a genome. Because of their rich features, these packages provide infrastructure and support to many other Bioconductor packages. You will learn how to use these containers and their associated metadata to manipulate your sequences. The dataset you will be looking at is a special gene of interest in the human genome.

Exercise 1: IRanges and Genomic Structures
Exercise 2: IRanges
Exercise 3: Constructing IRanges
Exercise 4: Interacting with IRanges
Exercise 5: Gene of interest
Exercise 6: From tabular data to Genomic Ranges
Exercise 7: GenomicRanges accessors
Exercise 8: ABCD1 mutation
Exercise 9: Human genome chromosome X
Exercise 10: Manipulating collections of GRanges
Exercise 11: A sequence window
Exercise 12: Is it there?
Exercise 13: More about ABCD1
Exercise 14: How many transcripts?
Exercise 15: From GRangesList object into a GRanges object

ShortRead is the package for input, manipulation, and assessment of FASTA and FASTQ files. You can subset, trim, and filter the sequences of interest, and even generate a quality report. As a bonus, the last exercises introduce Rqc, a tool for parallel quality assessment. For this chapter, you will use plant genome sequences!

Exercise 1: Sequence files
Exercise 2: Why fastq?
Exercise 3: Reading in files
Exercise 4: Exploring a fastq file
Exercise 5: Extract a sample from a fastq file
Exercise 6: Sequence quality
Exercise 7: Exploring sequence quality
Exercise 8: Base quality plot
Exercise 9: Try your own nucleotide frequency plot
Exercise 10: Match and filter
Exercise 11: Filtering reads on the go!
Exercise 12: Removing duplicates (current exercise)
Exercise 13: More filtering!
Exercise 14: Multiple assessment
Exercise 15: Plotting cycle average quality
Exercise 16: Introduction to Bioconductor