1. Match and filter
This has been great so far. Let's now learn how to do matching and filtering at the time of reading big files. This will save you time and resources!
2. Duplicate sequences
Finding identical sequences in your file should trigger an alarm, especially if there are many duplicates.
A few duplicates occur naturally in biological samples. However, duplication also commonly results from PCR amplification during library preparation, and from sequencing the same molecule more than once.
If sequencing has errors or contamination, you might find that 30% to 70% of your reads are identical copies of each other.
If you are working with whole-genome or exome sequencing, it is recommended to remove duplicates, or at least mark them, as they adversely affect scaffolding and the discovery of large-scale genomic variants.
With this in mind, you should set a threshold for an acceptable duplicate percentage when working with RNA-seq or ChIP-seq, because you do not want to discard biologically relevant sequences.
3. srduplicated
This example contains 100% duplication.
To check for duplicates, you can use the srduplicated() function which returns a logical value for each read in the file. TRUE means the read is duplicated.
If you use this with table(), you get the counts of duplicated (TRUE) and unique (FALSE) reads.
If all of the reads are duplicated, like in this example, something has clearly gone wrong.
One way to clean these reads is to subset, keeping the reads marked as not duplicated by using the negated logical vector as a condition.
Finally, if you use table() with srduplicated() on the cleaned reads, you will get no duplicates!
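The steps above can be sketched as follows. This is a minimal example assuming fqsample is a ShortReadQ object already loaded with readFastq():

```r
library(ShortRead)

# Logical vector: TRUE marks a read already seen in the file
table(srduplicated(fqsample))

# Keep only the reads not marked as duplicated
cleanReads <- fqsample[!srduplicated(fqsample)]

# Re-check: every read should now come back FALSE
table(srduplicated(cleanReads))
```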
4. Creating your own filters
What about your own filters?
srFilter() is a function to construct your own personalized ShortRead filters.
The filter it builds accepts a single argument (in the example, fqsample) and returns a logical vector used to select the elements of fqsample that satisfy a condition.
The filter is saved in an object for convenience, here called readWidthCutOff. srFilter() receives a function of x, followed by a condition. Extra parameters, for example minWidth, can be defined before constructing the filter. The name of the filter is optional.
Lastly, the filter is applied to fqsample by using it to subset.
A filter can even be applied at the time of reading the fastq file, via the parameter named filter.
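As a sketch, the custom width filter could look like this; the cutoff value of 30 and the file path and pattern are assumptions for illustration:

```r
library(ShortRead)

# Extra parameter defined before constructing the filter
minWidth <- 30

# Custom filter: keep reads at least minWidth bases long;
# the function of x returns a logical vector over the reads
readWidthCutOff <- srFilter(function(x) {
  width(x) >= minWidth
}, name = "MinWidth")

# Apply the filter by subsetting an already-loaded ShortReadQ object
fqsample[readWidthCutOff(fqsample)]

# Or apply it while reading the fastq file
filtered <- readFastq(dirPath = "data/", pattern = ".fastq",
                      filter = readWidthCutOff)
```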
5. nFilter
Now you know that you can create your own custom filters!
But, I wanted you to also take advantage of some built-in filters from the ShortRead package.
For example, nFilter() has a threshold parameter representing the maximum number of N's allowed in each read.
The second parameter, .name, is optional and adds a custom name to your filter.
As in the previous exercise, you could use the filter as a subsetting function. Or, like in this example, you can use the filter directly when reading the fastq files.
This is a very fast cleaning step.
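A minimal sketch of nFilter in both modes; the threshold of 10 and the file path are assumptions:

```r
library(ShortRead)

# Built-in filter: allow at most 10 N's per read
filterN <- nFilter(threshold = 10, .name = "cleanNFilter")

# Use it directly while reading: only passing reads are loaded
filtered <- readFastq(dirPath = "data/", pattern = ".fastq",
                      filter = filterN)

# Or subset reads already in memory
fqsample[filterN(fqsample)]
```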
6. idFilter and polynFilter
Similarly, if you know which ids are the ones you need you can use, for example, the idFilter() function. It uses a regular expression to select specific ids, also called sequence names.
myFilterID will select those ids that contain :3:1.
Other optional parameters are .name, fixed, and exclude; the last two are logical parameters that influence how the matching occurs.
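The id filter from the example can be sketched like this, assuming fqsample is an already-loaded ShortReadQ object:

```r
library(ShortRead)

# Keep only reads whose id (sequence name) contains the pattern :3:1
myFilterID <- idFilter(regex = ":3:1")

# fixed = TRUE would treat the pattern literally rather than as a
# regular expression, and exclude = TRUE would invert the match
fqsample[myFilterID(fqsample)]
```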
Another nice filter to include is polynFilter(). Here, myFilterPolyA accepts a maximum of 10 consecutive A's per sequence read and filters out everything else.
Like before, you can use any of these filters at the time of reading the file, hence only reading what you need (or use them for subsetting).
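The poly-A filter described above can be sketched as follows; the file path and pattern are assumptions:

```r
library(ShortRead)

# Filter out reads with runs of more than 10 A's;
# nuc selects which nucleotide(s) the threshold applies to
myFilterPolyA <- polynFilter(threshold = 10, nuc = c("A"))

# Subset reads already in memory...
fqsample[myFilterPolyA(fqsample)]

# ...or apply the filter while reading the file
filtered <- readFastq(dirPath = "data/", pattern = ".fastq",
                      filter = myFilterPolyA)
```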
7. Let's practice using filters!
Time to put this into practice, you can do it!