Quality Control of RNA-seq Data

Do not rush through the quality control step of your RNA sequencing (RNA-seq) data processing pipeline. It is really the data exploration part of whatever pipeline you are running, and you will thank yourself later for taking the time to do it properly and thoroughly. At LENSai (IPA), quality control (QC) is an important part of our RNA expression analysis and variant calling pipelines. The list of things that can go wrong with RNA-seq data is extensive, so your best chance of catching funky chemistry or biology early is to put extra effort into the quality control step.

In this post, I will briefly review some common QC steps and the associated tools, as well as some tips and tricks that I wish had been mentioned in that off-the-shelf tutorial! I will focus on RNA-seq data specifically, since it involves a couple of additional levels of complexity compared with whole genome sequencing data.

Roughly speaking, QC consists of two main components: quality control of raw reads and quality control of aligned reads.

Quality control of raw reads

This part is essentially the same as for whole genome sequencing data. You want to check at least the following metrics:

A canonical tool for collecting all of the above (and more!) is FastQC.
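To make the idea of per-read metrics concrete, here is a minimal Python sketch that computes two quantities of the kind FastQC reports: the mean Phred quality and the GC fraction of each read. The choice of these two metrics, the function names, and the Phred+33 offset are assumptions made for illustration. This is not how FastQC is implemented, and it is no substitute for running the real tool.

```python
def read_fastq(lines):
    """Yield (sequence, quality_string) pairs from an iterable of FASTQ lines.

    Assumes well-formed, four-line FASTQ records (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield seq, qual


def read_metrics(seq, qual, phred_offset=33):
    """Return (mean Phred quality, GC fraction) for one non-empty read.

    phred_offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    scores = [ord(ch) - phred_offset for ch in qual]
    gc_count = sum(base in "GCgc" for base in seq)
    return sum(scores) / len(scores), gc_count / len(seq)


# Two made-up reads, purely for demonstration.
example = [
    "@read1", "GGCC", "+", "IIII",
    "@read2", "ATAT", "+", "!!!!",
]
for seq, qual in read_fastq(example):
    mean_quality, gc_fraction = read_metrics(seq, qual)
    print(seq, mean_quality, gc_fraction)
```

Looking at the distribution of these values across all reads, rather than at a single average, is what makes anomalies visible.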

For reads that underperform on some of the QC metrics, the eternal questions are “To trim or not to trim?” and “To filter or not to filter?” Unfortunately, a “one size fits all” answer does not and cannot exist. Computational tools alone often cannot settle the question, and insight from a biologist becomes indispensable.

Quality control of aligned reads

Once the reads are aligned, some other useful QC metrics become available:

Qualimap is one of the tools covering these metrics.
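One aligned-read metric that comes up later in this post is the proportion of ambiguously aligned reads (multi-mappers). As a rough sketch, assuming your aligner writes the optional SAM tag `NH:i:` (the number of reported alignments for the read) and that the alignments are available as plain SAM text, the fraction could be estimated as below. The function name is hypothetical, and records without an `NH` tag are counted as uniquely mapped, which is an assumption.

```python
def multimapper_fraction(sam_lines):
    """Fraction of mapped primary alignment records with NH > 1."""
    mapped = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x900:
            continue
        mapped += 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                if int(tag[5:]) > 1:
                    multi += 1
                break
    return multi / mapped if mapped else 0.0


def make_record(name, flag, nh=None):
    """Build a made-up SAM record for demonstration purposes."""
    fields = [name, str(flag), "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"]
    if nh is not None:
        fields.append(f"NH:i:{nh}")
    return "\t".join(fields)


example = [
    "@HD\tVN:1.6",
    make_record("r1", 0, nh=1),
    make_record("r2", 0, nh=3),
    make_record("r2", 256, nh=3),  # secondary alignment, skipped
    make_record("r3", 16, nh=1),
    make_record("r4", 4),          # unmapped, skipped
]
print(multimapper_fraction(example))
```

For paired-end data each mate is a separate record, so this counts records rather than fragments.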

Personal experience and extra tips

The last time we processed RNA-seq data at LENSai (IPA) on a collection of samples, we had a strong suspicion that the data had significant ribosomal RNA (rRNA) contamination. The suspicion was triggered by a rather high proportion of ambiguously aligned reads. After an excessive amount of time spent researching what proportion of multi-mappers should be acceptable, a simple solution emerged: look at the most highly expressed genes. For the samples in question, the rRNA genes were indeed among the most highly expressed genes. Somewhat annoying, but good to be aware of.
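The check described above is easy to sketch in Python. The example below assumes you already have a per-gene read count table for a sample and a set of gene identifiers annotated as rRNA. The function names, gene identifiers, and counts are all made up for illustration and are not taken from our pipeline.

```python
def top_expressed(counts, n=10):
    """Return the n (gene, count) pairs with the highest counts, highest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


def rrna_fraction(counts, rrna_genes):
    """Return the fraction of all counted reads assigned to rRNA genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rrna_total = sum(c for gene, c in counts.items() if gene in rrna_genes)
    return rrna_total / total


# Hypothetical counts for one sample.
counts = {"rRNA_gene_1": 6000, "gene_A": 2500, "gene_B": 1000, "gene_C": 500}
rrna_genes = {"rRNA_gene_1"}  # would normally come from the annotation

for gene, count in top_expressed(counts, n=3):
    label = "rRNA" if gene in rrna_genes else ""
    print(gene, count, label)
print(rrna_fraction(counts, rrna_genes))
```

What counts as too much rRNA depends on the library preparation, so the number is best compared across samples of the same experiment rather than against a fixed cutoff.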

The same collection of samples exhibited another unwelcome feature: our processing pipeline required significantly more resources for some of the samples than for the others. In hindsight, those troublesome samples could have been identified preventively by careful examination of the QC metrics.
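One generic way to spot such samples ahead of time is to compare each sample's QC metric against the rest of the batch. The sketch below uses the median and the median absolute deviation (MAD) as a robust yardstick. This is a general technique I am using for illustration, not necessarily the check we ended up with, and the metric, the sample names, and the threshold are assumptions.

```python
from statistics import median


def flag_outliers(metric_by_sample, threshold=3.0):
    """Return the sorted names of samples whose metric lies more than
    `threshold` MADs away from the batch median."""
    values = list(metric_by_sample.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return sorted(
        sample
        for sample, value in metric_by_sample.items()
        if abs(value - med) / mad > threshold
    )


# Made-up total read counts (in millions) per sample.
reads_in_millions = {"s1": 20, "s2": 22, "s3": 21, "s4": 19, "s5": 60}
print(flag_outliers(reads_in_millions))
```

A flagged sample is not necessarily bad, but it is a good candidate for a closer look before the expensive steps of the pipeline run.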

These hiccups motivated the introduction of additional simple but effective steps into our QC pipeline:

These QC steps have become an integral part of our RNA expression analysis and variant calling pipelines at LENSai (IPA). I hope these tips for QC of raw and aligned reads will be helpful for you as well, and will save you some of your valuable time.

Useful references

There are plenty of tutorials dedicated to RNA-seq data processing. Here are a couple with comparatively extended discussions on different QC steps as well as a detailed interpretation of what different anomalies might indicate chemically or biologically:

Originally published at https://blog.biostrand.be.

BioStrand (a subsidiary of IPA)

Software and proprietary solutions for MULTI-omics data analysis. Effective research requires convenient and scalable tools.