experimental design, especially when the experiment in-
volves a large number of samples that need to be proc-
essed in several batches. In this case, including controls,
randomizing sample processing and smart management
of sequencing runs are crucial to obtain error-free data
(Fig. 1a; Box 2).
Analysis of the RNA-seq data
The actual analysis of RNA-seq data has as many varia-
tions as there are applications of the technology. In this
section, we address all of the major analysis steps for a
typical RNA-seq experiment, which involve quality con-
trol, read alignment with and without a reference genome,
obtaining metrics for gene and transcript expression, and
approaches for detecting differential gene expression. We
also discuss analysis options for applications of RNA-seq
involving alternative splicing, fusion transcripts and small
RNA expression. Finally, we review useful packages for
data visualization.
Quality-control checkpoints
The acquisition of RNA-seq data consists of several
steps — obtaining raw reads, read alignment and quanti-
fication. At each of these steps, specific checks should
be applied to monitor the quality of the data (Fig. 1a).
Raw reads
Quality control for the raw reads involves the analysis of
sequence quality, GC content, the presence of adaptors,
overrepresented k-mers and duplicated reads in order to
detect sequencing errors, PCR artifacts or contamina-
tions. Acceptable duplication, k-mer or GC content
levels are experiment- and organism-specific, but these
values should be homogeneous for samples in the same
experiment. We recommend that outliers with over
30 % disagreement be discarded. FastQC [11] is a
popular tool to perform these analyses on Illumina
reads, whereas NGSQC [12] can be applied to any plat-
form. As a general rule, read quality decreases towards
the 3’ end of reads, and if it becomes too low, bases
should be removed to improve mappability. Software
tools such as the FASTX-Toolkit [13] and Trimmomatic
[14] can be used to discard low-quality reads, trim
adaptor sequences, and eliminate poor-quality bases.
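As an illustration of the raw-read checks described above, the following is a minimal Python sketch, not a substitute for FastQC or Trimmomatic. It assumes Phred+33 quality encoding and well-formed four-line FASTQ records; the function names, the quality threshold of 20 and the two example reads are hypothetical choices made for this example. The 3' trimming shown here is a naive per-base rule, simpler than the windowed strategies that dedicated trimmers offer.

```python
def phred_scores(quality_line, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def gc_fraction(seq):
    """Return the fraction of G and C bases in a read (0.0 for an empty read)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def read_fastq(lines):
    """Yield (name, sequence, quality string) from well-formed 4-line records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


# Hypothetical two-read example; 'I' encodes Q40 and '#' encodes Q2.
example = ["@read1", "ACGTGGCC", "+", "IIIIII##",
           "@read2", "ATATATAT", "+", "IIIIIIII"]

for name, seq, qual in read_fastq(example):
    trimmed_seq, trimmed_q = trim_3prime(seq, phred_scores(qual))
    print(name, round(gc_fraction(seq), 2), trimmed_seq)
```

Comparing the per-sample distributions of such values (GC fraction, retained read length) across an experiment is one simple way to spot the outlier samples mentioned above.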
Read alignment
Reads are typically mapped to either a genome or a tran-
scriptome, as will be discussed later. An important map-
ping quality parameter is the percentage of mapped
reads, which is a global indicator of the overall sequen-
cing accuracy and of the presence of contaminating
DNA. For example, we expect between 70 and 90 % of
regular RNA-seq reads to map onto the human genome
(depending on the read mapper used) [15], with a sig-
nificant fraction of reads mapping to a limited number
of identical regions equally well (‘multi-mapping reads’).
When reads are mapped against the transcriptome, we
expect slightly lower total mapping percentages because
reads coming from unannotated transcripts will be lost,
and significantly more multi-mapping reads because of
reads falling onto exons that are shared by different
transcript isoforms of the same gene.
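The mapping percentage and the share of multi-mapping reads can be tallied directly from SAM records. The sketch below is a simplified illustration under stated assumptions: it counts one primary record per read (or per mate), skips secondary and supplementary records using the standard SAM flag bits, and relies on the optional NH tag to identify multi-mapping reads. Not every aligner writes the NH tag, so records without it are treated here as uniquely mapped. The function name and the example records are hypothetical.

```python
def mapping_summary(sam_lines):
    """Summarise mapped and multi-mapping reads from SAM text records."""
    total = mapped = multi = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary record: not a new read
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        # NH:i:<n> is an optional tag; assume a unique hit when it is absent.
        nh = next((int(f[5:]) for f in fields[11:] if f.startswith("NH:i:")), 1)
        if nh > 1:
            multi += 1
    return {
        "reads": total,
        "mapped": mapped,
        "multi_mapped": multi,
        "percent_mapped": 100.0 * mapped / total if total else 0.0,
        "percent_multi_of_mapped": 100.0 * multi / mapped if mapped else 0.0,
    }


def sam_record(name, flag, tags=()):
    """Build a hypothetical SAM line with placeholder alignment fields."""
    core = [name, str(flag), "chr1", "100", "60", "8M", "*", "0", "0",
            "ACGTACGT", "IIIIIIII"]
    return "\t".join(core + list(tags))


example = [
    "@HD\tVN:1.6",
    sam_record("r1", 0, ["NH:i:1"]),
    sam_record("r2", 0, ["NH:i:3"]),
    sam_record("r2", 256, ["NH:i:3"]),  # secondary alignment of r2
    sam_record("r3", 4),                # unmapped read
    sam_record("r4", 16, ["NH:i:1"]),
]

print(mapping_summary(example))
```

For real data, a dedicated tool or library that reads BAM files would normally be used; the point here is only to make explicit which records enter the numerator and the denominator.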
Other important parameters are the uniformity of read
coverage on exons and the mapped strand. If reads
Box 2. Experiment execution choices
RNA-seq library preparation and sequencing procedures include
a number of steps (RNA fragmentation, cDNA synthesis, adapter
ligation, PCR amplification, bar-coding, and lane loading) that
might introduce biases into the resulting data [196]. Including
exogenous reference transcripts (‘spike-ins’) is useful both for
quality control [1, 197] and for library-size normalization [198].
For bias minimization, we recommend following the suggestions
made by Van Dijk et al. [199], such as the use of adapters with
random nucleotides at the extremities or the use of chemical-based
fragmentation instead of RNase III-based fragmentation. If the
RNA-seq experiment is large and samples have to be processed in
different batches and/or Illumina runs, caution should be taken to
randomize samples across library preparation batches and lanes so
as to avoid technical factors becoming confounded with
experimental factors. Another option, when samples are individually
barcoded and multiple Illumina lanes are needed to achieve the
desired sequencing depth, is to include all samples in each lane,
which would minimize any possible lane effect.
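One way to randomize samples across library preparation batches while keeping experimental groups balanced is a blocked randomization. The Python sketch below is an illustrative assumption rather than a prescribed protocol: samples are shuffled within each condition and then dealt to batches in turn, so that no batch is dominated by a single condition. The function name, the sample identifiers and the fixed seed are hypothetical.

```python
import random
from collections import Counter, defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign (sample_id, condition) pairs to batches, balanced by condition."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    batches = [[] for _ in range(n_batches)]
    slot = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within each experimental group
        for sample_id in ids:
            # Deal samples to batches in turn; the slot counter carries over
            # between conditions so that batch sizes stay even.
            batches[slot % n_batches].append((sample_id, condition))
            slot += 1
    return batches


# Hypothetical design: 6 control and 6 treated samples, 3 batches.
samples = [(f"C{i}", "control") for i in range(1, 7)]
samples += [(f"T{i}", "treated") for i in range(1, 7)]

for number, batch in enumerate(assign_batches(samples, 3), start=1):
    print(number, dict(Counter(cond for _, cond in batch)), batch)
```

The same function can be applied a second time to distribute libraries across lanes when it is not possible to include every sample in each lane.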
Table 1 Statistical power to detect differential expression varies
with effect size, sequencing depth and number of replicates
                                        Replicates per group
                                        3        5        10
Effect size (fold change)
  1.25                                  17 %     25 %     44 %
  1.5                                   43 %     64 %     91 %
  2                                     87 %     98 %     100 %
Sequencing depth (millions of reads)
  3                                     19 %     29 %     52 %
  10                                    33 %     51 %     80 %
  15                                    38 %     57 %     85 %
Example of calculations for the probability of detecting differential expression
in a single test at a significance level of 5 %, for a two-group comparison using
a Negative Binomial model, as computed by the RNASeqPower package of
Hart et al. [190]. For a fixed within-group variance (package default value), the
statistical power increases with the difference between the two groups (effect
size), the sequencing depth, and the number of replicates per group. This
table shows the statistical power for a gene with 70 aligned reads, which was
the median coverage for a protein-coding gene for one whole-blood RNA-seq
sample with 30 million aligned reads from the GTEx Project [214].
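The qualitative trends in Table 1 can be explored with a normal-approximation power calculation. The Python sketch below is not the RNASeqPower code and is not expected to reproduce the values in the table. It assumes the approximation power = Φ(|ln Δ| · sqrt(n / (2 (1/μ + cv²))) − z), where Δ is the fold change, μ the read count for the gene, n the number of replicates per group, cv the within-group coefficient of variation and z the two-sided critical value at the chosen significance level. The value cv = 0.4 is an arbitrary placeholder, not the package default.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(depth, n, effect, cv=0.4, alpha=0.05):
    """Approximate power of a two-group comparison for a single gene.

    depth  : aligned reads for the gene (mu)
    n      : replicates per group
    effect : fold change between groups (Delta)
    cv     : within-group coefficient of variation (assumed placeholder)
    alpha  : two-sided significance level
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    # Variance of the log count for one sample: counting noise plus
    # biological variability.
    per_sample_var = 1.0 / depth + cv ** 2
    z_effect = abs(log(effect)) * sqrt(n / (2 * per_sample_var))
    return normal.cdf(z_effect - z_alpha)


# Illustrative grid for a gene with 70 aligned reads.
for fold_change in (1.25, 1.5, 2):
    row = [approx_power(70, n, fold_change) for n in (3, 5, 10)]
    print(fold_change, [f"{100 * p:.0f} %" for p in row])
```

Under these assumptions, power rises with the fold change, the read count and the number of replicates, which is the pattern summarized in Table 1; the absolute numbers depend strongly on the assumed cv.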
Conesa et al. Genome Biology (2016) 17:13 Page 4 of 19