
http://genomebiology.com/2009/10/3/R32 Genome Biology 2009, Volume 10, Issue 3, Article R32 Harismendy et al. R32.3
sequence coverage and for systematic biases giving rise to low
coverage. We show that each NGS platform generates its own
unique pattern of biased sequence coverage that is consistent
between samples. For the short-read platforms, low coverage
intervals tend to be in AT-rich repetitive sequences. We also
performed a comparative analysis with sequence generated
by the well-established ABI Sanger platform (Figure 1) to
determine base calling accuracies and how average fold
sequence coverage impacts base calling errors. Although the
three NGS technologies correctly identify >95% of variant
alleles, the average sequence coverage required to achieve
this performance is greater than the targeted levels of most
current studies.
Results
Generation and alignment of sequence reads to targeted intervals
The targeted sequence was amplified in the four DNA sam-
ples using long-range PCR (LR-PCR) reactions that were
combined in equimolar amounts and sequenced using the
three NGS technologies (Figure 1). For the Roche 454 platform
we obtained an average of 49,000 reads per sample with an
average length of 245 bp (Supplemental Table 1 in Additional
data file 1). Using Illumina GA we generated an average of
5.9 million reads per sample, each 36 bases in length, and
using ABI SOLiD we obtained an average of 19.7 million reads
per sample, each 35 bases in length. Thus, the amount of
sequence data generated and analyzed was dependent on the
NGS platform and the fraction of the run that was utilized.
The NGS technologies generate a large amount of sequence
but, for the platforms that produce short-sequence reads,
greater than half of this sequence is not usable. On average,
55% of the Illumina GA reads pass quality filters, of which
approximately 77% align to the reference sequence (Supple-
mental Table 1 in Additional data file 1; Additional data file 2).
For ABI SOLiD, approximately 35% of the reads pass quality
filters, and 96% of the filtered reads align to the
reference sequence. Thus, only 43% and 34% of the Illumina
GA and ABI SOLiD raw reads, respectively, are usable. In
contrast to the short-read platforms, approximately 95% of
the Roche 454 reads uniquely align to the target sequence.
When designing experiments and calculating the target
coverage for a region, one must therefore account for the
fraction of raw sequence that is alignable.
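The usable fractions quoted above are the product of the filter pass rate and the alignment rate, and the same product can be used to scale a sequencing request. The following is a minimal sketch in Python, not the procedure used in this study: the function names, the 100 kb target, and the 30× depth are hypothetical values chosen for illustration, and uniform coverage is assumed, which the results below show does not hold in practice. With the rounded rates quoted in the text the products come out near 0.42 and 0.34.

```python
import math


def usable_fraction(pass_filter: float, aligned: float) -> float:
    """Fraction of raw reads that pass quality filters and then align."""
    return pass_filter * aligned


def raw_reads_needed(target_bp: int, depth: float, read_len: int, usable: float) -> int:
    """Raw reads needed for an average `depth` over `target_bp`.

    Assumes perfectly uniform coverage, which is an idealization.
    """
    if not 0 < usable <= 1:
        raise ValueError("usable must be in (0, 1]")
    aligned_reads = target_bp * depth / read_len
    return math.ceil(aligned_reads / usable)


# Rates quoted in the text (rounded values).
illumina = usable_fraction(0.55, 0.77)
solid = usable_fraction(0.35, 0.96)
print(f"Illumina GA usable fraction: {illumina:.3f}")
print(f"ABI SOLiD usable fraction:   {solid:.3f}")

# Hypothetical experiment: 100 kb target, 30x depth, 36-base reads.
example = raw_reads_needed(target_bp=100_000, depth=30, read_len=36, usable=illumina)
print(f"Raw reads required: {example}")
```

Dividing by the usable fraction, rather than multiplying the yield afterwards, makes it explicit that more than twice as many raw short reads must be requested as the naive depth calculation suggests.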
Overrepresentation of amplicon end sequences
In examining the distribution of mapped reads, we observed
that the sequences corresponding to the 50 bp at the ends and
the overlapping intervals of the amplicons have extremely
high coverage (Figure 2; Additional data file 2). These
regions, representing about 2.3% (approximately 6 kb) of the
targeted intervals, account for up to 56% of the sequenced
base pairs for Illumina GA technology. This extreme sequence
coverage bias results from overrepresentation of the ampli-
con ends in the DNA samples after fragmentation prior to
library generation. For the ABI SOLiD platform an amplicon
end depletion protocol was employed to remove the overrep-
resented amplicon ends; this was partially successful and
resulted in the ends accounting for up to 11% of the sequenced
base pairs. For the Roche 454 technology, overrepresentation
of amplicon ends versus internal bases is substantially less,
with the ends composing only 5% of the total sequenced
bases; this is likely due to differences in library preparation
between Roche 454 and the short-read platforms. The
overrepresentation of amplicon end sequences not only wastes
sequencing yield but also decreases the average coverage depth
achieved across the rest of the targeted intervals. Therefore, to
accurately assess the consequences of sequence coverage on data
quality, we removed the 50 bp at the ends of the amplicons from
subsequent analyses.
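The cost of this bias can be estimated from two numbers given above: the share of the target occupied by amplicon ends (about 2.3%) and the share of sequenced bases they absorb. The sketch below is an illustrative back-of-envelope calculation, not an analysis from this study; the function names are hypothetical, and because the text reports the end shares as "up to" values for the short-read platforms, the results should be read as bounds.

```python
def interior_depth_ratio(end_base_share: float, end_target_share: float) -> float:
    """Interior coverage relative to the depth expected if all sequenced
    bases were spread uniformly over the whole target."""
    return (1.0 - end_base_share) / (1.0 - end_target_share)


def end_enrichment(end_base_share: float, end_target_share: float) -> float:
    """Average per-base depth at amplicon ends divided by interior depth."""
    end_depth = end_base_share / end_target_share
    return end_depth / interior_depth_ratio(end_base_share, end_target_share)


END_TARGET_SHARE = 0.023  # ends are about 2.3% of the targeted intervals

# Share of sequenced bases falling in amplicon ends, as quoted in the text.
end_base_share = {"Illumina GA": 0.56, "ABI SOLiD": 0.11, "Roche 454": 0.05}

for platform, share in end_base_share.items():
    ratio = interior_depth_ratio(share, END_TARGET_SHARE)
    enrich = end_enrichment(share, END_TARGET_SHARE)
    print(f"{platform}: interior depth x{ratio:.2f} of uniform, ends x{enrich:.1f} vs interior")
```

Under these assumptions, a platform on which the ends absorb 56% of the bases delivers less than half of the interior depth that a uniform library would have produced from the same run.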
Sequence coverage of targeted intervals
For each platform we generated a saturating level of redun-
dant sequence coverage, meaning that increased coverage is
likely to have minimal, if any, effect on data quality. For the
four samples the average sequence coverage depth across the
analyzed base pairs is 43×, 188×, and 841× for Roche 454,
Illumina GA, and ABI SOLiD, respectively (Supplemental
Table 2 in Additional data file 1). For all three NGS technolo-
gies there is greater than a hundred-fold variation in the per-
base sequence coverage depth (Figure 2). We performed sev-
eral analyses to determine if the sample preparation method
and/or a specific class of sequence elements were responsible
for the observed variability (Additional data file 2). We first
tested whether the large variability resulted from pooling of
the amplicons. For 90% of the amplicons the fold difference
in average coverage of unique sequences is less than 2.46,
2.72, and 2.99 on the Roche 454, Illumina GA and ABI SOLiD
platforms, respectively (Supplemental Table 3 in Additional
data file 1), showing that the error in equimolar pooling or
amplicon specific bias (sequence, length) explains only a
small fraction of the observed coverage variability. Next we
examined how the sequence coverage differs within the individual
amplicons. For Roche 454, Illumina GA, and ABI SOLiD the average
coefficient of variation was 0.33, 0.9, and 0.73, respectively,
for all base pairs, and 0.35, 0.84, and 0.76, respectively, when
restricted to unique non-repetitive sequence, defined here as not
present in the RepBase database [16]. These results indicate that
unique sequences
present at equimolar amounts in the library generation step
end up being covered at vastly different read depths.
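The within-amplicon statistic used above is the standard deviation of per-base depth divided by its mean, averaged over amplicons. A minimal sketch of that calculation follows; the function names and the toy depth values are invented for illustration and are not data from this study, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Mapping, Sequence


def coefficient_of_variation(depths: Sequence[float]) -> float:
    """Population standard deviation of per-base depth divided by its mean."""
    mu = mean(depths)
    if mu == 0:
        raise ValueError("mean depth is zero; the ratio is undefined")
    return pstdev(depths) / mu


def mean_amplicon_cv(depths_by_amplicon: Mapping[str, Sequence[float]]) -> float:
    """Average of the per-amplicon coefficients of variation."""
    return mean(coefficient_of_variation(d) for d in depths_by_amplicon.values())


# Toy per-base depths for two hypothetical amplicons (made-up numbers).
toy = {
    "amplicon_A": [40, 42, 38, 41, 39],      # fairly even coverage
    "amplicon_B": [5, 120, 60, 200, 15],     # highly uneven coverage
}

for name, depths in toy.items():
    print(f"{name}: CV = {coefficient_of_variation(depths):.2f}")
print(f"average CV = {mean_amplicon_cv(toy):.2f}")
```

Because the ratio is normalized by the mean, it allows platforms sequenced to very different average depths to be compared on the same scale.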
It is important to consider how well the NGS technologies are
able to generate sequence reads containing repetitive ele-
ments as these sequences comprise approximately 45% of the
human genome and may potentially impact genome function.
Compared to unique sequences, the Roche 454 technology
has a 1.25-fold overrepresentation of LINE elements, Illu-
mina GA has greater than 2-fold higher coverage of SINEs,
Alus and simple repeats, while for ABI SOLiD all repetitive