BIOINFORMATICS ORIGINAL PAPER
Vol. 27 no. 21 2011, pages 2987–2993
doi:10.1093/bioinformatics/btr509
Sequence analysis
Advance Access publication September 8, 2011
A statistical framework for SNP calling, mutation discovery,
association mapping and population genetical parameter
estimation from sequencing data
Heng Li
Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA
Associate Editor: Jeffrey Barrett
ABSTRACT
Motivation: Most existing methods for DNA sequence analysis rely
on accurate sequences or genotypes. However, in applications of
next-generation sequencing (NGS), accurate genotypes may not be
easily obtained (e.g. multi-sample low-coverage sequencing or
somatic mutation discovery). These applications call for new
methods for analyzing sequence data with uncertainty.
Results: We present a statistical framework for calling SNPs,
discovering somatic mutations, inferring population genetical
parameters and performing association tests directly based on
sequencing data without explicit genotyping or linkage-based
imputation. On real data, we demonstrate that our method achieves
comparable accuracy to alternative methods for estimating site allele
count, for inferring allele frequency spectrum and for association
mapping. We also highlight the necessity of using symmetric
datasets for finding somatic mutations and confirm that for
discovering rare events, mismapping is frequently the leading source
of errors.
Availability: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org
Received on July 20, 2011; revised on August 30, 2011; accepted
on September 1, 2011
1 INTRODUCTION
The 1000 Genomes Project (1000 Genomes Project Consortium,
2010) sets an excellent example of how to design a sequencing
project to get the maximum output pertinent to human populations.
An important lesson from this project is to sequence many human
samples at relatively low coverage instead of a few samples
at high coverage. We adopt this strategy because with higher
coverage, we will mostly reconfirm information from other reads,
whereas with more samples, we will reduce sampling fluctuations,
gain power on variants present in multiple samples and gain access
to many more rare variants. On the other hand, sequencing
errors counteract the power in variant calling, which necessitates a
minimum coverage. The optimal balancing point is broadly regarded
to be in the 2–6 fold range per sample (Le and Durbin, 2010; Li
et al., 2011), depending on the sequencing error rate, level of linkage
disequilibrium (LD) and the purpose of the project.
A major concern with this design is that at 2–6 fold coverage
per sample, non-reference alleles may not always be covered by
sequence reads, especially at heterozygous loci. Calling variants
from each individual and then combining the calls usually yield poor
results. The preferred strategy is to enhance the power of variant
discovery by jointly considering all samples (DePristo et al., 2011;
Le and Durbin, 2010; Li et al., 2011; Nielsen et al., 2011). This
strategy largely solves the variant discovery problem, but acquiring
accurate genotypes for each individual remains unsolved. Without
accurate genotypes, most of the previous methods [e.g. testing
Hardy–Weinberg equilibrium (HWE) and association mapping]
would not work.
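To quantify the first point above (that at 2–6 fold coverage non-reference alleles may go unseen), here is a back-of-envelope sketch under our own assumptions (the Poisson depth model and example depths are illustrative, not from the text): at a heterozygous site each read carries the non-reference allele with probability 1/2, so with depth D ~ Poisson(lam) the chance that no read shows it is sum_d P(D=d)(1/2)^d = exp(-lam/2).

```python
import math

# P(no non-reference read at a heterozygous site) under Poisson(lam) depth:
# sum_d Pois(d; lam) * (1/2)^d = exp(-lam/2)   (Poisson thinning identity)
for lam in (2, 4, 6):
    print(f"mean depth {lam}x: P(het allele unseen) = {math.exp(-lam/2):.2f}")
# -> 0.37 at 2x, 0.14 at 4x, 0.05 at 6x
```

At the low end of the 2–6 fold range, over a third of heterozygous sites are expected to show no non-reference read at all in a given sample, which is why single-sample calling followed by merging performs poorly.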
To reuse the rich methods developed for genotyping data, the
1000 Genomes Project proposes to impute genotypes utilizing LD
across loci (Browning and Yu, 2009; Howie et al., 2009; Li et al.,
2009b, 2010a). Suppose that at a site A, one sample has low coverage.
If some samples at A have high coverage and there exists a site B
that is linked with A and has sufficient sequence support, we can
transfer information across sites and between individuals, and thus
make a reliable inference for the low-coverage sample at A. The
overall genotype accuracy can be greatly improved.
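To make this information transfer concrete, the following is a minimal Bayesian sketch (ours, not any of the cited imputation algorithms; the sites, haplotype frequencies and error rates are invented for illustration): given haplotype frequencies over two linked sites and a confident genotype at B, a single ambiguous read at A can be resolved.

```python
import itertools

# Toy illustration of LD-based information transfer.  Two linked biallelic
# sites A and B; alleles coded 0 = reference, 1 = alternative.  Haplotype
# frequencies over (A, B), e.g. estimated from well-covered samples:
hap_freq = {(0, 0): 0.55, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.35}

def posterior_at_A(gB, likeA):
    """P(genotype at A | genotype gB known confidently at B, read data at A).

    gB    -- alt-allele count at B (0, 1 or 2)
    likeA -- genotype likelihoods at A: {g: P(reads at A | genotype g)}
    """
    post = {0: 0.0, 1: 0.0, 2: 0.0}
    # Enumerate ordered haplotype pairs, assuming Hardy-Weinberg equilibrium.
    for h1, h2 in itertools.product(hap_freq, repeat=2):
        if h1[1] + h2[1] != gB:      # pair must be consistent with gB
            continue
        gA = h1[0] + h2[0]
        post[gA] += hap_freq[h1] * hap_freq[h2] * likeA[gA]
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

# Low-coverage sample: one reference read at A (1% error), nearly
# uninformative on its own; B is confidently homozygous alternative.
print(posterior_at_A(gB=2, likeA={0: 0.99, 1: 0.50, 2: 0.01}))
# -> heterozygote favoured (~0.83): LD with B pulls mass toward the alt allele
```

The cited algorithms apply the same principle across many sites and samples with far more sophisticated haplotype models, but the underlying idea of borrowing information through LD is the same.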
However, imputation is not without potential concerns. First,
imputation cannot be used to infer the regional allele frequency
spectrum (AFS) because imputation as of now can only be applied
to candidate variant sites, while we need to consider non-variants
to infer AFS. Second, the effectiveness of imputation depends on
the pattern of LD, which may lead to potential bias in population
genetical inferences. Third, the current imputation algorithms are
slow. For a thousand samples, the fastest algorithm may be slower
than read mapping, which is frequently the bottleneck of NGS data
analysis (H.M. Kang, personal communication).
Considering more samples and using more accurate algorithms will
make imputation even slower.
These potential concerns make us reconsider whether imputation is
always preferred. We notice that we perform imputation mainly
to reuse the methods developed for genotyping data; would it be
possible instead to derive new methods that solve classical medical
and population genetical problems without precise genotypes?
Another application of NGS that requires genotype data is to
discover somatic mutations or germline mutations between a few
related samples (Conrad et al., 2011; Ley et al., 2008; Mardis et al.,
2009; Pleasance et al., 2010a, b; Roach et al., 2010; Shah et al.,
2009). For such an application, samples are often sequenced to
high coverage. Although it is not hard to achieve an error rate
one per 100 000 bases (Bentley et al., 2008), mutations occur at
a much lower rate, typically of the order of 10^-6 or even 10^-7.
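To see the scale of the problem, consider a rough calculation (the $3\times10^{9}$ bp genome size is our assumption, not stated above):

$$3\times 10^{9}\times 10^{-5} = 3\times 10^{4}\ \text{expected errors}
\quad\text{vs}\quad
3\times 10^{9}\times 10^{-6} = 3\times 10^{3}\ \text{true mutations},$$

i.e. even an error rate of one per 100 000 bases yields roughly ten false calls per true mutation, and a hundred if the mutation rate is of the order of 10^-7.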
Naively calling genotypes and then comparing samples frequently
would not work well (Ajay et al., 2011), because subtle uncertainty