BIOINFORMATICS ORIGINAL PAPER
Vol. 27 no. 21 2011, pages 2987–2993
doi:10.1093/bioinformatics/btr509
Sequence analysis
Advance Access publication September 8, 2011
A statistical framework for SNP calling, mutation discovery,
association mapping and population genetical parameter
estimation from sequencing data
Heng Li
Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA
Associate Editor: Jeffrey Barrett
ABSTRACT
Motivation: Most existing methods for DNA sequence analysis rely
on accurate sequences or genotypes. However, in applications of
next-generation sequencing (NGS), accurate genotypes may not be
easily obtained (e.g. multi-sample low-coverage sequencing or
somatic mutation discovery). These applications call for new
methods for analyzing sequence data with uncertainty.
Results: We present a statistical framework for calling SNPs,
discovering somatic mutations, inferring population genetical
parameters and performing association tests directly based on
sequencing data without explicit genotyping or linkage-based
imputation. On real data, we demonstrate that our method achieves
comparable accuracy to alternative methods for estimating site allele
count, for inferring allele frequency spectrum and for association
mapping. We also highlight the necessity of using symmetric
datasets for finding somatic mutations and confirm that for
discovering rare events, mismapping is frequently the leading source
of errors.
Availability: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org
Received on July 20, 2011; revised on August 30, 2011; accepted
on September 1, 2011
1 INTRODUCTION
The 1000 Genomes Project (1000 Genomes Project Consortium,
2010) sets an excellent example of how to design a sequencing
project to get the maximum output pertinent to human populations.
An important lesson from this project is to sequence many human
samples at relatively low coverage instead of a few samples
at high coverage. We adopt this strategy because with higher
coverage, we will mostly reconfirm information from other reads,
whereas with more samples, we will reduce sampling fluctuations,
gain power on variants present in multiple samples and gain access
to many more rare variants. On the other hand, sequencing
errors counteract the power in variant calling, which necessitates a
minimum coverage. The optimal balancing point is broadly regarded
to be in the 2–6 fold range per sample (Le and Durbin, 2010; Li
et al., 2011), depending on the sequencing error rate, level of linkage
disequilibrium (LD) and the purpose of the project.
A major concern with this design is that at 2–6 fold coverage
per sample, non-reference alleles may not always be covered by
sequence reads, especially at heterozygous loci. Calling variants
from each individual and then combining the calls usually yield poor
results. The preferred strategy is to enhance the power of variant
discovery by jointly considering all samples (DePristo et al., 2011;
Le and Durbin, 2010; Li et al., 2011; Nielsen et al., 2011). This
strategy largely solves the variant discovery problem, but acquiring
accurate genotypes for each individual remains unsolved. Without
accurate genotypes, most of the previous methods [e.g. testing
Hardy–Weinberg equilibrium (HWE) and association mapping]
would not work.
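To quantify the first point above (that at 2–6 fold coverage non-reference alleles may go unseen), here is a back-of-envelope sketch under our own assumptions (the Poisson depth model and example depths are illustrative, not from the text): at a heterozygous site each read carries the non-reference allele with probability 1/2, so with depth D ~ Poisson(lam) the chance that no read shows it is sum_d P(D=d)(1/2)^d = exp(-lam/2).

```python
import math

# P(no non-reference read at a heterozygous site) under Poisson(lam) depth:
# sum_d Pois(d; lam) * (1/2)^d = exp(-lam/2)   (Poisson thinning identity)
for lam in (2, 4, 6):
    print(f"mean depth {lam}x: P(het allele unseen) = {math.exp(-lam/2):.2f}")
# -> 0.37 at 2x, 0.14 at 4x, 0.05 at 6x
```

At the low end of the 2–6 fold range, over a third of heterozygous sites are expected to show no non-reference read at all in a given sample, which is why single-sample calling followed by merging performs poorly.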
To reuse the rich methods developed for genotyping data, the
1000 Genomes Project proposes to impute genotypes utilizing LD
across loci (Browning and Yu, 2009; Howie et al., 2009; Li et al.,
2009b, 2010a). Suppose that at a site A, one sample has low coverage.
If some samples at A have high coverage and there exists a site B
that is linked with A and has sufficient sequence support, we can
transfer information across sites and between individuals, and thus
make a reliable inference for the low-coverage sample at A. The
overall genotype accuracy can be greatly improved.
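To make this information transfer concrete, the following is a minimal Bayesian sketch (ours, not any of the cited imputation algorithms; the sites, haplotype frequencies and error rates are invented for illustration): given haplotype frequencies over two linked sites and a confident genotype at B, a single ambiguous read at A can be resolved.

```python
import itertools

# Toy illustration of LD-based information transfer.  Two linked biallelic
# sites A and B; alleles coded 0 = reference, 1 = alternative.  Haplotype
# frequencies over (A, B), e.g. estimated from well-covered samples:
hap_freq = {(0, 0): 0.55, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.35}

def posterior_at_A(gB, likeA):
    """P(genotype at A | genotype gB known confidently at B, read data at A).

    gB    -- alt-allele count at B (0, 1 or 2)
    likeA -- genotype likelihoods at A: {g: P(reads at A | genotype g)}
    """
    post = {0: 0.0, 1: 0.0, 2: 0.0}
    # Enumerate ordered haplotype pairs, assuming Hardy-Weinberg equilibrium.
    for h1, h2 in itertools.product(hap_freq, repeat=2):
        if h1[1] + h2[1] != gB:      # pair must be consistent with gB
            continue
        gA = h1[0] + h2[0]
        post[gA] += hap_freq[h1] * hap_freq[h2] * likeA[gA]
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

# Low-coverage sample: one reference read at A (1% error), nearly
# uninformative on its own; B is confidently homozygous alternative.
print(posterior_at_A(gB=2, likeA={0: 0.99, 1: 0.50, 2: 0.01}))
# -> heterozygote favoured (~0.83): LD with B pulls mass toward the alt allele
```

The cited algorithms apply the same principle across many sites and samples with far more sophisticated haplotype models, but the underlying idea of borrowing information through LD is the same.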
However, imputation is not without potential concerns. First,
imputation cannot be used to infer the regional allele frequency
spectrum (AFS) because imputation as of now can only be applied
to candidate variant sites, while we need to consider non-variants
to infer AFS. Second, the effectiveness of imputation depends on
the pattern of LD, which may lead to potential bias in population
genetical inferences. Third, the current imputation algorithms are
slow. For a thousand samples, the fastest algorithm may be slower
than read mapping, which is frequently the bottleneck of NGS data
analysis (H.M. Kang, personal communication).
Considering more samples and using more accurate algorithms will
make imputation even slower.
These potential concerns make us reconsider whether imputation is
always preferred. We notice that we perform imputation mainly
to reuse the methods developed for genotyping data; would it be
possible instead to derive new methods that solve classical medical
and population genetical problems without precise genotypes?
Another application of NGS that requires genotype data is to
discover somatic mutations or germline mutations between a few
related samples (Conrad et al., 2011; Ley et al., 2008; Mardis et al.,
2009; Pleasance et al., 2010a, b; Roach et al., 2010; Shah et al.,
2009). For such an application, samples are often sequenced to
high coverage. Although it is not hard to achieve an error rate
one per 100 000 bases (Bentley et al., 2008), mutations occur at
a much lower rate, typically of the order of 10^-6 or even 10^-7.
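To see the scale of the problem, consider a rough calculation (the $3\times10^{9}$ bp genome size is our assumption, not stated above):

$$3\times 10^{9}\times 10^{-5} = 3\times 10^{4}\ \text{expected errors}
\quad\text{vs}\quad
3\times 10^{9}\times 10^{-6} = 3\times 10^{3}\ \text{true mutations},$$

i.e. even an error rate of one per 100 000 bases yields roughly ten false calls per true mutation, and a hundred if the mutation rate is of the order of 10^-7.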
Naively calling genotypes and then comparing samples frequently
would not work well (Ajay et al., 2011), because subtle uncertainty