CIGAR field. Each candidate variant is marked with a
quality metric, “qual”, which is set to 1 or 0 according to whether
the candidate variant is in dbSNP. Then, a decision-tree
model is trained using the feature vectors of candidate
variants as the training set. After the model is trained,
candidate variants with similar feature values are
grouped into the same leaf node and treated as a unit.
For all the candidates in a leaf, if their average qual is
higher than the threshold, they are called out; otherwise,
they are identified as false positives. Finally, a simple and
effective genotyper is applied.
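To make the leaf-level decision concrete, the following is a minimal sketch assuming scikit-learn's CART implementation; the feature matrix, hyperparameters and threshold are illustrative stand-ins, not Fuwa's actual settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def call_variants(X, qual, threshold=0.5):
    """Train a CART model on candidate feature vectors, then accept or
    reject candidates leaf by leaf according to the mean dbSNP label."""
    qual = np.asarray(qual)
    tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=50)  # illustrative settings
    tree.fit(X, qual)

    leaf_ids = tree.apply(X)                  # leaf index of every candidate
    calls = np.zeros(len(X), dtype=bool)
    for leaf in np.unique(leaf_ids):
        in_leaf = leaf_ids == leaf
        if qual[in_leaf].mean() > threshold:  # average qual of the leaf
            calls[in_leaf] = True             # the whole leaf is called out
        # otherwise every candidate in the leaf is treated as a false positive
    return calls
```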
Generating and labelling candidate variants
Fuwa walks through the whole-genome sequence, gener-
ating candidate variants at each locus. Designed for high
sensitivity, Fuwa considers all 6 possible candidate
variants (i.e., A, T, G, C, insertion, deletion), and excludes
only those that account for too low a proportion of the read
depth at their loci. Feature values of these candidates are
also calculated. At the same time, the programme
searches dbSNP and labels each candidate with dbSNP
quality, or “qual” in short. Qual is set to 1 if the candi-
date exists in dbSNP and 0 if not. To improve search
speed, Fuwa preloads dbSNP into RAM and transforms
it into a hash table so that any search can be finished
in constant time. After this step, all candidate variants
are obtained and labelled.
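A minimal sketch of this labelling step is given below, assuming dbSNP has already been parsed into records with chrom, pos and alt attributes; the record and candidate types are hypothetical stand-ins, not Fuwa's data structures:

```python
def build_dbsnp_index(dbsnp_records):
    """Preload dbSNP into an in-memory hash set so each lookup is O(1) on average."""
    return {(rec.chrom, rec.pos, rec.alt) for rec in dbsnp_records}

def label_candidates(candidates, dbsnp_index):
    """Set qual to 1 if the candidate exists in dbSNP, 0 otherwise."""
    for cand in candidates:
        cand.qual = 1 if (cand.chrom, cand.pos, cand.alt) in dbsnp_index else 0
    return candidates
```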
To date, most common human variants have already
been catalogued in dbSNP. The high coverage rate of
SNVs and short indels qualifies dbSNP as a powerful
benchmark in alignment result recalibration [7] and final
call set quality assessment [5, 7, 11] as well as in training
machine learning models.
Decision tree and feature selection
Classification and regression tree (CART) [12] is a
widely used decision-tree training algorithm that can
be applied to either classification or regression problems.
It assumes the decision tree is binary: each non-leaf node
tests a Boolean expression, and input samples are routed
into two branches, the left branch if the expression is
true and the right branch otherwise. We chose CART
because it is simple and fast, and the decision procedure
can be easily understood.
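As a toy illustration of these binary splits (assuming scikit-learn's CART implementation; the feature values and names below are placeholders rather than Fuwa's real features):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[30, 0.9], [28, 0.1], [5, 0.8], [40, 0.05]]  # e.g. [EBD, EBD ratio]
y = [1, 0, 0, 0]                                   # dbSNP labels (qual)

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Each non-leaf node tests one Boolean expression (feature <= threshold):
# samples for which it is true go to the left branch, the rest go right.
print(export_text(clf, feature_names=["EBD", "EBD_ratio"]))
```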
Twelve features were selected to train the CART model;
they are divided into four categories, as follows.
Category I. Read depth
Features in this category measure the absolute depth
and the depth ratio of reads that are “effective” for a
specific candidate variant. “Effective” means that the read
shares the same base as the candidate variant at the
candidate’s locus.
Feature 1: effective base depth Effective Base Depth
(EBD) is the sum of the depths of the effective reads,
where each indel read contributes its mapping quality
and each SNV read contributes its mapping quality
multiplied by its base quality.
Feature 2: effective base depth ratio The EBD ratio is
the EBD of one candidate variant divided by the sum
of the EBDs of all candidate variants at that locus. If this
indicator is very low, the candidate variant tends
to be a random error.
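A minimal sketch of Features 1 and 2 is shown below, assuming each effective read exposes mapping_quality and base_quality attributes; the attribute and function names are illustrative rather than Fuwa's internals:

```python
def effective_base_depth(effective_reads, is_indel):
    """Feature 1 (EBD): indel reads contribute their mapping quality;
    SNV reads contribute mapping quality multiplied by base quality."""
    if is_indel:
        return sum(r.mapping_quality for r in effective_reads)
    return sum(r.mapping_quality * r.base_quality for r in effective_reads)

def ebd_ratio(candidate_ebd, all_ebds_at_locus):
    """Feature 2: EBD of one candidate divided by the sum of the EBDs of
    all candidate variants at the same locus."""
    total = sum(all_ebds_at_locus)
    return candidate_ebd / total if total > 0 else 0.0
```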
Feature 3: DeltaL DeltaL is a statistic describing the
difference between optimal and suboptimal genotypes.
Fuwa first hypothesizes that the variant is true, so the
reads covering this locus obey an almost ideal variant
model: 0/1 or 1/1. The logarithms of likelihood under
these two ideal models are calculated separately, and the
bigger one is selected as L1. Then, Fuwa calculates the
second likelihood logarithm, L2, under another hypothesis
that the variant is false and that reads covering this
locus follow the binomial distribution model. Thus, L1 - L2,
or DeltaL, is the logarithm of the ratio of the first
and second likelihoods. If DeltaL is close to 0, which
means the likelihoods of the ideal model and the binomial
model are nearly equal, we empirically judged the
variant to be false positive; otherwise, the variant tends
to be true.
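Under a simple binomial formulation (an assumption for illustration; the exact likelihoods and error rate used by Fuwa are not reproduced here), DeltaL can be sketched as follows, where k is the number of effective reads and n the total depth at the locus:

```python
from scipy.stats import binom

def delta_l(k, n, err=0.01):
    """Log-likelihood ratio between the best ideal variant model (0/1 or 1/1)
    and the false-variant model in which the effective reads are errors."""
    l_het = binom.logpmf(k, n, 0.5)        # ideal heterozygous (0/1) model
    l_hom = binom.logpmf(k, n, 1.0 - err)  # ideal homozygous (1/1) model
    l1 = max(l_het, l_hom)                 # likelihood logarithm of the better ideal model
    l2 = binom.logpmf(k, n, err)           # false variant: effective reads are sequencing errors
    return l1 - l2                         # close to 0 suggests a false positive
```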
Category II. Base quality
This category focuses on the accuracy of a base
sequenced by the sequencing machine, which has
considerable impact on variant calling.
Feature 4: Sum of Base Quality (SumBQ) This feature
is the sum of the base qualities of the effective reads for
one candidate variant. For indel reads, the base quality is
set to 30 empirically.
Feature 5: Average Base Quality (AveBQ) By dividing
SumBQ by the number of effective reads, we obtain
the average base quality.
Feature 6: Variance of Position (VarPos) Here, “pos-
ition” means the offset of the pile-up site from the 3′
end of a read. We use this statistic because, in general,
sequencing quality declines towards the end of a
read; thus, candidate variants that are close to the 3′
end are more likely to be sequencing errors.
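A minimal sketch of Features 4-6 for one candidate is given below, assuming each effective read exposes is_indel, base_quality and the offset of the pile-up site from its 3′ end (dist_from_3prime); the attribute names are illustrative:

```python
def base_quality_features(effective_reads):
    """Compute SumBQ, AveBQ and VarPos for the effective reads of one candidate."""
    bqs = [30 if r.is_indel else r.base_quality for r in effective_reads]  # indel reads: 30 empirically
    sum_bq = sum(bqs)                                   # Feature 4: SumBQ
    ave_bq = sum_bq / len(bqs) if bqs else 0.0          # Feature 5: AveBQ
    pos = [r.dist_from_3prime for r in effective_reads] # offsets from the 3' end
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    var_pos = sum((p - mean_pos) ** 2 for p in pos) / len(pos) if pos else 0.0  # Feature 6: VarPos
    return sum_bq, ave_bq, var_pos
```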