SeedsGraph: A New Method for Efficient Assembly of Next-Generation Sequencing Data

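"SeedsGraph is an efficient assembler for next-generation sequencing data, designed to process large volumes of short reads and build genome assemblies. The program clusters the short reads in a cloud computing framework and groups fragments by the similarity of their original consensus long sequences. Each cluster is condensed into a chain of seeds, that is, a string made up of substrings with reads aligned to them, and a graph is built accordingly. Finally, the graph is analyzed to find Euler paths, the reads on those paths are assembled into contiguous sequences (contigs), and the contigs are laid out into scaffolds using mate-pair information. Experimental results show that the SeedsGraph algorithm is efficient and feasible on large read sets, and is particularly suited to next-generation sequencing data."

SeedsGraph is a genome assembly tool designed specifically for next-generation sequencing (NGS) data. The rapid development of DNA sequencing technology has produced massive volumes of short reads, which poses new challenges for genome assembly algorithms. SeedsGraph addresses these challenges in four steps.

First, it uses a cloud computing framework, which allows large amounts of data to be processed in parallel in a distributed environment. In this stage the short reads are clustered, grouped by their similarity with respect to the original consensus long sequences. This reduces the complexity of the assembly while preserving the sequence information.

Next, each cluster is condensed into a "chain of seeds". A chain of seeds is a string made up of aligned short-read substrings, which represent the information shared among the original reads. This condensation reduces storage requirements while retaining enough information for the subsequent graph construction.

SeedsGraph then builds a graph from the chains of seeds, in which each node represents a seed and each edge represents a connection between seeds. This graph is the core of the assembly process, because it captures the relationships among the sequences.

Finally, the structure of the graph is analyzed to find Euler paths. An Euler path traverses every edge of the graph exactly once, which in genome assembly means that all related reads can be joined into contiguous sequences (contigs). Once these paths are found, SeedsGraph uses mate-pair information to lay out the contigs and build scaffolds, that is, more complete genome structures.

The experiments indicate that this approach is efficient and practical for large read sets, particularly NGS data. Efficient assembly algorithms of this kind matter for genomics research, disease diagnosis, and personalized medicine, because these fields depend on resolving genome sequences quickly and accurately.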

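As a rough illustration of the clustering step described above, the sketch below groups reads that share at least one k-mer, using a union-find structure. It is a stand-in and not SeedsGraph's actual procedure: the shared-k-mer criterion, the function names `kmers` and `cluster_reads`, and the parameter `k` are all assumptions made for this example, and it runs on a single machine rather than in a cloud computing framework.

```python
def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=4):
    """Group reads that are linked, directly or transitively, by a shared k-mer.

    Assumption for illustration only: sharing one k-mer counts as 'similar'.
    """
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so the group order is stable.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                union(idx, first_seen[km])
            else:
                first_seen[km] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "TTTTCC", "TTCCAA"]
    for group in cluster_reads(toy_reads, k=4):
        print(group)
```

In this toy input the first two reads share the 4-mer GTAC and the last two share TTCC, so they fall into two groups. A distributed version would need to partition the k-mer index across workers, which this sketch does not attempt.
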
RESEARCH Open Access

SeedsGraph: an efficient assembler for next-generation sequencing data

Chunyu Wang, Maozu Guo, Xiaoyan Liu, Yang Liu, Quan Zou

From The 4th Translational Bioinformatics Conference and the 8th International Conference on Systems Biology (TBC/ISB 2014)

Qingdao, China. 24-27 October 2014

Abstract

DNA sequencing technology has been evolving rapidly and produces a fast-growing number of short reads. This has led to a resurgence of research in whole genome shotgun assembly algorithms. We start the assembly algorithm by clustering the short reads in a cloud computing framework; the clustering process groups fragments according to the similarity of their original consensus long sequences. We condense each group of reads into a chain of seeds, where a seed is a kind of substring with reads aligned, and then build a graph accordingly. Finally, we analyze the graph to find Euler paths, assemble the reads related in the paths into contigs, and then lay out the contigs with mate-pair information to form scaffolds. The results show that our algorithm is efficient and feasible for large sets of reads such as those produced by next-generation sequencing technology.
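To make the Euler-path step concrete, here is a minimal sketch of Hierholzer's algorithm on a small directed graph. It is an illustration under assumptions and not the SeedsGraph implementation: the nodes are plain 2-mers standing in for seeds, and the function name `eulerian_path` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a node list that uses every directed edge exactly once, or None."""
    if not edges:
        return []
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for u, v in edges:
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)

    # Degree condition: at most one node with one extra outgoing edge (start),
    # at most one with one extra incoming edge (end), all others balanced.
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    balanced = all(abs(out_deg[n] - in_deg[n]) <= 1 for n in nodes)
    if not balanced or len(starts) > 1 or len(starts) != len(ends):
        return None

    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the walk
    path.reverse()

    # If some edges were never reached, the graph is not connected.
    if len(path) != len(edges) + 1:
        return None
    return path


if __name__ == "__main__":
    # Toy graph: each edge is a 3-mer linking its 2-mer prefix to its suffix.
    toy_edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
                 ("CG", "GT"), ("GT", "TG"), ("TG", "GC"), ("GC", "CA")]
    walk = eulerian_path(toy_edges)
    # Spell the contig: first node plus the last letter of each later node.
    contig = walk[0] + "".join(node[-1] for node in walk[1:])
    print(walk)
    print(contig)
```

This toy graph admits more than one Euler path, and which one is returned depends on the order of the edges. An assembler therefore needs additional evidence to choose among such paths; the sketch makes no attempt to do so.
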

Introduction

The introduction of massively parallel next-generation sequencing (NGS) technologies has greatly increased the number of reads typically generated by experiments. At the same time, the shorter read length of NGS and the sheer demand for more scalable assemblers pose an important computational challenge, and genome assembly continues to represent one of the most difficult and important algorithmic problems in bioinformatics. Software technology and algorithm implementation become critical factors when dealing with terabytes of data. Cloud computing, as a new way of dealing with extremely large datasets, offers a good opportunity for bioinformatics data processing. Its capability and feasibility for the underlying applications have been discussed [1,2].

We design a graph-based method for the NGS read assembly problem and implement it as a software package, SeedsGraph. The Background section discusses the NGS read assembly problem and the framework for cloud computing. The Algorithm section presents the definition of seeds and the related algorithms. The Result section presents the results of the experiments. Finally, the Discussion and future work section discusses the assembly and the results.

Background

Genetic information of living organisms is stored in a chain of DNA molecules. There are four possible small molecules (also called nucleotides or bases): adenine (A), cytosine (C), guanine (G) and thymine (T). With the four-letter alphabet {A, T, G, C} we can represent the entire genetic information in strings. A DNA molecule is denoted as a long string over this alphabet; for sequencing, it is duplicated and broken randomly into fragments, a process called shotgun sequencing. The whole genome shotgun (WGS) de novo assembly problem is the reconstruction of the genetic sequence information from a set of reads sequenced from the fragments. The shotgun process takes reads from random positions along a target molecule [3]. WGS de novo assembly refers to reconstruction in its pure form, without consulting previously resolved sequences. For NGS data, this is a specialized problem because of the short length of the reads and the large volume of data.
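The shotgun process described above, in which reads are taken from random positions along a target string over {A, C, G, T}, can be mimicked in a few lines. This is a simplified simulation under stated assumptions: fixed read length, uniformly random start positions, a single strand, and no sequencing errors. The names `shotgun_reads`, `read_len`, `n_reads`, and `seed` are hypothetical.

```python
import random


def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at random start positions.

    Returns (start, read) pairs; a real sequencer does not report the start,
    which is exactly what de novo assembly has to recover.
    """
    if read_len <= 0 or read_len > len(genome):
        raise ValueError("read_len must be between 1 and len(genome)")
    rng = random.Random(seed)  # seeded so the example is reproducible
    max_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, max_start)  # inclusive on both ends
        reads.append((start, genome[start:start + read_len]))
    return reads


if __name__ == "__main__":
    # A random 200-base target string over the four-letter alphabet.
    genome = "".join(random.Random(1).choice("ACGT") for _ in range(200))
    reads = shotgun_reads(genome, read_len=25, n_reads=40)
    # Average coverage: total sequenced bases divided by target length.
    coverage = len(reads) * 25 / len(genome)
    print(f"{len(reads)} reads, average coverage {coverage:.1f}x")
```

Shorter reads carry less positional context each, so more of them are needed for the same coverage. This is one way to see why short read length and large data volume go together in the NGS setting.
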

* Correspondence: chunyu@hit.edu.cn; maozuguo@hit.edu.cn

School of Computer Science and Technology, Harbin Institute of Technology, No.92 West Dazhi Street, Nangang District, Harbin 150001, China

Full list of author information is available at the end of the article

Wang et al. BMC Medical Genomics 2015, 8(Suppl 2):S13

http://www.biomedcentral.com/1755-8794/8/S2/S13

Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
