RESEARCH Open Access
SeedsGraph: an efficient assembler for next-
generation sequencing data
Chunyu Wang
1*
, Maozu Guo
1*
, Xiaoyan Liu
1
, Yang Liu
1
, Quan Zou
2
From The 4th Translational Bioinformatics Conference and the 8th International Conference on Systems
Biology (TBC/ISB 2014)
Qingdao, China. 24-27 October 2014
Abstract
DNA sequencing technology has been rapidly evolving, and produces a large number of short reads with a fast
rising tendency. This has led to a resurgence of research in whole genome shotgun assembly algorithms. We start
the assembly algorithm by clustering the short reads in a cloud computing framework, and the clustering process
groups fragments according to their original consensus long- sequence similarity. We condense each group of
reads to a chain of seeds, which is a kind of substring with reads aligned, and then build a graph accordingly.
Finally, we analyze the graph to find Euler paths, and assemble the reads related in the paths into contigs, and
then lay out contigs with mate-pair information for scaffolds. The result shows that our algorithm is efficient and
feasible for a large set of reads such as in next-generation sequencing technology.
Introduction
The introduction of the massively parallel next-generation
sequencing (NGS) technologies has caused a great increase
in the number of reads typically generated by experiments.
At the same time, the shorter read length from NGS and
the sheer demand for more scalable assemblers have been
an important computational challenge, and the genome
assembly cont inues to represent one of the most difficult
and important algorithmic problems in b ioinformatics.
Software technology and algorithm implementation
become critical factors when dealing with terabytes of
data. Cloud computing as a brand new way of dealing with
an extremely large dataset offers a good chance for bioin-
formatics data processing. The abil ity and feasibility for
underlying applications have been discussed [1,2].
We design a graph-based method for the NGS reads
assembly pro blem and im plement it as a software pack-
age, SeedsGraph. In the Background section, the NGS
reads assembly problem and the framework for cloud
computing are discussed. The Algorithm section presents
the seeds definition and the related algorithms. The
result of the experiments is presented in the Result sec-
tion. Then , finally, there is a discussion about the assem-
bly and results in Discussion and future work.
Background
Genetic information of living organisms is stored in a
chain of DNA molecules. There a re four possible small
molecules (also called nucleotides or bases): adenine
(A), cytosine (C), guanine (G) and thymine ( T). With
the four-letter alphabet {A, T, G, C } we can represent
the entire genetic information in strings. DNA mole-
cules are deno ted as a long string from the alphabet,
duplicated and broken into fragments randomly for
sequencing, which is also called shotgun sequencing.
The whole genome sho tgun (WGS) de novo assembly
problem is t he reconstr uction of the genetic sequence
information from a set of reads sequenced from the
fragments. The shotgun process takes reads from ran-
dom positions along a target molecule [3]. The WGS de
novo assembly refers to the reconstruction in its pure
form, without consultation to previously resolved
sequence. For NGS data, this is a specialized problem
due to the short length of reads and t he large volumes
of NGS data.
* Correspondence: chunyu@hit.edu.cn; maozuguo@hit.edu.cn
1
School of Computer Science and Technology, Harbin Institute of
Technology, No.92 West Dazhi Street, Nangang District, Harbin 150001,
China
Full list of author information is available at the end of the article
Wang et al. BMC Medical Genomics 2015, 8(Suppl 2):S13
http://www.biomedcentral.com/1755-8794/8/S2/S13
© 2015 Wang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (htt p://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original w ork is properly cited. The Creative Commons Public Domain Dedication waiver (http://
creativecommons.org/pu blicdomain/zero/1.0/) applies to the data made available in this article, u nless otherwise stated.