Lecture Note Series on Computational and Systems Engineering (LCSE)
International Conference on Computers and Systems Engineering and Applications
Pages: 33-37
A Repeat Finder Algorithm Based On Next
Generation Sequencing Data
Shuaibin Lian
School of Information Science and Technology
Sun Yat-sen University
Guangzhou, China
Shuai_lian@qq.com
Xianhua Dai
School of Information Science and Technology
Sun Yat-sen University
Guangzhou, China
issdxh@mail.sysu.edu.cn
Abstract—Repetitive sequences of varying lengths are very
common in almost all eukaryotic genomes. Most of them are
presumed to have important regulatory functions, and they
can cause genomic instability. The identification of these
repetitive sequences is therefore a critical part of further
genome analysis. Unfortunately, existing repeat detection
methods are all based on a reference genome, or on a
combination of a reference genome and a repeat database. To
date, there is no repeat detection method specifically designed
for assembling repetitive content directly from NGS data. To
overcome this problem, this paper proposes a new repeat finder,
named NGSRepeatFinder, which is specifically designed for
assembling repeats directly from NGS data.
NGSRepeatFinder has two main functions: 1) detecting
repeats; 2) estimating copy numbers. It detects repeats by
assembling the repetitive content and estimates their copy
numbers at high coverage depth. The performance of
NGSRepeatFinder is evaluated on simulated datasets and real
reference datasets. Results show that the accuracies of
assembling repeats and estimating copy numbers are as high as
99% and 100%, respectively, under high coverage.
Keywords—
Interspersed Repeats; Tandem Repeats; Next Generation
Sequencing (NGS); Repeat Finder
I. INTRODUCTION
The genomes of all eukaryotes contain repetitive elements of
varying lengths that can occupy a significant fraction of the
total DNA content [1]. For example, ~20% of the Caenorhabditis
elegans and Caenorhabditis briggsae genomes [2] and ~50% of the
human genome [3] have been identified as repetitive DNA.
Repetitive DNA sequences are presumed to be important in a
number of regulatory functions [4] and are one of the
principal causes of genomic instability. Recent molecular
evidence also suggests that some repeat elements may be
instrumental in the generation of new genes [5]. Regardless, a
comprehensive understanding of gene and genome function in
eukaryotes will require knowledge of repeat sequences,
because eukaryotic genes evolve and function within the
context of a chromosomal milieu composed primarily of
repetitive DNA. There are two major groups of repeats in
eukaryotic genomes: tandem repeats and interspersed repeats
[6]. Tandem repeats are grouped into three major subclasses:
satellites, minisatellites and microsatellites. Likewise,
interspersed repeats can be grouped into five types:
Short Interspersed Nuclear Elements (SINEs), Long
Interspersed Nuclear Elements (LINEs), Long Terminal
Repeats (LTRs), DNA transposons and others. Furthermore,
these repeats pose a great challenge to repeat finders and
genome assemblers [7]. Repeat identification is therefore a
critical part of the analysis of a newly sequenced genome.
Next Generation Sequencing (NGS) technologies provide an
opportunity to find these repeats directly from NGS data
without the help of reference genomes and repeat databases.
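The intuition behind reference-free repeat detection can be illustrated with a small Python sketch. It is not the NGSRepeatFinder algorithm itself. It assumes error-free reads with roughly uniform coverage and assumes that most k-mers in the genome are single-copy, so that the median k-mer depth approximates the single-copy depth. The function names kmer_counts and estimate_copy_numbers are hypothetical and introduced here only for illustration. A window of width k slides over each read, and a k-mer whose depth is about c times the median depth is taken to occur about c times in the genome.

```python
from collections import Counter
from statistics import median


def kmer_counts(reads, k):
    """Slide a window of width k over every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_copy_numbers(counts):
    """Estimate copy number as k-mer depth / median depth.

    Assumption (not from the paper): the median depth is the
    depth of a single-copy k-mer.
    """
    if not counts:
        return {}
    base_depth = median(counts.values())
    return {kmer: round(depth / base_depth) for kmer, depth in counts.items()}


# Toy usage: the 2-mer "AC" occurs twice in this single read.
toy_counts = kmer_counts(["ACGTAC"], 2)
toy_copies = estimate_copy_numbers(toy_counts)
```

With real sequencing data, sequencing errors and uneven coverage blur these depth ratios, which is why high coverage matters for reliable copy number estimates.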
Over the past twenty years, genome sequencing
technologies have made great progress in many aspects, such
as speed, cost and coverage. Unfortunately, although
there are tens of genome assembly algorithms and software
packages, to our knowledge none of them is specifically
designed for assembling repeats directly from NGS data.
At the same time, existing repeat finders, such as
Tandem Repeats Finder [8], RepeatMasker [9] and RepeatScout [10],
rely on reference genomes or repeat databases rather than on
the NGS data itself. Consequently, if the
reference genome is not available, or the genome of a new
species contains structural variants, detecting
repetitive genome structures remains a challenge for genome
assemblers and repeat finders.
To overcome this problem, this paper proposes a repeat finder
algorithm named NGSRepeatFinder.
NGSRepeatFinder detects repeats by assembling the
repetitive content directly from NGS data without prior
knowledge of a reference genome. Its assembly process
combines a sliding window with coverage-depth
statistics. Finally, the
performance of NGSRepeatFinder is evaluated on simulated
datasets and real reference datasets. Results show that the
accuracies of finding repeats and estimating their copy numbers are up to
100% in ideal situations. What’s more, the effect of the special
parameters of NGSRepeatFinder, such as sequencing depth and