NGSRepeatFinder: A Novel Repeat Detection Algorithm for Next Generation Sequencing Data

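"NGSRepeatFinder is a new repeat-finding algorithm designed to work directly on Next Generation Sequencing (NGS) data, with the aim of identifying repetitive sequences in a genome and estimating their copy numbers. It addresses a shortcoming of current detection methods, which depend on a reference genome or a repeat database, and is particularly suited to high-coverage data."

In bioinformatics, repetitive sequences are an important part of genome research: they are widespread in eukaryotic genomes and may take part in biological processes such as gene regulation and genome stability. Interspersed repeats and tandem repeats are the two main types of repetitive sequence, and both play key roles in genome structure and function. Interspersed repeats are repeated sequences scattered across the genome, whereas tandem repeats are identical or highly similar sequences arranged consecutively.

Traditional repeat detection methods usually rely on a known reference genome or on alignment against a repeat database, which limits the study of unknown or heterogeneous genomes. NGS technology makes large volumes of genome sequencing data directly available, but resolving repeats from these data efficiently and accurately is a new challenge. NGSRepeatFinder addresses this problem: it processes NGS data directly, without a reference genome, which improves repeat detection in unannotated genomes.

NGSRepeatFinder has two core capabilities: repeat detection and copy number estimation. First, it identifies repeat regions by combining sequence fragments with high coverage depth, which lets it assemble repeats in highly complex genomic data. Second, by analyzing the coverage of the assembled repeats, it estimates the copy number of each repeat. In tests on simulated datasets and real reference datasets, NGSRepeatFinder reached 99% accuracy for repeat assembly and 100% accuracy for copy number estimation under high coverage.

This approach helps in understanding complex genome structure and provides a new tool for studying genomic variation, disease association and evolution. NGSRepeatFinder is expected to be used widely in genomics and epigenomics, and to be especially valuable for species that lack a reference genome or carry many unknown repeats.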

LectureNoteSeriesonComputationalandSystemsEngineering(LCSE)

InternationalConferenceonComputersandSystemsEngineeringandApplications

Pages:33‐37



A Repeat Finder Algorithm Based on Next Generation Sequencing Data

Shuaibin Lian

School of Information Science and Technology

Sun Yat-sen University

Guangzhou, China

Shuai_lian@qq.com

Xianhua Dai

School of Information Science and Technology

Sun Yat-sen University

Guangzhou, China

issdxh@mail.sysu.edu.cn

Abstract—Repetitive sequences of varying lengths are very

common in almost all eukaryotic genomes. Most of them are presumed to have important regulatory functions and can cause genomic instability, so identifying these repetitive sequences is a critical part of further analysis. Unfortunately, current repeat detection methods are all based on a reference genome or on a combination of a reference genome and a repeat database. To date, no repeat detection method has been specifically designed for assembling repetitive content directly from NGS data. To overcome this problem, this paper proposes a new repeat finder, named NGSRepeatFinder, which is specifically designed for assembling repeats directly from NGS data. NGSRepeatFinder has two important properties: 1) detecting repeats; 2) estimating copy numbers. It detects repeats by assembling the repetitive content and estimates their copy numbers at high coverage depth. The performance of NGSRepeatFinder is evaluated on simulated datasets and real reference datasets. Results show that the accuracies of repeat assembly and copy number estimation are as high as 99% and 100%, respectively, under high coverage.

Keywords—

Interspersed Repeats; Tandem Repeats; Next Generation Sequencing (NGS); Repeat Finder
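The abstract states that copy numbers are estimated at high coverage depth but does not give the formula. A minimal sketch of one common way to do this is shown below, assuming that a repeat present n times collects roughly n times the single-copy depth. The function name `estimate_copy_number`, both input lists, and the use of the median as the single-copy baseline are illustrative assumptions, not the paper's method.

```python
from statistics import median


def estimate_copy_number(repeat_depths, background_depths):
    """Estimate a repeat's copy number from coverage depth.

    Both arguments are hypothetical inputs: per-position read depths over
    the assembled repeat and over sequence assumed to be single-copy.
    """
    if not repeat_depths or not background_depths:
        raise ValueError("both depth lists must be non-empty")
    # Assumption: the median background depth approximates single-copy depth.
    baseline = median(background_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    mean_repeat_depth = sum(repeat_depths) / len(repeat_depths)
    # A repeat present n times should collect roughly n times the baseline.
    return max(1, round(mean_repeat_depth / baseline))


# Toy depths for illustration only, not data from the paper.
background = [28, 30, 31, 29, 30, 32, 30]
repeat = [118, 121, 122, 119]
print(estimate_copy_number(repeat, background))
```

The ratio becomes more stable as sequencing depth grows, which is consistent with the abstract restricting its accuracy claim to high coverage.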

I. INTRODUCTION

The genomes of all eukaryotes contain repetitive elements of varying lengths that can occupy a significant fraction of the total DNA content [1]; e.g., ~20% of the Caenorhabditis elegans and Caenorhabditis briggsae genomes [2] and ~50% of the human genome [3] have been identified as repetitive DNA. Repetitive DNA sequences are presumed to be important in a number of regulatory functions [4] and are one of the principal causes of genomic instability. Recent molecular evidence also suggests that some repeat elements may be instrumental in the generation of new genes [5]. Regardless, a comprehensive understanding of gene and genome function in eukaryotes will require knowledge of repeat sequences, because eukaryotic genes evolve and function within the context of a chromosomal milieu composed primarily of repetitive DNA. There are two major groups of repeats in eukaryotic genomes: tandem repeats and interspersed repeats [6]. Tandem repeats are grouped into three major subclasses: satellites, mini-satellites and microsatellites. Likewise, interspersed repeats can be sub-grouped into five types: Short Interspersed Nuclear Elements (SINEs), Long Interspersed Nuclear Elements (LINEs), Long Terminal Repeats (LTRs), DNA transposons and others. Furthermore, these repeats pose a great challenge to repeat finders and genome assemblers [7]. Repeat identification is therefore a critical part of the analysis of a newly sequenced genome. Next Generation Sequencing technologies provide an opportunity to find these repeats directly from NGS data, without the help of reference genomes or repeat databases.

Over the past twenty years, genome sequencing technologies have made great progress in many aspects, such as speed, cost and coverage. Unfortunately, although there are tens of genome assembly algorithms and software packages, to our knowledge none of them is specifically designed for assembling repeats directly from NGS data. At the same time, existing repeat finders, such as Tandem Repeats Finder [8], RepeatMasker [9] and RepeatScout [10], rely on reference genomes or repeat databases rather than on the NGS data itself. So if the reference genome is not available, or the genome of a new species contains structural variants, detecting repetitive genome structures remains a challenge for genome assemblers and repeat finders.

To overcome this problem, a repeat finder algorithm named NGSRepeatFinder is proposed in this paper. NGSRepeatFinder detects repeats by assembling the repetitive content directly from NGS data, without prior knowledge of a reference genome. Its assembly process is based on a combination of a sliding window and coverage-depth statistics. Finally, the performance of NGSRepeatFinder is evaluated on simulated datasets and real reference datasets. Results show that the accuracy of finding repeats and estimating their copy numbers reaches up to 100% in ideal situations. Moreover, the effect of the special parameters of NGSRepeatFinder, such as sequencing depth and
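The introduction names a sliding window combined with coverage-depth statistics as the basis of assembly, but the excerpt does not describe the procedure. The sketch below is one possible reading, swapped in for illustration: a window of width `k` slides along each read, every k-mer is counted, and k-mers seen far more often than the typical k-mer are flagged as likely repetitive. The function names, the `fold` threshold and the median baseline are all assumptions, not details from the paper.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Slide a window of width k along every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repetitive_kmers(reads, k, fold=2.0):
    """Return k-mers whose count is at least `fold` times the median count.

    Assumption: most k-mers come from single-copy sequence, so the median
    k-mer count stands in for single-copy coverage depth.
    """
    counts = count_kmers(reads, k)
    if not counts:
        return {}
    baseline = median(counts.values())
    return {kmer: n for kmer, n in counts.items() if n >= fold * baseline}


# Toy reads for illustration only, not data from the paper.
reads = ["AAAAAA", "CGTGCA", "TTGACC"]
print(repetitive_kmers(reads, k=3))
```

Counting directly on reads needs no reference genome, which matches the stated goal. A real tool would also have to handle sequencing errors and reverse-complement strands, which this sketch ignores.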
