iSeeRNA: efficient identification of long non-coding RNAs in transcriptome sequencing data using an SVM algorithm


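iSeeRNA addresses a challenging problem: identifying long intergenic non-coding RNAs (lincRNAs) in transcriptome sequencing data. lincRNAs are an emerging class of non-coding RNAs that act as potent gene regulators and play key roles in biological processes. High-throughput RNA sequencing has made it possible to discover novel transcripts by assembly, but accurately separating lincRNAs from protein-coding transcripts (PCTs) among the many assembled transcripts remains an open problem. The outcome of this work is iSeeRNA, a classifier based on a support vector machine (SVM), a machine learning algorithm widely used for pattern recognition and classification. iSeeRNA is designed to use feature selection and classification to identify lincRNAs whose sequence properties differ markedly from those of PCTs.

The authors report that iSeeRNA predicts better than other existing software. At its core it relies on the SVM classifier together with statistical analysis to pick out candidate transcripts whose genomic location, splicing pattern, transcript length and similar properties are more consistent with lincRNAs. For convenience, the team also provides a public web server intended for small datasets.

The main conclusion is that iSeeRNA combines high prediction accuracy with run times far shorter than those of similar programs, which saves time and computing resources in large-scale transcriptome analyses. Integrated into existing bioinformatics pipelines, it can serve as a useful tool for identifying and studying lincRNAs.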

RESEARCH Open Access

iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data

Kun Sun1,2, Xiaona Chen1,3, Peiyong Jiang1,2, Xiaofeng Song, Huating Wang1,3*, Hao Sun1,2*

From ISCB-Asia 2012

Shenzhen, China. 17-19 December 2012

Abstract

Background: Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and potent gene regulators. High-throughput RNA-sequencing combined with de novo assembly promises large-scale discovery of novel transcripts. However, the identification of lincRNAs from thousands of assembled transcripts is still challenging due to the difficulties of separating them from protein coding transcripts (PCTs).

Results: We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of lincRNAs. iSeeRNA shows better performance compared to other software. A publicly available web server for iSeeRNA is also provided for small datasets.

Conclusions: iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus offering a valuable tool for lincRNA study.
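To make the idea of an SVM-based classifier concrete, the following is a minimal sketch of a linear SVM trained by stochastic sub-gradient descent on the hinge loss. It is not the authors' implementation: the two feature columns (called `orf_fraction` and `conservation` here), the toy values, the label encoding (+1 for lincRNA, -1 for PCT) and all function names are assumptions made for illustration, since this excerpt does not list the features or kernel iSeeRNA uses.

```python
import random

# Toy feature vectors: (orf_fraction, conservation). Both the feature names
# and the values are hypothetical; they are NOT taken from the paper.
FEATURES = [
    (0.10, 0.20), (0.25, 0.15), (0.20, 0.30), (0.15, 0.10),  # lincRNA-like
    (0.80, 0.90), (0.75, 0.70), (0.90, 0.85), (0.70, 0.80),  # PCT-like
]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = lincRNA, -1 = PCT (assumed encoding)


def decision_value(w, b, x):
    """Signed distance-like score w.x + b of a feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(samples, labels, lam=0.01, eta=0.05, epochs=500, seed=0):
    """Minimise lam/2 * |w|^2 + mean hinge loss with per-sample sub-gradient steps."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * decision_value(w, b, x)
            # Regularisation step: shrink the weights slightly.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step: only samples inside the margin push the boundary.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b


def predict(w, b, x):
    """Return +1 (lincRNA) or -1 (PCT) for one feature vector."""
    return 1 if decision_value(w, b, x) >= 0.0 else -1


w, b = train_linear_svm(FEATURES, LABELS)
print("weights:", w, "bias:", b)
print("prediction for (0.05, 0.05):", predict(w, b, (0.05, 0.05)))
```

A linear kernel and plain Python keep the sketch self-contained; a production pipeline would more likely call an established SVM library and tune the kernel and regularisation by cross-validation.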

Background

Over the past decade, evidence from numerous high-throughput genomic platforms has revealed that even though less than 2% of the mammalian genome encodes proteins, a significant fraction can be transcribed into different complex families of non-coding RNAs (ncRNAs) [1-4]. Other than microRNAs and other families of small non-coding RNAs, long non-coding RNAs (lncRNAs, >200 nt) are emerging as potent regulators of gene expression [5]. Long intergenic non-coding RNAs (lincRNAs), a subtype of lncRNAs originally identified by Guttman et al. [6] from four mouse cell types using chromatin state maps, are discrete transcriptional units located between known protein-coding loci. Recent studies demonstrate the functional significance of lincRNAs. However, it remains a daunting task to identify all the lincRNAs present in various biological processes and systems.
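The two defining criteria above (longer than 200 nt, and lying between known protein-coding loci) can be sketched as a simple pre-filter. This is an illustration only, not a step taken from the paper: the `Transcript` and `Locus` types, the toy coordinates and the choice to ignore strand are all assumptions, and passing this filter says nothing about coding potential.

```python
from typing import List, NamedTuple


class Locus(NamedTuple):
    """A known protein-coding locus (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int


class Transcript(NamedTuple):
    """An assembled transcript; `length` is the spliced length in nt."""
    name: str
    chrom: str
    start: int
    end: int
    length: int


def overlaps(t: Transcript, g: Locus) -> bool:
    """True if the genomic spans share at least one base (strand ignored)."""
    return t.chrom == g.chrom and t.start < g.end and g.start < t.end


def intergenic_long_transcripts(transcripts: List[Transcript],
                                coding_loci: List[Locus],
                                min_length: int = 200) -> List[Transcript]:
    """Keep transcripts longer than min_length that overlap no coding locus."""
    return [
        t for t in transcripts
        if t.length > min_length
        and not any(overlaps(t, g) for g in coding_loci)
    ]


# Hypothetical toy data.
coding = [Locus("chr1", 6000, 8000), Locus("chr1", 25000, 30000)]
assembled = [
    Transcript("tx1", "chr1", 1000, 5000, 1500),    # long, intergenic
    Transcript("tx2", "chr1", 9000, 9800, 150),     # too short
    Transcript("tx3", "chr1", 20000, 26000, 900),   # overlaps a coding locus
    Transcript("tx4", "chr2", 25000, 27000, 800),   # other chromosome
]
kept = [t.name for t in intergenic_long_transcripts(assembled, coding)]
print(kept)  # ['tx1', 'tx4']
```

The pairwise overlap check is quadratic in the worst case; for genome-scale annotation an interval tree or a sorted sweep would be the usual choice.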

Whole transcriptome sequencing, known as RNA-Seq, offers the promise of rapid comprehensive discovery of novel genes and transcripts [7]. With de novo assembly software such as Cufflinks [8] and Scripture [6], a large set of novel assemblies can be obtained from RNA-Seq data. Several programs have been used to facilitate the cataloging of lincRNAs from RNA-Seq assemblies. For example, Li et al. [9] used the Codon Substitution Frequency (CSF) score [10] to identify lincRNAs from de novo assembled transcripts in chicken skeletal muscle. Pauli et al. [11] took advantage of the PhyloCSF score [12] followed by other filtering steps to identify lincRNAs expressed during zebrafish embryogenesis. Cabili et al. [13] also used the PhyloCSF program to eliminate de novo assembled transcripts with positive coding potential and identified ~8200 lincRNA loci in 24 human tissues. However, the extremely high computational time demanded by PhyloCSF may become the bottleneck for handling millions of assemblies generated from high-throughput sequencing. Furthermore,

* Correspondence: xfsong@nuaa.edu.cn; huating.wang@cuhk.edu.hk; haosun@cuhk.edu.hk

Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China

Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Full list of author information is available at the end of the article

Sun et al. BMC Genomics 2013, 14(Suppl 2):S7

http://www.biomedcentral.com/1471-2164/14/S2/S7

Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
