RESEARCH Open Access
iSeeRNA: identification of long intergenic
non-coding RNA transcripts from
transcriptome sequencing data
Kun Sun1,2, Xiaona Chen1,3, Peiyong Jiang1,2, Xiaofeng Song4*, Huating Wang1,3*, Hao Sun1,2*
From ISCB-Asia 2012
Shenzhen, China. 17-19 December 2012
Abstract
Background: Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and
potent gene regulators. High-throughput RNA-sequencing combined with de novo assembly promises large-scale
discovery of novel transcripts. However, the identification of lincRNAs from thousands of assembled transcripts is
still challenging due to the difficulty of separating them from protein coding transcripts (PCTs).
Results: We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of
lincRNAs. iSeeRNA shows better performance compared to other software. A publicly available web server for
iSeeRNA is also provided for small datasets.
Conclusions: iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other
similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus
offering a valuable tool for lincRNA study.
Background
Over the past decade, evidence from numerous high-
throughput genomic platforms has revealed that even though
less than 2% of the mammalian genome encodes proteins,
a significant fraction can be transcribed into different
complex families of non-coding RNAs (ncRNAs) [1-4].
Other than microRNAs and other families of small non-
coding RNAs, long non-coding RNAs (lncRNAs, >200nt)
are emerging as potent regulators of gene expression [5].
Long intergenic non-coding RNAs (lincRNAs), a subtype of
lncRNAs originally identified by Guttman et al. [6] from four
mouse cell types using chromatin state maps, are discrete
transcriptional units located between known protein-coding
loci. Recent studies demonstrate the functional significance
of lincRNAs. However, it remains a daunting task to identify
all the lincRNAs present in various biological processes and
systems.
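The working definition used above (a transcript longer than 200 nt that lies between known protein-coding loci) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the iSeeRNA method: the names `Interval`, `overlaps`, and `is_candidate_lincrna` are hypothetical, coordinates are assumed to be 0-based half-open, and the spliced transcript length is assumed to be supplied by the caller.

```python
from typing import NamedTuple, Sequence


class Interval(NamedTuple):
    """Genomic interval; 0-based, half-open coordinates (an assumption of this sketch)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def is_candidate_lincrna(
    transcript: Interval,
    transcript_length: int,
    coding_loci: Sequence[Interval],
    min_length: int = 200,
) -> bool:
    """Apply the two definitional filters: longer than min_length nt and intergenic."""
    if transcript_length <= min_length:
        return False
    # Intergenic here means: no overlap with any known protein-coding locus.
    return not any(overlaps(transcript, locus) for locus in coding_loci)


# Hypothetical example data, for illustration only.
coding = [Interval("chr1", 1000, 5000)]
print(is_candidate_lincrna(Interval("chr1", 6000, 9000), 850, coding))
print(is_candidate_lincrna(Interval("chr1", 4000, 7000), 850, coding))
```

Passing these filters makes a transcript only a candidate: a novel intergenic assembly may still encode a protein, and separating such transcripts from true lincRNAs is the classification problem this paper addresses.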
Whole transcriptome sequencing, known as RNA-Seq,
offers the promise of rapid comprehensive discovery of
novel genes and transcripts [7]. With de novo assembly
software such as Cufflinks [8] and Scripture [6], a large set
of novel assemblies can be obtained from RNA-Seq data.
Several programs have been used to facilitate the catalo-
ging of lincRNAs from RNA-Seq assemblies. For example,
Li et al. [9] used Codon Substitution Frequency (CSF)
score [10] to identify lincRNAs from de novo assembled
transcripts in chicken skeletal muscle. Pauli et al. [11] took
advantage of PhyloCSF score [12] followed by other filter-
ing steps to identify lincRNAs expressed during zebrafish
embryogenesis. Cabili et al. [13] also used the PhyloCSF
program to eliminate de novo assembled transcripts with
positive coding potential and identified ~8200 lincRNA
loci in 24 human tissues. However, the extremely high
computational time demanded by PhyloCSF may become
the bottleneck for handling millions of assemblies
generated from high-throughput sequencing. Furthermore,
* Correspondence: xfsong@nuaa.edu.cn; huating.wang@cuhk.edu.hk;
haosun@cuhk.edu.hk
1 Li Ka Shing Institute of Health Sciences, The Chinese University of Hong
Kong, Shatin, New Territories, Hong Kong SAR, China
4 Department of Biomedical Engineering, Nanjing University of Aeronautics
and Astronautics, Nanjing 210016, China
Full list of author information is available at the end of the article
Sun et al. BMC Genomics 2013, 14(Suppl 2):S7
http://www.biomedcentral.com/1471-2164/14/S2/S7
© 2013 Sun et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.