Vol. 22 no. 11 2006, pages 1325–1334
doi:10.1093/bioinformatics/btl094
BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
Combining multi-species genomic data for microRNA
identification using a Naïve Bayes classifier
Malik Yousef¹, Michael Nebozhyn¹, Hagit Shatkay², Stathis Kanterakis¹,
Louise C. Showe¹ and Michael K. Showe¹,*
¹The Wistar Institute, Philadelphia, PA 19104, USA and
²School of Computing, Queen’s University, Kingston, Ontario, Canada
Received on December 1, 2005; revised on February 21, 2006; accepted on March 9, 2006
Advance Access publication March 16, 2006
Associate Editor: Keith A Crandall
ABSTRACT
Motivation: Most computational methodologies for microRNA gene
prediction utilize techniques based on sequence conservation and/or
structural similarity. In this study we describe a new technique, which is
applicable across several species, for predicting miRNA genes. This
technique is based on machine learning, using the Naïve Bayes
classifier. It automatically generates a model from the training data, which
consists of sequence and structure information of known miRNAs from a
variety of species.
Results: Our study shows that the application of machine learning
techniques, along with the integration of data from multiple species, is
a useful and general approach for miRNA gene prediction. Based on our
experiments, we believe that this new technique is applicable to an
extensive range of eukaryotic genomes. Specific structure and
sequence features are first used to identify miRNAs, followed by a
comparative analysis to decrease the number of false positives
(FPs). The resulting algorithm exhibits higher specificity and similar
sensitivity compared to currently used algorithms that rely on
conserved genomic regions to decrease the rate of FPs.
Availability: The BayesMiRNAfind program is available at
http://wotan.wistar.upenn.edu/miRNA
Contact: showe@wistar.org
Supplementary information: Supplementary data are available at
Bioinformatics online.
INTRODUCTION
MicroRNAs (miRNAs) are single-stranded, non-coding RNAs
averaging 21 nt in length. The mature miRNA is cleaved from a
70–110 nt ‘hairpin’ precursor with a double-stranded region
containing one or more single-stranded loops. MiRNAs target messenger
RNAs (mRNAs) for cleavage, repressing translation and
causing nascent protein degradation (Bartel, 2004).
Several computational approaches have been implemented for
miRNA gene prediction using methods based on sequence
conservation and/or structural similarity (Lim et al., 2003a, b;
Weber, 2005; Lai et al., 2003; Grad et al., 2003). Lim and others
(Lim et al., 2003a, b; Weber, 2005) developed MiRscan, a program for
identifying miRNAs that achieves 70% specificity at a sensitivity of
50%. MiRscan uses seven miRNA features with associated weights to
assign scores to hairpin candidates. The weights are estimated using
statistics based on the previously known miRNAs from Caenorhabditis
elegans. Grad et al. (2003) developed a computational method using
sequence conservation and structural similarity to predict miRNAs
in the C.elegans genome. Lai et al. (2003) used similar ideas to
develop a different computational tool for the Drosophila genome,
called miRseeker. These efforts have recently been reviewed by
Bartel (2004). Others used homology searches for revealing paralog
and ortholog miRNAs (Weber, 2005; Lagos-Quintana, 2001; Lau
et al., 2001; Lee and Ambros, 2001; Pasquinelli et al., 2000). In
addition, Wang et al. (2005) developed a method based on sequence
and structure alignment for miRNA identification. The most
recent published work of which we are aware that uses machine
learning for miRNA discovery is by Nam et al. (2005). They
constructed a highly specific probabilistic model (HMM) whose
topology and states are handcrafted based on prior knowledge
and assumptions, and whose exact probabilities are derived from
the data.
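The specificity and sensitivity figures quoted above for MiRscan can be made concrete with a small worked example. The sketch below assumes the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); this passage does not state which definitions the cited work uses, and the label lists are invented purely for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for boolean labels (True = miRNA)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true miRNAs that the predictor recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-miRNA candidates that the predictor rejects."""
    return tn / (tn + fp)


# Invented toy data: 10 true miRNAs followed by 10 background hairpins.
y_true = [True] * 10 + [False] * 10
# Hypothetical predictor: recovers 5 of the 10 miRNAs and
# wrongly accepts 3 of the 10 background hairpins.
y_pred = [True] * 5 + [False] * 5 + [True] * 3 + [False] * 7

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

With these toy counts the predictor has a sensitivity of 0.5 and a specificity of 0.7, matching the shape of the figures quoted in the text, although the real evaluation sets are of course different.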
In our study we present a machine learning approach based on the
Naïve Bayes classifier for predicting miRNA genes. Our method
differs from previous efforts in two ways: (1) we generate the model
automatically and identify rules based on the miRNA gene structure
and sequence, allowing prediction of non-conserved miRNAs, and
(2) we use a comparative analysis over multiple species to
reduce the false positive (FP) rate. This allows for a trade-off
between sensitivity and specificity. Based on our experiments
with multiple genomes we believe that our method is applicable
to a wide variety of eukaryotes. The resulting algorithm
demonstrates higher specificity and similar sensitivity compared to
currently used algorithms, which use conserved genomic regions
to reduce FPs (Lim et al., 2003a, b; Lai et al., 2003; Grad et al.,
2003).
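To illustrate the general idea of learning a Naïve Bayes model from labelled hairpin candidates, the following is a minimal sketch and not the BayesMiRNAfind implementation. The three categorical features (stem length, loop size and GC content, each binned), the class names and the training rows are all hypothetical. The score threshold is only a simple stand-in for a sensitivity/specificity knob; it is not the comparative multi-species analysis described above.

```python
import math
from collections import Counter


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # counts[c][j][v]: number of class-c examples whose feature j equals v
        self.counts = {c: [Counter() for _ in range(n_features)]
                       for c in self.class_counts}
        # values[j]: set of distinct values seen for feature j in training
        self.values = [set() for _ in range(n_features)]
        for row, c in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[c][j][v] += 1
                self.values[j].add(v)
        return self

    def log_joint(self, row, c):
        """log P(c) + sum_j log P(x_j | c), assuming conditional independence."""
        total = sum(self.class_counts.values())
        lp = math.log(self.class_counts[c] / total)
        for j, v in enumerate(row):
            num = self.counts[c][j][v] + 1  # add-one smoothing
            den = self.class_counts[c] + len(self.values[j])
            lp += math.log(num / den)
        return lp

    def log_odds(self, row, positive="miRNA", negative="background"):
        return self.log_joint(row, positive) - self.log_joint(row, negative)

    def predict(self, row, threshold=0.0):
        # A higher threshold accepts fewer candidates, which tends to
        # raise specificity at the cost of sensitivity.
        return self.log_odds(row) > threshold


# Hypothetical training data: (stem_length, loop_size, gc_content), binned.
train_rows = [
    ("long", "small", "mid"), ("long", "small", "high"), ("long", "large", "mid"),
    ("short", "large", "low"), ("short", "small", "low"), ("long", "large", "low"),
]
train_labels = ["miRNA"] * 3 + ["background"] * 3

model = NaiveBayes().fit(train_rows, train_labels)
score_hit = model.log_odds(("long", "small", "mid"))
score_miss = model.log_odds(("short", "large", "low"))
```

On this toy data the first candidate receives a positive log-odds score and the second a negative one, so `model.predict` accepts the former and rejects the latter at the default threshold.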
Like Nam et al. (2005), rather than relying on miRNA homology
between related species, we directly use features of the miRNA
sequence and secondary structure. However, in contrast to their
approach, we train a Naïve Bayes classifier to identify miRNAs directly
from the data. In our system prior knowledge is used for initial
filtering of the data, but not for constructing the model. The Naïve
Bayes classifier is a standard model with no domain-specific
assumptions (aside from the usual conditional independence
assumptions inherent to the model). In addition, whereas Nam’s model was
trained and tested on a single type of data (136 Human miRNAs)
*To whom correspondence should be addressed.
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org