Combining Multi-Species Genomic Data: Identifying microRNA with a Naive Bayes Classifier

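"This paper explores a method for identifying microRNA that combines multi-species genomic data with a Naive Bayes classifier. The study proposes a new prediction technique applicable to multiple species. The technique is based on machine learning, specifically the Naive Bayes classifier, and automatically generates a model from training data consisting of sequence and structure information of known microRNAs from different species."

In bioinformatics, microRNAs (miRNAs) are small non-coding RNA molecules that play a key role in regulating gene expression. By base-pairing with complementary target mRNAs, they cause translational repression or mRNA degradation, and so take part in many biological processes, including development, cell proliferation and disease. Because of this biological importance, predicting and identifying miRNAs has become an important research direction.

Traditional miRNA prediction methods usually rely on sequence conservation and/or structural similarity. Sequence conservation means that genes with the same function retain a degree of sequence identity across species during evolution. Structural similarity refers to secondary-structure features of the miRNA precursor, such as the stem-loop. These methods may fail to capture the characteristics of all miRNAs, especially those that are less conserved in sequence or structure.

The technique proposed in the paper adopts a machine learning strategy, specifically the Naive Bayes classifier. Naive Bayes is a probabilistic classification method that assumes the features are conditionally independent given the class and predicts from the conditional probability of each feature given the class. Applied to miRNA identification, the algorithm learns the sequence and structure features of known miRNAs and then uses them to predict whether a new sequence is likely to encode a miRNA.

According to the paper's abstract, the resulting algorithm shows higher specificity and similar sensitivity compared with currently used algorithms that rely on conserved genomic regions to reduce false positives. Comparing data across species can improve the model's ability to generalize and to recognize patterns that are weak in one species but consistent in others. The method may also help discover new miRNA families or evolutionarily young miRNAs, since these may show different sequence or structure features in some species.

Overall, the work shows how cross-species information and machine learning can improve miRNA prediction, giving researchers a tool for identifying candidate miRNAs more accurately and for studying miRNA functional diversity, evolution and gene regulatory networks.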

Vol. 22 no. 11 2006, pages 1325–1334

doi:10.1093/bioinformatics/btl094

BIOINFORMATICS ORIGINAL PAPER

Sequence analysis

Combining multi-species genomic data for microRNA identification using a Naïve Bayes classifier

Malik Yousef, Michael Nebozhyn, Hagit Shatkay, Stathis Kanterakis, Louise C. Showe and Michael K. Showe

The Wistar Institute, Philadelphia, PA 19104, USA and School of Computing, Queen’s University, Kingston, Ontario, Canada

Received on December 1, 2005; revised on February 21, 2006; accepted on March 9, 2006

Advance Access publication March 16, 2006

Associate Editor: Keith A Crandall

ABSTRACT

Motivation: Most computational methodologies for microRNA gene prediction utilize techniques based on sequence conservation and/or structural similarity. In this study we describe a new technique, which is applicable across several species, for predicting miRNA genes. This technique is based on machine learning, using the Naïve Bayes classifier. It automatically generates a model from the training data, which consists of sequence and structure information of known miRNAs from a variety of species.

Results: Our study shows that the application of machine learning techniques, along with the integration of data from multiple species, is a useful and general approach for miRNA gene prediction. Based on our experiments, we believe that this new technique is applicable to an extensive range of eukaryotes’ genomes. Specific structure and sequence features are first used to identify miRNAs, followed by a comparative analysis to decrease the number of false positives (FPs). The resulting algorithm exhibits higher specificity and similar sensitivity compared to currently used algorithms that rely on conserved genomic regions to decrease the rate of FPs.

Availability: The BayesMiRNAfind program is available at http://wotan.wistar.upenn.edu/miRNA

Contact: showe@wistar.org

Supplementary information: Supplementary data are available at Bioinformatics online.
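The abstract describes a Naïve Bayes model that is generated automatically from training data. As a minimal sketch of that general idea, and not the BayesMiRNAfind implementation, the following Python trains a categorical Naïve Bayes classifier with add-one smoothing. The three binned features (loop size, stem pairing, GC content), their values and the toy training set are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels, alpha=1.0):
    """Count classes and per-feature values; alpha is the smoothing constant."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values, alpha

def log_posteriors(model, features):
    """Unnormalized log P(class) + sum of log P(feature_i | class)."""
    class_counts, feature_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, value in enumerate(features):
            count = feature_counts[label][i][value]
            n_values = len(feature_values[i])
            # Smoothing keeps unseen feature values from zeroing the product.
            score += math.log((count + alpha) / (n + alpha * n_values))
        scores[label] = score
    return scores

def predict(model, features):
    scores = log_posteriors(model, features)
    return max(scores, key=scores.get)

# Hypothetical toy data: (loop size bin, stem pairing bin, GC content bin).
samples = [
    ("small_loop", "high_pairing", "mid_gc"),
    ("small_loop", "high_pairing", "high_gc"),
    ("large_loop", "high_pairing", "mid_gc"),
    ("large_loop", "low_pairing", "low_gc"),
    ("large_loop", "low_pairing", "mid_gc"),
    ("small_loop", "low_pairing", "low_gc"),
]
labels = ["miRNA", "miRNA", "miRNA", "other", "other", "other"]

model = train_naive_bayes(samples, labels)
prediction = predict(model, ("small_loop", "high_pairing", "mid_gc"))
print(prediction)
```

Each feature contributes an independent factor to the class score, which is the conditional independence assumption that gives the classifier its name. Working in log space avoids numerical underflow when many features are multiplied together.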

INTRODUCTION

MicroRNAs (miRNAs) are single-stranded, non-coding RNAs averaging 21 nt in length. The mature miRNA is cleaved from a 70–110 nt ‘hairpin’ precursor with a double-stranded region containing one or more single-stranded loops. MiRNAs target messenger RNAs (mRNAs) for cleavage, repressing translation and causing nascent protein degradation (Bartel, 2004).
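Hairpin structures of this kind are commonly written in dot-bracket notation, where '(' and ')' mark paired bases and '.' marks unpaired ones. The excerpt does not say which representation or structural features the authors use, so the following is a hypothetical sketch: it summarizes a dot-bracket string and checks the 70–110 nt precursor length range quoted above. The function names and the example structure are assumptions.

```python
import re

MIN_PRECURSOR_NT = 70   # lower bound quoted in the text
MAX_PRECURSOR_NT = 110  # upper bound quoted in the text

def hairpin_features(structure):
    """Summarize a dot-bracket secondary structure string."""
    if set(structure) - set("()."):
        raise ValueError("expected only '(', ')' and '.' characters")
    if structure.count("(") != structure.count(")"):
        raise ValueError("opening and closing pair counts differ")
    paired = structure.count("(") + structure.count(")")
    # Drop unpaired flanks so only single-stranded runs inside the stem remain.
    core = structure.strip(".")
    runs = re.findall(r"\.+", core)
    return {
        "length": len(structure),
        "paired_fraction": paired / len(structure) if structure else 0.0,
        "unpaired_runs": len(runs),
        "longest_unpaired_run": max((len(r) for r in runs), default=0),
    }

def in_precursor_length_range(length):
    return MIN_PRECURSOR_NT <= length <= MAX_PRECURSOR_NT

# Hypothetical toy structure, far shorter than a real precursor.
toy = "..((((((..((((....))))..))))))."
features = hairpin_features(toy)
print(features, in_precursor_length_range(features["length"]))
```

Note that a symmetric internal loop appears as two unpaired runs, one on each strand, so `unpaired_runs` counts single-stranded stretches and not distinct loops.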

Several computational approaches have been implemented for miRNA gene prediction using methods based on sequence conservation and/or structural similarity (Lim et al., 2003a, b; Weber, 2005; Lai et al., 2003; Grad et al., 2003). Lim and others (Lim et al., 2003a, b; Weber, 2005) developed a program called MiRscan for identifying miRNAs, with 70% specificity at a sensitivity of 50%. MiRscan uses seven miRNA features with associated weights to build a computational tool, which assigns scores to hairpin candidates. The weights are estimated using statistics based on the previously known miRNAs from Caenorhabditis elegans. Grad et al. (2003) developed a computational method using sequence conservation and structural similarity to predict miRNAs in the C. elegans genome. Lai et al. (2003) used similar ideas to develop a different computational tool for the Drosophila genome, called miRseeker. These efforts have recently been reviewed by Bartel (2004). Others used homology searches for revealing paralog and ortholog miRNAs (Weber, 2005; Lagos-Quintana, 2001; Lau et al., 2001; Lee and Ambros, 2001; Pasquinelli et al., 2000). In addition, Wang et al. (2005) developed a method based on sequence and structure alignment for miRNA identification. The most recent published work of which we are aware that uses machine learning for miRNA discovery is by Nam et al. (2005). They constructed a highly specific probabilistic model (HMM) whose topology and states are handcrafted based on prior knowledge and assumptions, and whose exact probabilities are derived from the data.

In our study we present a machine learning approach based on the Naïve Bayes classifier for predicting miRNA genes. Our method differs from previous efforts in two ways: (1) we generate the model automatically and identify rules based on the miRNA gene structure and sequence, allowing prediction of non-conserved miRNAs and (2) we use a comparative analysis over multiple species to reduce the false positive (FP) rate. This allows for a trade-off between sensitivity and specificity. Based on our experiments with multiple genomes we believe that our method is applicable to a wide variety of eukaryotes. The resulting algorithm demonstrates higher specificity and similar sensitivity compared to currently used algorithms, which use conserved genomic regions to reduce FPs (Lim et al., 2003a, b; Lai et al., 2003; Grad et al., 2003).
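The two steps in this paragraph, a classifier pass followed by a cross-species comparative filter, can be sketched as below to show how the second step trades sensitivity for specificity. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the `Candidate` fields, the score threshold and the toy data are hypothetical, and specificity is computed with the common definition TN / (TN + FP), which the excerpt does not spell out.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float      # hypothetical classifier score; higher means more miRNA-like
    conserved: bool   # hypothetical flag: a similar hairpin exists in another species
    is_mirna: bool    # ground-truth label, available only for labeled test data

def two_stage_predict(candidates, score_threshold, use_comparative_filter):
    """Stage 1 keeps high-scoring hairpins; stage 2 optionally drops non-conserved ones."""
    kept = [c for c in candidates if c.score >= score_threshold]
    if use_comparative_filter:
        kept = [c for c in kept if c.conserved]
    return kept

def sensitivity_specificity(candidates, predicted):
    """Return (TP / (TP + FN), TN / (TN + FP)) for a set of predictions."""
    predicted_names = {c.name for c in predicted}
    tp = fn = fp = tn = 0
    for c in candidates:
        called = c.name in predicted_names
        if c.is_mirna and called:
            tp += 1
        elif c.is_mirna:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Hypothetical toy test set: four true miRNAs and four non-miRNA hairpins.
candidates = [
    Candidate("c1", 0.95, True, True),
    Candidate("c2", 0.90, True, True),
    Candidate("c3", 0.85, False, True),   # true miRNA that is not conserved
    Candidate("c4", 0.40, True, True),    # true miRNA with a low score
    Candidate("c5", 0.88, False, False),  # FP removed by the comparative filter
    Candidate("c6", 0.80, True, False),   # FP that survives both stages
    Candidate("c7", 0.30, False, False),
    Candidate("c8", 0.20, True, False),
]

stage1 = sensitivity_specificity(candidates, two_stage_predict(candidates, 0.5, False))
stage2 = sensitivity_specificity(candidates, two_stage_predict(candidates, 0.5, True))
print("classifier only:", stage1)
print("with comparative filter:", stage2)
```

On this toy set the classifier alone gives sensitivity 0.75 and specificity 0.5, and adding the comparative filter gives sensitivity 0.5 and specificity 0.75. The numbers only illustrate the direction of the trade-off; they say nothing about the performance reported in the paper.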

Like Nam et al. (2005), rather than relying on miRNA homology between related species, we directly use features of the miRNA sequence and secondary structure. However, in contrast to them, we train a Naïve Bayes classifier to identify miRNAs directly from the data. In our system prior knowledge is used for initial filtering of the data, but not for constructing the model. The Naïve Bayes classifier is a standard model with no domain-specific assumptions (aside from the usual conditional independence assumptions inherent to the model). In addition, whereas Nam’s model was trained and tested on a single type of data (136 Human miRNAs)



To whom correspondence should be addressed.
