An Ensemble Kernel Method with an Integrated Active Learning Strategy for Improving Protein-Protein Interaction Extraction


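This paper examines how to integrate an active learning strategy into an ensemble kernel-based method in order to improve the extraction of protein-protein interactions (PPI) from biomedical literature. PPIs are central to understanding cellular function and disease mechanisms. Passive learning methods depend on large labeled data sets, and labeling data manually is slow and expensive, so the authors turn to active learning to reduce the labeling requirement.

The core of the method is an ensemble kernel that combines a feature-based kernel with a structure-based kernel. As the paper's introduction explains, feature vector-based methods rely on lexical and syntactic features, whereas kernel-based methods exploit the structural (deep parsing) information of a sentence. Combining the two complementary views is intended to improve the accuracy of PPI recognition.

On three commonly used corpora, AIMED (Abstracts in Medline), IEPA (the Interaction Extraction Performance Assessment corpus) and BCPPI (BioCreative PPI dataset), the ensemble kernel model reaches F-scores of 64.50%, 69.74% and 60.38% respectively.

The active learning method uses an uncertainty-based sampling strategy: the model preferentially selects the samples whose predictions are most uncertain and submits them for labeling, so that a limited labeling budget is spent on the most informative examples. Two active learning experiments on the AIMED, IEPA and BCPPI corpora yield F-scores better than those of passive learning while reducing the amount of labeled data.

In short, the paper combines different types of kernels and introduces an active learning strategy to address the data labeling problem in biomedical text mining.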
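The uncertainty-based selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that uncertainty is measured by the absolute value of a signed SVM decision score (the closer to zero, the less certain), which is a common choice but is not confirmed by this excerpt. The function name `select_most_uncertain`, the instance IDs and the scores are all hypothetical.

```python
from typing import Dict, List


def select_most_uncertain(decision_scores: Dict[str, float], batch_size: int) -> List[str]:
    """Return the IDs of the unlabeled instances closest to the decision boundary.

    Assumption: `decision_scores` maps an instance ID to a signed classifier
    score, and a smaller absolute score means a less certain prediction.
    """
    ranked = sorted(decision_scores, key=lambda inst_id: abs(decision_scores[inst_id]))
    return ranked[:batch_size]


# Hypothetical signed decision scores for four unlabeled candidate protein pairs.
scores = {"s1": 1.80, "s2": -0.05, "s3": 0.40, "s4": -2.10}

# The two least certain instances would be sent to an annotator, added to the
# training set, and the model retrained before the next selection round.
to_label = select_most_uncertain(scores, batch_size=2)
print(to_label)
```

In a full active learning loop this selection would alternate with retraining until the labeling budget is exhausted or performance stops improving.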


Chinese Journal of Electronics

Vol.22, No.1, Jan. 2013

Integrating Active Learning Strategy to the Ensemble Kernel-based Method for Protein-Protein Interaction Extraction∗

LI Lishuang, HUANG Degen, WANG Min and JIANG Zhenchao

(School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China)

Abstract — This paper presents an ensemble kernel-based active learning method for PPI (Protein-protein interaction) extraction. The ensemble kernel is composed of a feature-based kernel and a structure-based kernel. Experimental results show that the F-scores of PPI extraction using the ensemble kernel model on the AIMED (Abstracts in Medline), IEPA (the Interaction extraction performance assessment corpus) and BCPPI (Biocreative PPI dataset) corpora are 64.50%, 69.74% and 60.38% respectively. As passive learning methods need large labeled data sets and it is expensive to label data manually, we integrate an active learning strategy into the ensemble kernel model. The uncertainty-based sampling strategy is used in the active learning method. Two active learning experiments are conducted on the AIMED, IEPA and BCPPI corpora. The experimental results show that, with the active learning strategy integrated, the F-scores on the AIMED, IEPA and BCPPI corpora are better than those obtained with passive learning, while the amount of data to be labeled is reduced.

Key words — Protein-protein interaction (PPI), Combined kernel, Active learning, SVM.

I. Introduction

With the rapid development of computational and biological technology, the amount of information about proteins and the biomedical literature is expanding at an exponential rate. It is becoming more and more difficult for biomedical experts to detect protein information manually. Thus, automated PPI extraction from biomedical literature corpora has attracted substantial attention.

In recent years, many methods of extracting PPI have been proposed. These methods can be divided into three categories: rule-based methods[1], co-occurrence based methods[2] and statistical machine learning methods[3,4]. Fundel et al. designed the RelEx system[1] to extract PPI from free text. This system produced dependency parse trees based on NLP and defined a number of rules to parse the trees. Since rule-based methods utilize pre-defined rules, they are unable to discover new phrase patterns without the known keywords. Meanwhile, some syntax parsers with large coverage may over-generate irrelevant parses and lead to incorrect relations. Co-occurrence based methods simply use co-occurrence statistics of two entities to predict their relation. Bunescu et al. investigated methods that used multiple occurrences of the same pair of entities across a collection of documents in order to boost the performance of a relation extraction system[2]. However, co-occurrence based methods can only extract well-known PPIs and may not be able to find newly emerging PPIs. Typically, a co-occurrence based method exhibits high recall but low precision.
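To see why co-occurrence tends toward high recall and low precision, consider the simplest possible variant, sketched below. It is an illustrative assumption, not the method of Bunescu et al.: every pair of known protein names that appears in the same sentence is predicted to interact, with no statistics across documents. The function name `cooccurring_pairs` and the protein names are placeholders.

```python
from itertools import combinations
from typing import List, Set, Tuple


def cooccurring_pairs(sentences: List[List[str]], proteins: Set[str]) -> Set[Tuple[str, str]]:
    """Predict an interaction for every pair of proteins sharing a sentence."""
    pairs: Set[Tuple[str, str]] = set()
    for tokens in sentences:
        # Sort the mentions so each unordered pair has one canonical form.
        mentioned = sorted({tok for tok in tokens if tok in proteins})
        pairs.update(combinations(mentioned, 2))
    return pairs


# Toy tokenized sentences with placeholder protein names.
sentences = [
    "ProtA binds ProtB in vitro".split(),
    "ProtA and ProtC were not found to interact".split(),
]
predicted = cooccurring_pairs(sentences, proteins={"ProtA", "ProtB", "ProtC"})
print(sorted(predicted))
```

The second sentence explicitly denies an interaction, yet the pair is still predicted: nothing that co-occurs is missed, but negated or unrelated mentions become false positives.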

Statistical machine learning methods can overcome the limitations of the above two methods[5−7]. Compared with rule-based methods, they do not need to extract rules and can identify newly emerging PPIs. Statistical machine learning methods can be categorized into feature vector-based methods[3] and kernel-based methods. Liu et al.[3] proposed a feature-based method that incorporated dependency information as well as other lexical and syntactic knowledge. The performance of feature vector-based methods is affected by the selected features, and such methods cannot make full use of deep parsing information. Kernel-based methods, which can utilize the structural information in a given sentence, were therefore proposed. Yang et al. provided a weighted multiple kernel learning-based approach for automatic PPI extraction from biomedical literature. The approach combined the following kernels: feature-based, tree, graph and Part-of-speech (POS) path[4]. This method represented the potential relation as a graph and defined a graph-based kernel in order to learn from the graph. Their method achieved a 56.4% F-score on the AIMED corpus.

The kernel-based methods can make the most of deep parsing information, but they neglect lexical features. To take advantage of both the feature vector-based methods and the kernel-based methods, a method that combines the two was proposed. Zhang et al.[8] designed an ensemble kernel combining a word feature-based kernel and a path kernel. They used forward matching and backward matching algorithms to calculate the similarity of the paths between proteins in the path kernel. Then, they combined the word feature-based kernel and the path kernel within a linear kernel. Compared with Zhang's method[8], our method incorporates another path matching algorithm, the Longest common subsequence (LCS).
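The idea of scoring two paths by their longest common subsequence and then combining that score linearly with a word-feature score can be sketched as follows. This excerpt does not give the paper's actual formulas, so everything here is an assumption made for illustration: the normalization by the longer path, the cosine-style overlap for word features, the weight `alpha`, and all function names. Note also that a normalized LCS score is a similarity measure and is not guaranteed to be a valid positive semi-definite kernel.

```python
from typing import Sequence, Set


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length normalized by the longer path (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def feature_similarity(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two binary bag-of-words feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b)) ** 0.5


def ensemble_similarity(feat_a: Set[str], feat_b: Set[str],
                        path_a: Sequence[str], path_b: Sequence[str],
                        alpha: float = 0.5) -> float:
    """Linear combination of the two scores; `alpha` is a hypothetical weight."""
    return (alpha * feature_similarity(feat_a, feat_b)
            + (1 - alpha) * path_similarity(path_a, path_b))


# Toy dependency paths between two protein mentions (labels are illustrative).
path_1 = ["PROT1", "nsubj", "binds", "dobj", "PROT2"]
path_2 = ["PROT1", "nsubj", "interacts", "prep_with", "PROT2"]
common = lcs_length(path_1, path_2)
score = ensemble_similarity({"binds", "to"}, {"binds", "with"}, path_1, path_2)
print(common, round(score, 2))
```

Unlike strict forward or backward matching, the LCS tolerates mismatched elements in the middle of a path, which is why the two toy paths still share three elements.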

∗Manuscript Received Dec. 2011; Accepted Feb. 2012. This work is supported by the National Natural Science Foundation of China (No.61173101, No.61173100).
