Chinese Journal of Electronics
Vol.22, No.1, Jan. 2013
Integrating Active Learning Strategy to the
Ensemble Kernel-based Method for
Protein-Protein Interaction Extraction
∗
LI Lishuang, HUANG Degen, WANG Min and JIANG Zhenchao
(School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China)
Abstract — This paper presents an ensemble kernel-based active learning method for PPI (Protein-protein interaction) extraction. The ensemble kernel is composed of a feature-based kernel and a structure-based kernel. Experimental results show that the F-scores of PPI extraction using the ensemble kernel model on the AIMED (Abstracts in medline), IEPA (the Interaction extraction performance assessment corpus) and BCPPI (Biocreative PPI dataset) corpora are 64.50%, 69.74% and 60.38% respectively. As passive learning methods need large labeled data sets and it is expensive to label data manually, we integrate an active learning strategy into the ensemble kernel model. An uncertainty-based sampling strategy is used in the active learning method. Two active learning experiments are conducted on the AIMED, IEPA and BCPPI corpora. The results show that, with the active learning strategy, the F-scores on the AIMED, IEPA and BCPPI corpora are higher than those obtained with passive learning, while less labeled data is required.
Key words — Protein-protein interaction (PPI), Combined kernel, Active learning, SVM.
I. Introduction
With the rapid development of computational and biological technology, the amount of information about proteins and the volume of biomedical literature are expanding at an exponential rate. It is becoming increasingly difficult for biomedical experts to find protein information manually. Thus, automated PPI extraction from biomedical literature corpora has attracted substantial attention.
In recent years, many methods of extracting PPI have been proposed. These methods can be divided into three categories: rule-based methods [1], co-occurrence based methods [2] and statistical machine learning methods [3,4]. Fundel et al. designed the RelEx system [1] to extract PPI from free text. This system produced dependency parse trees using NLP tools and applied a number of rules to those trees. Since rule-based methods rely on pre-defined rules, they are unable to discover new phrase patterns that lack the known keywords. Meanwhile, some syntax parsers with large coverage may over-generate irrelevant parses and lead to incorrect relations. Co-occurrence based methods simply use co-occurrence statistics of two entities to predict their relation. Bunescu et al. investigated methods that used multiple occurrences of the same pair of entities across a collection of documents to boost the performance of a relation extraction system [2]. However, co-occurrence based methods can only extract well-known PPIs and may not be able to find newly emerging PPIs. Typically, a co-occurrence based method exhibits high recall but low precision.
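As a rough illustration of the co-occurrence idea described above, the following minimal sketch counts how often two protein names appear in the same sentence and predicts an interaction when the count reaches a threshold. The function names, the input format (one list of already recognized protein names per sentence) and the `min_count` threshold are assumptions made for this example; they are not taken from the cited systems.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count sentence-level co-occurrences of unordered protein pairs.

    `sentences` is a list of lists; each inner list holds the protein
    names recognized in one sentence (hypothetical input format).
    """
    counts = Counter()
    for proteins in sentences:
        # Sort and deduplicate so that (A, B) and (B, A) are the same pair.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts


def predict_interactions(sentences, min_count=2):
    """Predict a pair as interacting if it co-occurs at least `min_count` times.

    `min_count` is an illustrative threshold, not a value from the paper.
    """
    counts = cooccurrence_counts(sentences)
    return {pair for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    toy = [["A", "B"], ["B", "A", "C"], ["C", "D"]]
    print(predict_interactions(toy))
```

Because such a predictor ignores the sentence structure entirely, lowering the threshold accepts any pair that is mentioned together, which is consistent with the high recall and low precision noted above.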
Statistical machine learning methods can overcome the limitations of the above two kinds of methods [5-7]. Compared with rule-based methods, they do not need hand-written rules and can identify newly emerging PPIs. Statistical machine learning methods can be categorized into feature vector-based methods [3] and kernel-based methods. Liu et al. [3] proposed a feature-based method that incorporated dependency information as well as other lexical and syntactic knowledge. The performance of feature vector-based methods depends on the selected features, and such methods cannot make full use of deep parsing information. Kernel-based methods, which can utilize the structural information in a given sentence, were therefore proposed. Yang et al. provided a weighted multiple kernel learning-based approach for automatic PPI extraction from biomedical literature. The approach combined the following kernels: feature-based, tree, graph and Part-of-speech (POS) path [4]. This method represented the potential relation as a graph and defined a graph-based kernel in order to learn from the graph. Their method achieved an F-score of 56.4% on the AIMED corpus.
The kernel-based methods can make the most of deep parsing information, but they neglect lexical features. To exploit the strengths of both the feature vector-based methods and the kernel-based methods, approaches that combine the two have been proposed. Zhang et al. [8] designed an ensemble kernel combining a word feature-based kernel and a path-kernel. They used forward matching and backward matching algorithms to calculate the similarity of the paths between proteins in the path-kernel. Then, they combined the word feature-based kernel and the path-kernel within a linear kernel. Compared with Zhang’s method [8], our method incorporates another path matching algorithm, the Longest common subsequence (LCS).
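To make the two ingredients mentioned above concrete, the following minimal sketch computes an LCS-based similarity between two paths and combines a feature-kernel value with a path-kernel value linearly. It is an illustration only: the token representation of a path, the normalization by the longer path length, and the weight `alpha` are assumptions for this example, not the definitions used in this paper or in [8].

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length normalized by the longer path (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def ensemble_kernel(k_feature: float, k_path: float, alpha: float = 0.5) -> float:
    """Linear combination of two kernel values.

    `alpha` is a hypothetical mixing weight between 0 and 1.
    """
    return alpha * k_feature + (1.0 - alpha) * k_path


if __name__ == "__main__":
    # Hypothetical dependency-label paths between two protein mentions.
    p1 = ["nsubj", "prep_with", "dobj"]
    p2 = ["nsubj", "dobj"]
    print(path_similarity(p1, p2))
    print(ensemble_kernel(k_feature=1.0, k_path=path_similarity(p1, p2)))
```

Unlike strict forward or backward matching, an LCS tolerates inserted tokens in the middle of a path, so two paths that share the same ordered labels still receive a high score. Note that a normalized LCS score is used here as a plain similarity; whether it forms a valid (positive semi-definite) kernel depends on the exact definition, which this sketch does not address.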
∗
Manuscript Received Dec. 2011; Accepted Feb. 2012. This work is supported by the National Natural Science Foundation of China
(No.61173101, No.61173100).