automatically determined to better include syntactic and semantic information. Moreover, a context-sensitive CTK,
which enumerates both context-free and context-sensitive sub-trees by considering their ancestor node paths as their
contexts, is proposed to better capture structural information in the semantic relation tree structure. Finally, our tree
kernel and a state-of-the-art linear kernel are interpolated by using a composite kernel to evaluate their complementary
nature.
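The interpolation idea can be sketched as follows (a minimal illustration only: the mixing weight, the toy stand-in kernels, and the instance representation below are hypothetical placeholders, not the actual kernels or settings used in this work). Since a convex combination of two valid kernels is itself a valid kernel, the composite can be used directly with a kernel machine such as SVM:

```python
def composite_kernel(k_tree, k_linear, alpha=0.5):
    # Convex combination of two kernels; alpha is an illustrative weight.
    def k(x, y):
        # Each instance carries both views: x = (tree_view, feature_view).
        return alpha * k_tree(x[0], y[0]) + (1.0 - alpha) * k_linear(x[1], y[1])
    return k

def dot_kernel(f1, f2):
    # Linear kernel over sparse feature dictionaries.
    return sum(v * f2.get(feat, 0.0) for feat, v in f1.items())

def toy_tree_kernel(t1, t2):
    # Stand-in for a tree kernel: counts shared node labels (illustrative only).
    return float(len(t1 & t2))

k = composite_kernel(toy_tree_kernel, dot_kernel, alpha=0.4)
x = ({"S", "NP", "VP"}, {"w=acquired": 1.0})
y = ({"S", "NP", "PP"}, {"w=acquired": 1.0, "w=by": 1.0})
print(k(x, y))  # 0.4 * 2 shared labels + 0.6 * 1.0 dot product = 1.4
```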
The layout of this paper is as follows: First, related work is reviewed in more detail in Section 2. The rich semantic relation
tree structure is then introduced in Section 3, while the context-sensitive CTK is proposed in Section 4. In Section 5, the tree
kernel-based semantic relation extraction is systematically evaluated on the ACE RDC corpora. Finally, our work is concluded
in Section 6.
2. Related work
Semantic relation extraction was first introduced as part of the Template Element task in the sixth Message Understanding Conference (MUC-6) and then formulated as the template relation task in the seventh Message Understanding Conference (MUC-7). With the introduction of the ACE program, it was further reformulated as the RDC task.
Since then, many methods, such as feature vector-based methods [8,10,25–28], tree kernel-based methods [20,6,2,21,22],
and composite kernel-based methods [24,21,22] have been proposed in the literature.
For the feature vector-based methods, Kambhatla [10] employed Maximum Entropy models to combine diverse lexical,
syntactic and semantic features in semantic relation extraction, and achieved an F-measure of 52.8 on the 24 relation sub-
types of the ACE RDC 2003 corpus. Zhou et al. [25,27] systematically explored diverse features through a linear kernel and
with Support Vector Machines (SVM), and achieved F-measures of 68.0 and 55.5 on the five relation types and the 24 relation
subtypes of the ACE RDC 2003 corpus, respectively. Zhou et al. [26,28] further improved the performance by exploring the
commonality among related classes in a class hierarchy with a hierarchical learning strategy. Jiang and Zhai [8] also system-
atically evaluated the effectiveness of different feature subspaces with different complexities and obtained the best F-mea-
sure of 71.5 on the seven relation types of the ACE RDC 2004 corpus. One problem with feature vector-based methods is that
they often require extensive feature engineering (e.g. feature design, implementation and selection). Another problem is that
although they can explore some structural information in the parse tree, it is difficult to preserve the structural information
in the parse trees with the feature vector-based methods (e.g., [10] used the non-terminal path connecting the two given
entities in a parse tree, while Zhou et al. [23,27] introduced additional chunking features to enhance the performance).
As an alternative to the feature vector-based methods, the kernel-based methods [7] have been proposed to implicitly
explore various features in a high dimensional space by employing a kernel to directly calculate the similarity between
two objects. In particular, kernel-based methods can be effective in reducing the burden of feature engineering for structured
objects in Natural Language Processing (NLP) research, such as the tree structure in semantic relation extraction.
Zelenko et al. [20] proposed a kernel between two parse trees, which recursively matches nodes from roots to leaves in a
top-down manner. For each pair of matched nodes, a subsequent kernel on their child nodes is invoked. They achieved great
success in two simple semantic relation extraction tasks. Culotta and Sorensen [6] extended their work to estimate the sim-
ilarity between augmented dependency trees and achieved an F-measure of 45.8 on the five relation types of the ACE RDC
2003 corpus. One problem with the above two tree kernels is that two matched nodes must be at the same height and have
the same path to their respective root nodes. Bunescu and Mooney [2] proposed the shortest path dependency tree kernel,
which multiplies together the number of common word classes at each position in the two paths, and achieved an F-measure of 52.5 on
the five relation types of the ACE RDC 2003 corpus. They argued that the information to model a relationship between two
entities could be typically captured by the shortest path between them in the dependency graph. Their kernel is unable to
fully preserve the structured information in the dependency tree, and it is further constrained by the requirement that the two matched paths have the same length. This causes it to exhibit behavior similar to that reported in the work of Culotta and Sorensen
[6], that is, high precision but low recall.
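Their formulation admits a compact sketch: each position along the shortest path holds a set of word classes (the word itself plus generalizations such as its part-of-speech tag), the kernel is the product over positions of the number of common classes, and paths of different lengths score zero. The toy paths and class sets below are simplified hypothetical examples:

```python
def shortest_path_kernel(path1, path2):
    # Paths of different lengths have zero similarity -- the constraint
    # behind the high-precision / low-recall behavior noted above.
    if len(path1) != len(path2):
        return 0
    k = 1
    for classes1, classes2 in zip(path1, path2):
        k *= len(classes1 & classes2)  # common word classes at this position
    return k

# Toy shortest paths; each position is a set of word classes (simplified).
p1 = [{"protesters", "NNS", "Noun"}, {"->"}, {"seized", "VBD", "Verb"},
      {"<-"}, {"stations", "NNS", "Noun"}]
p2 = [{"troops", "NNS", "Noun"}, {"->"}, {"raided", "VBD", "Verb"},
      {"<-"}, {"churches", "NNS", "Noun"}]
print(shortest_path_kernel(p1, p2))  # 2 * 1 * 2 * 1 * 2 = 8
```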
To develop an effective tree kernel method, Zhang et al. [21,22] explored various semantic relation tree structures and
used the standard CTK over semantic relation trees [5] to model structural information for semantic relation extraction.
They achieved F-measures of 61.9 and 63.6 on the five relation types of the ACE RDC 2003 corpus and the seven relation
types of the ACE RDC 2004 corpus, respectively, without entity-related information, while the F-measure on the five rela-
tion types of the ACE RDC 2003 corpus reached 68.7 when entity-related information was included in the parse tree. One
problem with the standard CTK is that the sub-trees involved in the tree kernel computation are context-free, that is, they
do not consider the information outside the sub-trees. This is different from the tree kernel in [6], where the sub-trees
involved in the tree kernel computation are context-sensitive (i.e., they consider the path from the tree root node to
the sub-tree root node). Zhang et al. [21,22] also showed that the widely-used SPT structure performed best. However,
one problem with the SPT is that it fails to capture the contextual information outside the shortest path, yet such infor-
mation is important for semantic relation extraction in many cases. Our random selection of 100 positive training in-
stances from the ACE RDC 2003 training corpus shows that about 25% of the cases need contextual information outside
the shortest path. In other related work, Bunescu and Mooney [3] proposed a subsequence kernel and applied it to protein–protein interaction extraction and the ACE RDC tasks. Zhang et al. [23] employed a grammar-driven CTK in semantic role
labeling with some success, following the pioneering work of Moschitti [13].
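The context-free versus context-sensitive distinction can be illustrated with a deliberately simplified sketch. Rather than the full CTK recursion over all sub-trees, it matches single productions, optionally keyed by the node's ancestor path; the tree structures and helper names are hypothetical, not from the cited implementations:

```python
from collections import Counter

# A tree node is (label, children), with children a tuple of nodes.
def productions(tree, ancestors=()):
    # Yield (ancestor_path, production) for every internal node.
    label, children = tree
    if children:
        yield (ancestors, (label, tuple(c[0] for c in children)))
        for child in children:
            yield from productions(child, ancestors + (label,))

def match_count(t1, t2, context_sensitive=True):
    # Simplified matching over single productions (not the full CTK):
    # context-sensitive matching additionally requires that the
    # root-to-node ancestor paths be identical.
    def key(item):
        path, prod = item
        return (path, prod) if context_sensitive else prod
    c1 = Counter(key(i) for i in productions(t1))
    c2 = Counter(key(i) for i in productions(t2))
    return sum(min(c1[k], c2[k]) for k in c1)

# NP -> NN appears in both trees, but under different ancestor paths.
t1 = ("S", (("VP", (("NP", (("NN", ()),)),)),))   # NP under S/VP
t2 = ("S", (("NP", (("NN", ()),)),))              # NP directly under S
print(match_count(t1, t2, context_sensitive=False))  # 1: NP -> NN matches
print(match_count(t1, t2, context_sensitive=True))   # 0: ancestor paths differ
```

The context-sensitive variant is stricter: a production only contributes when its entire path to the root matches, which is exactly the extra information the standard context-free CTK discards.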
G. Zhou et al. / Information Sciences 180 (2010) 1313–1325