A Dataset for Open Event Extraction in English
Kiem-Hieu Nguyen (1,∗), Xavier Tannier (2), Olivier Ferret (3), Romaric Besançon (3)
1. Hanoi Univ. of Science and Technology, 1 Dai Co Viet, Hai Ba Trung, Hanoi, Vietnam
2. LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, rue John von Neumann, 91403 Orsay, France
3. CEA, LIST, Vision and Content Engineering Laboratory, F-91191, Gif-sur-Yvette, France
Abstract
This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the
task of learning templates with no supervision from unlabeled texts and of grouping together entities corresponding to the same role in
a template. Most of the previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size,
non-representativeness, similarity of roles across templates) and propose a new, partially-annotated corpus in English which remedies
some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search
engine to retrieve different documents on the same events. Only Wikinews documents are manually annotated and can be used for
evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate
some existing systems on this new data.
Keywords: Event extraction, corpus creation, unsupervised methods.
1. Introduction
Information Extraction has been defined by the Message
Understanding Conference (MUC) evaluations (Grishman
and Sundheim, 1996) and its successors, i.e. the Automatic
Content Extraction (ACE) (Doddington et al., 2004) and
Text Analysis Conference (TAC) (Ellis et al., 2014)
evaluations, specifically by the task of template filling. The
objective of this task is to assign event roles to individual
textual mentions. A template defines a specific type
of event (e.g. earthquakes), associated with semantic
roles (or slots) held by entities (for earthquakes, typically
their location, date, magnitude and the damage they
caused (Jean-Louis et al., 2011)). This kind of structure is
comparable to the schemas of Schank and Abelson (1977).
Schema induction is the task of learning these structures
with no supervision from unlabeled texts. We focus here
more specifically on event schema induction (Chambers
and Jurafsky, 2011; Chambers, 2013; Cheung et al., 2013;
Nguyen et al., 2015). The idea is to group entities
corresponding to the same role into an event template.
Figure 1 illustrates this process.
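As a minimal sketch of the structure described above, the following Python snippet represents a template as an event type with slots, each slot holding the entity mentions grouped under one role. The class `Template`, the function `build_template` and the role assignments are hypothetical names and data chosen for illustration, with slot names borrowed from the MUC schema example. In actual schema induction the roles are not given in advance but are discovered as unlabeled clusters.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class Template:
    """An event type together with its semantic roles (slots)."""
    event_type: str
    # slot name -> entity mentions grouped under that role
    slots: Dict[str, Set[str]] = field(default_factory=dict)


def build_template(event_type: str,
                   role_assignments: Iterable[Tuple[str, str]]) -> Template:
    """Group (mention, role) pairs into the slots of a single template."""
    slots = defaultdict(set)
    for mention, role in role_assignments:
        slots[role].add(mention)
    return Template(event_type, dict(slots))


# Hypothetical assignments; an induction system would have to infer them.
assignments = [
    ("bomb", "Instrument"),
    ("explosive", "Instrument"),
    ("civilian", "Victim"),
    ("woman", "Victim"),
]
bombing = build_template("BOMBING", assignments)
print(sorted(bombing.slots["Instrument"]))
```

Using sets for the slot contents reflects the fact that a slot is a group of entities rather than an ordered list; a real system would typically attach scores or document offsets to each mention as well.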
Previous work on event schema induction was evaluated
on the MUC-4 corpus (Grishman and Sundheim, 1996).
However, this corpus raises two main issues:
• It was annotated with templates describing all events
with the same set of slots.
• It doesn’t contain redundancy.
The first issue is a limitation that stems from the fact that
all the event types considered in the MUC-4 corpus
are close to each other, while the second issue makes it
difficult to apply current machine learning methods.
Figure 1: Event induction process (MUC schema example). [Diagram: through data aggregation and schema building, documents are turned into schemas (templates/slots). Entity groups such as {citizen, woman, police, victim, civilian} and {bomb, explosion, fire, charge, explosive} form slots (Slot i, Slot i+1) of templates such as ATTACK, BOMBING and ARSON, each with the slots Perpetrator, Instrument, Target and Victim.]
∗ This author was affiliated with LIMSI-CNRS when working on this project.
In this paper, we propose the ASTRE corpus in order to
tackle these two issues. We report experimental results on
this corpus using state-of-the-art event schema induction
methods. The rest of the paper is organized as follows.
Section 2 presents the MUC-4 corpus and its limitations
for evaluating schema induction. It also discusses its
successors, i.e. the ACE and TAC corpora. Section 3
describes the creation of the ASTRE corpus while Section 4
shows the evaluation results of two state-of-the-art systems
for the open event extraction task on it. Finally, Section 5
concludes the paper.
2. MUC-4 Corpus
A significant part of the work in the field of event schema
induction from texts such as (Chambers and Jurafsky, 2011;
Chambers, 2013; Cheung et al., 2013; Nguyen et al.,
2015) relies on the MUC-4 corpus for its evaluation. This
corpus contains 1,700 news articles about terrorist incidents
that took place in Latin America. The corpus is divided into