Open Domain Atomic Event Extraction via Double Propagation for Chinese Text
Rui Sun
Computer School
Wuhan University
Wuhan, China
e-mail: ruisun@whu.edu.cn
Sheng Guo
Computer School
Wuhan University
Wuhan, China
e-mail: whucsgs@163.com
Donghong Ji
Computer School
Wuhan University
Wuhan, China
e-mail: dhji@whu.edu.cn
Abstract—Recent studies show that structured atomic event
information is beneficial for representing discourse semantics.
However, extracting useful structured representations of events
from open-domain text is a challenging problem. On one hand,
previous event extraction methods designed for specific domains
cannot be directly applied to the open domain because of domain
limitations and predefined event patterns. On the other hand,
atomic event extraction is simply regarded as a preprocessing step
in previous related work, and few studies focus on atomic
event extraction in the open domain. In this paper, we propose
an unsupervised method for Chinese event extraction in the open
domain. To address the ellipsis and flexible sentence
structure of Chinese text, the proposed method exploits double
propagation (DP) to combine event extraction and event pattern
generation; it does not require seed events or seed event
patterns and is able to eliminate noise from syntactic parsing.
Experimental results on a standard benchmark show that our
proposed method outperforms a state-of-the-art algorithm.
Keywords-atomic event; double propagation; event
extraction; open domain;
I. INTRODUCTION
With the continuous growth of text resources, it is necessary
to study how to extract knowledge from unstructured text. As
a special structure, an event represents a more complex semantic
relation than an entity relation. Most previous studies (e.g., [1],
[2], [3]) on event extraction are conducted on news articles
in specific domains, such as the ACE 2005 standard dataset. These
methods cannot be directly applied to the open domain because
of their predefined event patterns. In recent years, the event
form Subject + Predicate + Object has proved to
be significantly effective for a range of natural language
processing applications (e.g., [4], [5], [6]). These studies
exploit structured atomic event information to represent
discourse semantics. However, event extraction is simply
regarded as a preprocessing step in these works. Few studies
pay attention to event extraction in the open domain.
In this paper, we focus on atomic event extraction in the open
domain. Methods for atomic event extraction fall mainly into
two categories. The first is rule-based (e.g., [7],
[8], [9]) and directly exploits syntactic rules such as
dependency relations. The event trigger and arguments are
identified according to specific dependencies, such as
nsubj and dobj. The main drawback of this approach is that it
relies on the dependency parser, so parsing errors carry over
to the extracted events. The second is ORE-based
(e.g., [10], [11]) and extracts events by means of open
relation extraction. So far, open relation extraction has achieved
great success in extracting entities and entity relations from news
and microblogs in the open domain, and most of these relations
present structured event information. However, these methods
give low recall, because they neglect the fact that the
argument of an event may not be an entity. This is especially
true of Chinese: as a paratactic language (e.g., discourse-driven
and pro-drop), its text exhibits widespread ellipsis and more
flexible sentence structure [2]. Consider the
following discourse as a sample:
“
(E1)8830
(E2)
12.46(E3)1(E4)324
(E5)
8(E6)56880(E7),
(E8)
6988 (E9)13017”
(According to the report of the Pu’er City Civil Affairs
Bureau (E1), as of 8:30 on the 8th, the Yunnan earthquake
had caused (E2) 12.46 million people to be affected (E3) in
Jinggu County, Simao District, Zhenyuan County, Linxiang
District, Shuangjiang County, etc.; one person was
killed (E4), 324 people were wounded (E5), eight
people were injured (E6), 56880 people were evacuated (E7),
6988 houses collapsed (E8), and 13017 houses were
severely damaged (E9).)
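To make the rule-based category described earlier more concrete, the following is a minimal sketch of reading a trigger and its arguments off nsubj and dobj relations. It is an illustration only, not the method proposed in this paper: the names `Dep`, `AtomicEvent`, and `extract_events` are hypothetical, and the dependency arcs are written by hand over the English glosses of two clauses of the sample rather than produced by a parser.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Dep(NamedTuple):
    """One dependency arc (hypothetical container, not a parser's API)."""
    relation: str   # dependency label, e.g. "nsubj" or "dobj"
    head: str       # governing word, treated as a candidate event trigger
    dependent: str  # dependent word, treated as a candidate argument


class AtomicEvent(NamedTuple):
    subject: Optional[str]  # None plays the role of "nil"
    predicate: str
    obj: Optional[str]


def extract_events(deps):
    """Group nsubj/dobj arcs by head word and emit one event per head.

    Simplification: heads are keyed by their surface string, so two
    occurrences of the same trigger word would be merged. A real system
    would key on token positions instead.
    """
    slots_by_head = defaultdict(dict)
    for dep in deps:
        if dep.relation in ("nsubj", "dobj"):
            # Keep the first argument seen for each slot.
            slots_by_head[dep.head].setdefault(dep.relation, dep.dependent)
    return [
        AtomicEvent(slots.get("nsubj"), head, slots.get("dobj"))
        for head, slots in slots_by_head.items()
    ]


# Hand-written arcs (an assumption) for the clauses marked E5 and E6.
deps = [
    Dep("nsubj", "wounded", "people"),
    Dep("nummod", "people", "324"),    # ignored: not nsubj or dobj
    Dep("dobj", "injured", "people"),
]
events = extract_events(deps)
for event in events:
    print(event)
```

Under these assumed arcs, the first event has a subject but no object and the second has an object but no subject. Any arc the parser misses or mislabels becomes a missing or wrong argument, which is the dependence on the parser noted for rule-based methods.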
In the above discourse, we can extract 9 atomic events
by exploiting dependency relations, but only 6 events
with the open relation extraction tool ZORE [12]. However,
this sample exhibits several phenomena that stem from the
above-mentioned characteristics of Chinese.
First, the forms of these atomic events are diverse because
of the flexible sentence structure. For example, the event
E5 “(people), (wounded), nil” is similar to the event
E6 “nil, (injured), (people)”, but their syntactic
structures are different. Intuitively, events of this kind
need a unified form. Second, some events may lose their
arguments because of ellipsis or a long distance between
the arguments and the trigger. For example, the subject
of E7 “nil, (evacuated), (people)” is lost because
the dependency relation is missing. Discourse-level or
cross-document information should be exploited to find the lost
2016 IEEE 28th International Conference on Tools with Artificial Intelligence
2375-0197/16 $31.00 © 2016 IEEE
DOI 10.1109/ICTAI.2016.128
843