Personal Attributes Extraction in Chinese Text Bakeoff in CLP 2014:
Overview
Ruifeng Xu, Shuai Wang, Feng Shi
Key Laboratory of Network Oriented
Intelligent Computation, Shenzhen Graduate
School, Harbin Institute of Technology,
China
xuruifeng@hitsz.edu.cn
Jian Xu
Department of Computing, The Hong Kong
Polytechnic University, Hong Kong
csjxu@comp.polyu.edu.hk
Abstract
This paper presents an overview of the
Personal Attributes Extraction in Chinese
Text Bakeoff at CLP 2014. Personal
attribute extraction plays an important
role in information extraction, event
tracking, entity disambiguation and other
related research areas. This task is
designed to evaluate techniques for
extracting person-specific attributes from
unstructured Chinese texts. It is
similar to slot filling, but focuses on
person attributes. The task poses some
challenging issues because Chinese
named entities are easily confused with
common words and Chinese lacks the
capitalization cues available in English.
The task organizers manually constructed
the query names and the corresponding
documents. The values or presence of
25 predefined attributes in the texts
were annotated to construct the
training and test datasets. The bakeoff
results achieved by the participants show
good progress in this field.
1 Introduction
The Personal Attributes Extraction in Chinese
Text task is designed to evaluate techniques for
extracting person-specific attributes, such as birth
date, spouse, children, education, and title,
from unstructured Chinese texts. These
techniques play an important role in information
extraction, event tracking, entity disambiguation
and other related research areas.
Slot filling has been one of the shared tasks
in the TAC KBP workshop since
2009 [1]. Generally speaking, the mainstream
techniques for slot filling and person attribute
extraction fall into two major
camps: rule-based approaches and
statistics-based approaches [2,3,4]. A rule-based
approach normally defines the extraction rules
manually or learns them automatically. The
rules play the key role in this approach: whenever
the system finds constraint information in the
text that matches a rule, it can
extract the target information.
The statistics-based approach, by contrast, is
readily portable to this extraction problem. Several
statistical machine learning models, such as the
Hidden Markov Model (HMM) and Conditional
Random Fields (CRFs), have been employed. The
shortcoming of this approach is that it requires
a large amount of training data, which is often
unavailable.
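To make the rule-based idea concrete, the following is a minimal sketch, assuming two hand-written regular-expression rules. The attribute names (`birth_date`, `spouse`), the patterns, and the sample sentence are illustrative assumptions; they are not taken from the bakeoff data or from any participating system.

```python
import re

# Hypothetical hand-written rules: attribute name -> regex with one capture group.
# Neither the attribute names nor the patterns come from the bakeoff itself.
RULES = {
    # e.g. "1975年3月12日出生" ("born on 12 March 1975")
    "birth_date": re.compile(r"(\d{4}年\d{1,2}月\d{1,2}日)出生"),
    # e.g. "妻子是李娜" ("his wife is Li Na"); the name is assumed to be 2-3 characters
    "spouse": re.compile(r"(?:妻子|丈夫)(?:是|为)?([\u4e00-\u9fff]{2,3})"),
}


def extract_attributes(text):
    """Return {attribute: [values]} for every rule that matches the text."""
    found = {}
    for attribute, pattern in RULES.items():
        values = pattern.findall(text)
        if values:
            found[attribute] = values
    return found


# Invented sample sentence: "Zhang Wei, born on 12 March 1975, his wife is Li Na."
sample = "张伟,1975年3月12日出生,妻子是李娜。"
print(extract_attributes(sample))
```

The fixed-length name pattern is a crude stand-in for named entity recognition. A practical system would likely need word segmentation and name recognition before applying such rules.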
Currently, there is limited existing work on
personal attribute extraction in Chinese text.
Compared with work on English, the
characteristics of the Chinese language, including the
need for word segmentation, the confusion of
named entities with common words, and the lack of
capitalization cues, make person attribute
extraction in Chinese more difficult.
The person attribute extraction task for
Chinese text in the CLP 2014 bakeoff is designed on
the basis of the slot filling task in the TAC KBP
workshop [1]. The task organizers provide a
collection of documents corresponding to a target
person and a knowledge base that contains a
partial list of attributes for that person.
Participants are required to extract additional
attributes from the collection of documents. The
task is similar to slot filling, but it focuses on
person attribute extraction. Furthermore, the
collection of documents is not limited to
news corpora.
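The task setup described above can be sketched as a small data-flow example: a knowledge base holds a partial list of attributes for a target person, and a system reports only the attributes it adds. This is a minimal sketch under stated assumptions; the function name `fill_missing_slots`, the dictionary layout, and the sample values are hypothetical and do not reflect the official data format of the bakeoff.

```python
def fill_missing_slots(known, extracted):
    """Return only the attribute values that the knowledge base does not already hold.

    Both arguments map an attribute name to a list of values (assumed layout).
    """
    additional = {}
    for attribute, values in extracted.items():
        already_known = known.get(attribute, [])
        new_values = [v for v in values if v not in already_known]
        if new_values:
            additional[attribute] = new_values
    return additional


# Invented example: the knowledge base already lists a birth date,
# and a system has extracted a birth date and a spouse from the documents.
knowledge_base = {"birth_date": ["1975年3月12日"]}
extracted = {"birth_date": ["1975年3月12日"], "spouse": ["李娜"]}
print(fill_missing_slots(knowledge_base, extracted))
```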