Open Relation Extraction from Chinese
Microblog Text
Jing Xu, Liang Gan, Zhou Yan, Quanyuan Wu, Yan Jia
School of Computer Science
National University of Defense Technology
Changsha, China
{jing.xu, gl, zhou.y, qyw, jy}@nudt.edu.cn
Abstract—In recent years, the rapid development of
microblogs has given entity relation extraction (ERE) a new
text source. However, the characteristics of microblog text also
pose challenges for ERE research. Taking these characteristics
into account, this paper proposes an unsupervised
open relation extraction (ORE) method named MICRO-ORE.
First, MICRO-ORE uses the left-right information entropy
method to automatically extract key phrases from microblog
texts and links them to external knowledge sources, which
normalizes the texts and adds semantic information.
Second, based on Chinese syntactic features, MICRO-ORE
formulates extraction rules to extract relation tuples.
We evaluate the proposed method on SINA microblog texts
and show that it extracts more information than
traditional relation extraction methods while meeting the accuracy
requirement. To the best of our knowledge, MICRO-ORE is the first
Chinese ORE method for microblog texts.
Keywords—microblog; semantic extension; open relation
extraction; information extraction
I INTRODUCTION
A microblog (micro blog) is a broadcast-style social
network platform for sharing real-time information based
on the relationships between users[1]. Compared with traditional
blogs, microblogs are “short, flexible, fast”. With the rapid
development of Web3.0 technologies, microblogs have become a
mainstream medium of online social networks, where users can
share, spread and access information. Because of their
convenience, efficiency and large user base, microblogs
release huge amounts of information every day, reflecting the
latest trends and ideas of the public. Therefore, microblogs
are an important information resource with great value for
making marketing decisions, gathering intelligence and
monitoring public opinion.
Relation extraction means determining whether a relation
holds between two entities and identifying its type.
Traditional relation extraction methods extract relations
between two entities in specific domains according to
predefined relation types. Because the relation types and the
corpus size are limited, these methods cannot keep up with
fast-growing, changing data. In the big data era, massive
network texts and diverse relation types have motivated
research on open relation extraction (ORE). The goal of ORE is
to remove the restriction to closed relation types and training
corpora, and to extract entities and the relation words between
them from massive network texts[2]. Because ORE uses words in
the text itself to express the relation between entities, and
does not restrict entity types, it can extract more information
than traditional methods.
ORE technology is mature for English text and has produced
several useful tools. TextRunner (Banko et al.) is the first
ORE system; it uses heuristic rules to obtain a training corpus
from the Penn Treebank and a CRF model to automatically
extract relation phrases and arguments from text[3]. WOE (Wu
and Weld) uses Wikipedia infobox information to label text and
extracts entities and their relations[4]. ReVerb (Fader et al.)
treats verbs and verb phrases as relation phrases, extracts the
nearest nouns on both sides as arguments, and then applies
syntactic and lexical constraints to improve extraction
accuracy[5].
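The verb-centred idea attributed to ReVerb above can be illustrated with a small sketch. This is a simplified illustration only, not ReVerb's implementation: it assumes the sentence is already tokenized and POS-tagged, the coarse tag names (NOUN, VERB, ADP) and the function name extract_triple are hypothetical, and none of ReVerb's actual syntactic or lexical constraints are reproduced.

```python
from typing import List, Optional, Tuple

# (word, coarse POS tag); the tag set used here is an assumption for illustration.
Token = Tuple[str, str]


def extract_triple(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (arg1, relation, arg2) for the first verb run flanked by nouns."""
    n = len(tokens)
    i = 0
    while i < n:
        if tokens[i][1] == "VERB":
            # Grow the relation phrase over a contiguous run of verbs/prepositions.
            j = i
            while j + 1 < n and tokens[j + 1][1] in ("VERB", "ADP"):
                j += 1
            # Nearest noun to the left of the relation phrase.
            left = next((w for w, t in reversed(tokens[:i]) if t == "NOUN"), None)
            # Nearest noun to the right of the relation phrase.
            right = next((w for w, t in tokens[j + 1:] if t == "NOUN"), None)
            if left is not None and right is not None:
                relation = " ".join(w for w, _ in tokens[i:j + 1])
                return (left, relation, right)
            i = j + 1  # No argument pair found; keep scanning after this run.
        else:
            i += 1
    return None


sentence = [
    ("Edison", "NOUN"), ("was", "VERB"), ("born", "VERB"),
    ("in", "ADP"), ("Milan", "NOUN"),
]
print(extract_triple(sentence))
```

Because the relation phrase is taken from the words of the sentence rather than from a predefined label set, the same routine applies to any verb it encounters, which is the property that distinguishes open extraction from the closed-type methods discussed earlier.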
Compared with English, Chinese syntax is more complex,
so there has been little ORE research on Chinese text. CORE
(Tseng et al.) uses a series of NLP techniques and
extraction rules to extract entity-relation triples from Chinese
free text[6]. UnCORE (Qin et al.) is an unsupervised Chinese
ORE method for large-scale network text that focuses on
extracting relations among persons, organizations and
locations[7].
Current ORE methods mainly focus on standardized
texts such as Wikipedia, news and Web pages, and no method
exists for microblog texts. Therefore, based on the
characteristics of microblog texts, we propose an unsupervised
ORE method, MICRO-ORE, to mine useful information from
microblog texts. The contributions are as follows.
(1) MICRO-ORE uses the left-right information entropy
method to automatically extract key phrases from microblog
texts and links them to external knowledge sources, which
normalizes the texts and adds semantic information. (2)
Based on Chinese syntactic features, MICRO-ORE
formulates rules to extract relations and arguments.
Arguments are phrases that describe behavior patterns or
carry expressive content; they have information value and are
not limited to entities. In addition, MICRO-ORE uses the
words in the text to express the relation between the entities,