Research of Semantic Role Labeling and Application in
Patent Knowledge Extraction
Ling’en Meng
Institute of Scientific and Technical
Information of China, Beijing
mengle2013@istic.ac.cn
Yanqing He*
Institute of Scientific and Technical
Information of China, Beijing
heyq@istic.ac.cn
Ying Li*
Institute of Scientific and Technical
Information of China, Beijing
liying@istic.ac.cn
ABSTRACT
Semantic Role Labeling (SRL) is the task of identifying the
arguments of a predicate and assigning semantically meaningful
labels to them. SRL is crucial to information extraction, question
answering, and machine translation. When applied to patent text,
existing SRL tools perform poorly because patent sentences are
long. To improve the performance of SRL on patents, this study
breaks each sentence in a patent abstract down into a simpler
structure and then labels semantic roles for the simplified
sentences. Finally, the semantic information and the semantic
frameworks of frequently used words are used to extract patent
knowledge. Our work demonstrates that the method used in this
article can improve the performance of an SRL system and obtain
useful knowledge from patents.
Categories and Subject Descriptors
I.2.7 [Computing Methodologies]: Language Constructs and
Features –Language parsing and understanding, Text analysis.
General Terms
Algorithms, Experimentation, Languages
Keywords
Semantic role labeling, Patent text, Patent knowledge extraction
1. INTRODUCTION
Semantic Role Labeling is the process of annotating the
predicate-argument structure in text with semantic labels. SRL
includes two sub-tasks: identifying the syntactic constituents that
are likely to be semantic roles, and labeling those constituents
with the correct semantic role [1]. Most current research on SRL
uses supervised learning methods, which include generative models
and discriminative models. Generative models were the first to be
used for SRL classification. They train quickly and depend only
weakly on the training corpus, but their limited descriptive power
and strong assumption of feature independence lead to
unsatisfactory performance. Discriminative models directly estimate
the final optimization target, the conditional probability. The
estimation is usually performed by iterative methods that search
for optimized coefficients. Discriminative models generally include
linear interpolation, SVM [2], Perceptron [3], SNoW (Sparse Network
of Winnows) [4], Boosting [5], Maximum Entropy, Decision tree,
Random forest [6], etc. Combining the results produced by multiple
classifiers is one direction of development and can obtain better
results than any single classifier. The supervised learning methods
above often depend on the quality of syntactic parsing and on
accurate SRL annotation. SRL is widely used in information
extraction, question answering, and machine translation.
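As an illustration of combining the results of multiple classifiers, the sketch below merges per-constituent role labels by majority vote. Majority voting is an assumption chosen for simplicity; the combination schemes used in the cited work may differ. The function name and the sample predictions are hypothetical.

```python
from collections import Counter


def combine_by_majority_vote(predictions):
    """Combine role labels from several classifiers by majority vote.

    predictions: one label sequence per classifier, all of the same
    length, where position i holds the label for constituent i.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must label the same constituents")

    combined = []
    for labels in zip(*predictions):
        # most_common(1) returns the most frequent label; on a tie the
        # label seen first (the earliest classifier) wins.
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three classifiers for two constituents.
classifier_a = ["ARG1", "ARGM-TMP"]
classifier_b = ["ARG1", "ARGM-LOC"]
classifier_c = ["ARG0", "ARGM-LOC"]

print(combine_by_majority_vote([classifier_a, classifier_b, classifier_c]))
```

In this toy case a single classifier's mistake on either constituent is outvoted by the other two.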
SRL is vitally important for the shallow semantic parsing of text,
especially patent text. Patent texts contain useful information
about technologies. Analyzing patent texts makes it possible to
understand the current state of patents, identify hotspots in a
timely manner, and grasp technology trends. The existing patent
platforms Patsnap (http://cn.patsnap.com/) and TechGlory (a patent
risk control and competitive intelligence analysis system,
http://www.tek-glory.cn/), as well as Wang Xuefeng [7], use manually
annotated corpora, which is costly and slow. Researchers have also
adopted automatic extraction methods to obtain key information from
patent texts. Jiang Caihong [8] constructs an ontology and writes
rules for patent knowledge extraction. Zhai Dongsheng [9] uses
ontology knowledge and a semantic inference measure to construct a
patent reference network.
This article combines SRL information with semantic framework rules
to extract the technical topic of a patent from its abstract. Patent
text is typically characterized by long sentences with complex
structures. When SRL systems are ported to patent texts, they
produce poor results, which reduces the effectiveness of semantic
analysis and knowledge extraction. Compare the following examples:
Long sentence:
A plurality of resonance units are arranged [ARGM-TMP
in the shell], wherein one end of each resonance unit is fixed on
the inner wall at one side of the shell.
Simplified sentence:
A plurality of resonance units are arranged [ARGM-LOC in
the shell]
one end of each resonance unit is fixed on the inner wall at one
side of the shell.
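The simplification shown in the two examples above can be sketched as follows. This is only a minimal illustration under stated assumptions: the rule of splitting at a comma followed by a connective such as "wherein", the connective list, and the function name are hypothetical and are not taken from the method of this article.

```python
import re

# Hypothetical connectives at which a long patent sentence is split.
# This list is an assumption made for illustration only.
CONNECTIVES = ("wherein", "whereby", "characterized in that")


def split_long_sentence(sentence):
    """Split a sentence at ', <connective> ' boundaries into clauses."""
    alternatives = "|".join(re.escape(c) for c in CONNECTIVES)
    pattern = r",\s*(?:" + alternatives + r")\s+"
    clauses = re.split(pattern, sentence.strip())
    # Remove the trailing period and discard empty pieces.
    return [c.strip().rstrip(".").strip() for c in clauses if c.strip()]


long_sentence = (
    "A plurality of resonance units are arranged in the shell, wherein "
    "one end of each resonance unit is fixed on the inner wall at one "
    "side of the shell."
)

for clause in split_long_sentence(long_sentence):
    print(clause)
```

Each resulting clause could then be passed separately to an SRL tool, so that the labeler sees a short sentence with a single predicate.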
Clearly, the semantic tag ARGM-TMP (which represents time; see
Section 2.2 for details) in the long sentence is wrong. The correct
tag is ARGM-LOC (which represents location), as in the simplified
sentence. To resolve this problem, our approach splits each long,
complicated sentence in patent abstracts into a simpler structure,
then labels semantic roles for the simplified sentences, and finally
combines all the semantic labels with the semantic framework to
extract the patent topic. Finally,
Copyright © 2014 for the individual papers by the papers' authors.
Copying permitted for private and academic purposes.
This volume is published and copyrighted by its editors.
Published at Ceur-ws.org
Proceedings of the First International Workshop on Patent Mining and Its
Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014.
At KONVENS'14, October 8-10, 2014, Hildesheim, Germany.