A New Semantic Role Labeling Framework Driven by the Chinese AMR Corpus: Simplification and Efficiency Gains


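Semantic role labeling (SRL) is one of the core tasks in Chinese natural language processing. It identifies the semantic role that each constituent plays with respect to a predicate, such as the agent or the patient of an action. The construction of SRL corpora, however, currently faces three key problems.

First, there is disagreement over the number of semantic roles and the definition of frames. Researchers understand and classify semantic roles by different standards, which hurts the consistency and interoperability of datasets. An ideal framework should define role types clearly and generally enough to support comparison across studies.

Second, static predicate frames have difficulty covering dynamic predicate usages. The same verb can be used differently depending on context, so an SRL scheme needs enough flexibility to capture this variation. For example, 吃 ("eat") takes the food as its object in 吃饭 ("eat a meal") but a location in 吃食堂 ("eat at the canteen").

Third, SRL systems struggle with dropped roles. A participant may contribute to the meaning of a sentence while being omitted from, or not explicit in, its surface structure, which makes it hard to annotate under a traditional SRL framework.

Abstract Meaning Representation (AMR), a recent method of representing sentence meaning, offers dynamic mechanisms for these cases. AMR abstracts a sentence into a concept graph in which each node is a concept and each edge is a relation between concepts. This makes it possible to mark semantic roles that are not expressed directly in the syntax but are essential to understanding.

The researchers behind the Chinese AMR corpus explore how these properties of AMR can improve SRL annotation. They design a simpler and more efficient framework aimed at the three problems above, one that covers static frames while also handling varied verb usages and dropped roles. The new framework may include the following aspects:

1. **Unified role definitions**: following AMR, a general inventory of semantic roles based on concepts and relations, with weaker assumptions about specific verbs and contexts.
2. **Dynamic frame extension**: AMR's dynamic mechanisms allow flexible annotation of changing verb usages, capturing polysemy and context dependence.
3. **Handling of dropped roles**: by analyzing the deeper structure and concept relations of a sentence, semantic roles missing from traditional SRL can still be represented in the AMR graph.
4. **Automation and efficiency**: machine learning and deep learning techniques could learn and apply the new annotation strategy automatically, improving annotation accuracy and efficiency.

The study explores and empirically tests an SRL framework better suited to the characteristics and semantic complexity of Chinese. Its results matter for the overall performance of Chinese natural language processing and offer a new perspective and method for later SRL research and applications.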

An Easier and Efficient Framework to Annotate Semantic Roles: Evidence from the Chinese AMR Corpus

Li Song, Yuan Wen, Sijia Ge, Bin Li, Junsheng Zhou, Weiguang Qu (2, 3), Nianwen Xue

1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing, 210024, China

2. School of Computer Science and Technology, Nanjing Normal University, Nanjing, 210023, China

3. Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou, 350108, China

4. Computer Science Department, Brandeis University, Waltham, 02453, USA

songli.njnu@gmail.com

Abstract

Semantic role labeling (SRL) is one of the fundamental tasks in Chinese language processing. At present, the construction of SRL corpora faces three major problems. First, there are disagreements over the number of semantic roles and the definition of frames. Second, static predicate frames can hardly cover dynamic predicate usages. Third, dropped semantic roles cannot be annotated. The newly designed Abstract Meaning Representation (AMR) is a novel method of representing the meaning of sentences, and it offers dynamic mechanisms that provide better solutions to these three problems. We use the Chinese AMR corpus of 5,000 sentences to make a detailed comparison between AMR and other SRL resources. Data analysis shows that in AMR it is easier to annotate the semantic roles of a predicate, thanks to the simplified distinction between core roles and non-core roles. In addition, 1,045 tokens of dropped roles are annotated under this new framework. This indicates that AMR offers a better solution for Chinese SRL and sentence meaning processing.

Keywords: Abstract Meaning Representation, predicate framework, semantic role, language knowledgebase

1 Introduction

Automatic semantic analysis is one of the core tasks in Natural Language Processing (NLP). Building semantic resources is therefore the first step for machine-learning-based NLP systems. In semantic representation, the semantic relations between predicates and their semantic roles form the backbone of the sentence structure. Building predicate frames that describe such information has thus become an important issue in linguistics and NLP. There are many semantic role labeling (SRL) systems and SRL resources in different languages, but these SRL corpora share several problems.

First, the number of semantic role labels for predicates is still under discussion in linguistics. VerbNet uses 30 general thematic role labels to represent semantic relations (Kipper et al., 2000). Sinica Treebank distinguishes necessary from unnecessary arguments and uses 60 semantic role labels, 12 of which can represent necessary arguments (Chen et al., 2003). FrameNet defines semantic roles on a per-frame basis (Baker et al., 1998), so it avoids determining how many semantic roles a language needs; there are 1,224 frames in FrameNet and 323 frames in Chinese FrameNet (CFN). PropBank (Palmer et al., 2005) and Chinese Proposition Bank (CPB) (Xue & Palmer, 2009) both define 5 predicate-specific semantic roles for the core arguments and 13 semantic roles that are consistent across predicates for non-core arguments. The number of role labels thus differs considerably across SRL resources, mainly because these resources are based on different theoretical backgrounds.

Footnote (English AMR Sembank): https://catalog.ldc.upenn.edu/LDC2017T10
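The label counts quoted in this paragraph can be put side by side in a few lines of Python. This is only an illustrative sketch: the `RoleInventory` class and its field names are hypothetical, and every figure is copied from the paragraph rather than taken from the resources themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoleInventory:
    resource: str
    core: Optional[int]   # core or "necessary" labels, if the text states them
    total: Optional[int]  # total labels; None when roles are defined per frame


# Figures as quoted in the text above (assumption: no other sources consulted).
INVENTORIES = [
    RoleInventory("VerbNet", None, 30),
    RoleInventory("Sinica Treebank", 12, 60),
    RoleInventory("PropBank / CPB", 5, 5 + 13),  # 5 core + 13 non-core
    RoleInventory("FrameNet", None, None),       # roles defined per frame
]


def summary(inv: RoleInventory) -> str:
    """One-line description of a resource's role inventory."""
    if inv.total is None:
        return f"{inv.resource}: roles defined per frame"
    core = "not stated" if inv.core is None else str(inv.core)
    return f"{inv.resource}: {inv.total} labels (core/necessary: {core})"


for inv in INVENTORIES:
    print(summary(inv))

# Gap between the largest and smallest fixed inventories.
fixed = [inv.total for inv in INVENTORIES if inv.total is not None]
spread = max(fixed) - min(fixed)
print("spread between fixed inventories:", spread)
```

FrameNet is kept out of the numeric comparison because, as the text notes, it defines roles per frame instead of fixing a global inventory.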

Second, it is hard for static predicate frames to cover dynamic predicate usages. Predicate frames that do not distinguish core from non-core roles have difficulty representing whether a semantic role is necessary for the predicate. Resources that define core roles in a predicate-independent manner, just as they do non-core roles, can neither resolve the collision between core and non-core roles nor represent multi-functional semantic roles.

Third, limited by their annotation mechanisms, most SRL systems are unable to annotate the dropped semantic roles of predicates. For example, most SRL systems have difficulty correctly representing the meaning of the nominal phrase "the injured", whose head word is dropped, or of "one of which…", which drops the noun that appeared in the preceding clause.

Abstract Meaning Representation (AMR), a new method of representing the meaning of sentences, defines semantic roles differently from other SRL systems (Banarescu et al., 2013). It handles core and non-core roles in separate, specialized ways. AMR annotates core arguments using the same five predicate-specific core role labels as PropBank and adopts the predicate frame lexicon extracted from PropBank. The non-core role labels, which are general to all predicates, number up to 40. At the same time, AMR allows dropped semantic roles to be added back into the representation of a sentence. Through these dynamic mechanisms, AMR can provide better solutions to the above three problems. The English AMR Sembank now includes 39,260 sentences and has become an important semantic resource.
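As a rough illustration of these mechanisms, the sketch below builds a tiny AMR-style graph for "the injured", adds back the dropped head as a generic `person` concept, and separates core from non-core role labels. The `Node` class, the helper functions, the lowercase role spelling, and the frame name `injure-01` are assumptions made for this example; they follow common AMR conventions and are not taken from the CAMR annotation files.

```python
from dataclasses import dataclass, field

# Core roles are predicate-specific (arg0-arg4); any other label is
# treated here as a general non-core role such as time or location.
CORE_ROLES = {"arg0", "arg1", "arg2", "arg3", "arg4"}


@dataclass
class Node:
    var: str       # variable name, e.g. "p"
    concept: str   # concept label, e.g. "person"
    edges: list = field(default_factory=list)  # (role, Node) pairs

    def add(self, role: str, child: "Node") -> "Node":
        self.edges.append((role, child))
        return self


def is_core(role: str) -> bool:
    """True for arg0-arg4, including inverse forms such as arg1-of."""
    base = role[:-3] if role.endswith("-of") else role
    return base in CORE_ROLES


def to_penman(node: Node, indent: int = 0) -> str:
    """Render the graph in a PENMAN-like bracketed notation."""
    parts = [f"({node.var} / {node.concept}"]
    pad = " " * (indent + 4)
    for role, child in node.edges:
        parts.append(f"\n{pad}:{role} {to_penman(child, indent + 4)}")
    return "".join(parts) + ")"


# "the injured": the head noun is absent from the surface string, so a
# generic `person` concept is added back and linked to the predicate.
injured = Node("p", "person").add("arg1-of", Node("i", "injure-01"))
print(to_penman(injured))
print(is_core("arg1-of"), is_core("time"))
```

The inverse role `arg1-of` lets the added-back concept sit at the root of the graph while still filling a core role of the predicate, which is one way a dropped role can be made explicit.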

Following the guidelines of English AMR, Li et al. (2016) have developed annotation specifications for Chinese AMR (CAMR) that take the linguistic characteristics of Chinese into account. CAMR uses the same 5 core role labels (arg0-arg4) as AMR and 44 non-core role labels (time, location, cause, etc.), four of which are added to meet the needs of Chinese annotation. The predicate frame lexicon of CAMR is extracted from the corpus (Bai & Xue, 2016) of Chinese Proposition Bank (CPB) (Xue & Palmer, 2009). In addition, Li et al. (2017) design a framework for aligning the concepts and relations to word
