Learning Domain Differences
Automatically for Dependency Parsing Adaptation
Mo Yu, Tiejun Zhao, and Yalong Bai
Harbin Institute of Technology
China
{yumo, tjzhao, ylbai}@mtlab.hit.edu.cn
Abstract
In this paper, we address the relation between domain
differences and domain adaptation for dependency
parsing. Our quantitative analyses showed
that it is the inconsistent behavior of the same features
across domains, rather than word or feature coverage,
that is the major cause of the performance decrease
of an out-domain model. We further studied
these ambiguous features in depth and found that
the set of ambiguous features is small and has a
concentrated distribution. Based on these analyses, we
proposed a DA method. The DA method can automatically
learn which features are ambiguous across
domains according to the errors made by the out-domain
model on in-domain training data. Our method is
also extended to utilize multiple out-domain models.
The results of dependency parser adaptation
from WSJ to Genia and QuestionBank showed that
our method achieved significant improvements on
small in-domain datasets, where DA is most
needed. Additionally, we improved on the
best published results of the CoNLL07 shared task
on domain adaptation, which confirms the significance
of our analyses and our method.
1 Introduction
Statistical models are widely used in the field of dependency
parsing. However, current models are usually trained and
tested on data from the same domain. When the test data belongs
to a domain different from the training data, the performance
of current dependency parsing models degrades
greatly. Therefore, when the labeled Treebank of the target
domain is insufficient, it is difficult to obtain accurate parsing
results in that domain.
To quickly adapt parsers to new domains where little in-domain
labeled data is available, various techniques have
been proposed. Most parser domain adaptation (DA) methods
need no labeled data from the target domain, e.g. self-training
[McClosky et al., 2008; Sagae, 2010], co-training
[Steedman et al., 2003; Sagae and Tsujii, 2007] and word
clustering approaches [Candito et al., 2011]. These unsupervised
methods improve performance by helping parsers
cover more domain-specific words or features [McClosky and
Charniak, 2008].
However, as will be shown in this paper, word and feature
coverage is not the only factor affecting cross-domain performance.
Specifically, we take the WSJ corpus and the Genia corpus
[Tateisi et al., 2005] as examples. In our analysis, even
after we added gold POS tags and eliminated the gap in word
coverage, the performance decline was still not alleviated
much. Instead, it is the ambiguous features, which behave inconsistently
in different domains, that cause the performance
drop. In addition, Dredze et al. [2007] pointed out that domain
differences may also arise from different annotation guidelines
between Treebanks.
The above findings indicate that some labeled data is needed
to handle such differences. Unlike unsupervised
methods, which have difficulty detecting and handling these
differences, current supervised and semi-supervised parser adaptation
methods [Hall et al., 2011] have been shown to achieve better results. However,
they do not directly address the domain differences on the features
discussed above.
In this paper, we try to learn which features are ambiguous
between domains with the help of only a small in-domain labeled
dataset. The key idea is to learn which features are more
likely to be associated with errors, based on the in-domain training
data. The model can then identify and correct the unreliable
arcs based on the ambiguous features they contain,
while still keeping as many reliable arcs output by the out-domain
model as possible.
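The key idea above can be illustrated with a minimal sketch. Here, each arc predicted by a hypothetical out-domain model on the in-domain training data is paired with a flag marking whether it disagrees with the gold tree; features with a high error rate are flagged as ambiguous. The function name, feature strings, and the simple error-rate scoring are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter

def find_ambiguous_features(arcs, threshold=0.5, min_count=2):
    """Estimate which features behave inconsistently across domains.

    `arcs` is a list of (features, is_error) pairs: the set of features
    fired on an arc predicted by the out-domain model on in-domain
    training data, and whether that arc disagrees with the gold tree.
    Returns the features whose error rate exceeds `threshold`, ignoring
    rare features (fewer than `min_count` occurrences) to avoid
    overfitting the small in-domain dataset.
    """
    total, errors = Counter(), Counter()
    for features, is_error in arcs:
        for f in features:
            total[f] += 1
            if is_error:
                errors[f] += 1
    return {f for f in total
            if total[f] >= min_count and errors[f] / total[f] > threshold}

# Toy example with hypothetical POS-pair features:
arcs = [
    ({"pos:NN->IN"}, True),               # out-domain model erred here
    ({"pos:NN->IN"}, True),
    ({"pos:NN->IN", "dist:2"}, False),
    ({"pos:DT->NN"}, False),
    ({"pos:DT->NN"}, False),
]
print(find_ambiguous_features(arcs))      # only "pos:NN->IN" is flagged
```

Arcs containing a flagged feature would then be treated as unreliable and reconsidered, while arcs built only from consistent features are kept from the out-domain model's output.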
There are two major contributions in this paper. First, in
Section 2, quantitative analyses are performed to find out
which types of domain differences most affect the cross-domain
performance of a parser. As far as we know, few
works [Gildea, 2001] have focused on this problem. Second,
based on some general rules found in the analyses, in Section
3 we propose a method to automatically learn domain differences
from a small in-domain dataset while avoiding
overfitting. Experimental results are shown in Section 4.
Section 5 gives the conclusion.
2 Analysis on Domain Differences
2.1 Experimental Settings
In this section, we use the Genia and WSJ corpora as the in-domain
and out-domain data, respectively. For the WSJ corpus, sections
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence