Dependency-Based N-Gram Models for General Purpose Sentence Realisation
Yuqing Guo
NCLT, School of Computing
Dublin City University
Dublin 9, Ireland
yguo@computing.dcu.ie
Josef van Genabith
NCLT, School of Computing
Dublin City University
IBM CAS, Dublin, Ireland
josef@computing.dcu.ie
Haifeng Wang
Toshiba (China)
Research & Development Center
Beijing, 100738, China
wanghaifeng@rdc.toshiba.com.cn
Abstract
We present dependency-based n-gram models for general-purpose, wide-coverage, probabilistic sentence realisation. Our method linearises unordered dependencies in input representations directly rather than via the application of grammar rules, as in traditional chart-based generators. The method is simple, efficient, and achieves competitive accuracy and complete coverage on standard English (Penn-II, 0.7440 BLEU, 0.05 sec/sent) and Chinese (CTB6, 0.7123 BLEU, 0.14 sec/sent) test data.
1 Introduction
Sentence generation,¹ or surface realisation, can be described as the problem of producing syntactically, morphologically, and orthographically correct sentences from a given semantic or syntactic representation.

¹ In this paper, the term “generation” is used generally for what is more strictly referred to by the term “tactical generation” or “surface realisation”.
Most general-purpose realisation systems developed to date transform the input into surface form via the application of a set of grammar rules based on particular linguistic theories, e.g. Lexical Functional Grammar (LFG), Head-Driven Phrase Structure Grammar (HPSG), Combinatory Categorial Grammar (CCG), Tree Adjoining Grammar (TAG), etc. These grammar rules are either carefully handcrafted, such as those used in FUF/SURGE (Elhadad, 1991), LKB (Carroll et al., 1999), OpenCCG (White, 2004) and XLE (Crouch et al., 2007), or created semi-automatically (Belz, 2007), or fully automatically extracted from annotated corpora, like the HPSG (Nakanishi et al., 2005), LFG (Cahill and van Genabith, 2006; Hogan et al., 2007) and CCG (White et al., 2007) resources derived from the Penn-II Treebank (PTB) (Marcus et al., 1993).
Over the last decade, probabilistic models have become widely used in the field of natural language generation (NLG), often in the form of a realisation ranker in a two-stage generation architecture. The two-stage methodology is characterised by a separation between generation and selection: rule-based methods are used to generate a space of possible paraphrases, and statistical methods are used to select the most likely realisation from that space. By and large, two kinds of statistical model are used in the rankers to choose output strings:
• N-gram language models over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al., 2007); a minimal sketch of such a ranker is given after this list.
• Log-linear models with different syntactic and semantic features (Velldal and Oepen, 2005; Nakanishi et al., 2005; Cahill et al., 2007).
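To make the selection step concrete, the following sketch shows how a word-level trigram ranker of the kind cited above could pick the best string from a candidate space. It is an illustrative reconstruction rather than the implementation of any of the cited systems; the add-alpha smoothing, the helper names (train_trigram_lm, rank_candidates) and the toy training data and candidate set are assumptions made for this example only.

import math
from collections import defaultdict

def train_trigram_lm(sentences):
    # Collect bigram and trigram counts from tokenised training sentences.
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 2], toks[i - 1])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return tri, bi

def trigram_logprob(candidate, tri, bi, vocab_size, alpha=1.0):
    # Add-alpha smoothed trigram log-probability of one candidate string.
    toks = ["<s>", "<s>"] + candidate + ["</s>"]
    logp = 0.0
    for i in range(2, len(toks)):
        num = tri[(toks[i - 2], toks[i - 1], toks[i])] + alpha
        den = bi[(toks[i - 2], toks[i - 1])] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

def rank_candidates(candidates, tri, bi, vocab_size):
    # Selection stage of the two-stage architecture: return the most likely string.
    return max(candidates, key=lambda c: trigram_logprob(c, tri, bi, vocab_size))

# Toy usage: the generation stage (not shown) would enumerate the candidates;
# here they are hard-coded word-order variants of the same content.
training = [["the", "dog", "barked", "loudly"], ["the", "cat", "slept"]]
tri, bi = train_trigram_lm(training)
vocab_size = len({w for s in training for w in s} | {"</s>"})
candidates = [["the", "dog", "barked", "loudly"],
              ["loudly", "the", "dog", "barked"]]
print(rank_candidates(candidates, tri, bi, vocab_size))

The log-linear rankers in the second family generalise this selection step by replacing the single language-model score with a weighted combination of syntactic and semantic features.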
To date, however, probabilistic models that learn direct mappings from generation input to surface strings, without constructing a grammar, have rarely been explored. An exception is Ratnaparkhi (2000), who presents maximum entropy models that learn attribute ordering and lexical choice for sentence generation from a semantic representation of attribute-value pairs, restricted to an air travel domain.