Dependency-Based N-Gram Models for General Purpose Sentence Realisation
Yuqing Guo
NCLT, School of Computing
Dublin City University
Dublin 9, Ireland
yguo@computing.dcu.ie
Josef van Genabith
NCLT, School of Computing
Dublin City University
IBM CAS, Dublin, Ireland
josef@computing.dcu.ie
Haifeng Wang
Toshiba (China)
Research & Development Center
Beijing, 100738, China
wanghaifeng@rdc.toshiba.com.cn
Abstract
We present dependency-based n-gram models for general-purpose, wide-coverage, probabilistic sentence realisation. Our method linearises unordered dependencies in input representations directly rather than via the application of grammar rules, as in traditional chart-based generators. The method is simple, efficient, and achieves competitive accuracy and complete coverage on standard English (Penn-II, 0.7440 BLEU, 0.05 sec/sent) and Chinese (CTB6, 0.7123 BLEU, 0.14 sec/sent) test data.
1 Introduction
Sentence generation,¹ or surface realisation, can be described as the problem of producing syntactically, morphologically, and orthographically correct sentences from a given semantic or syntactic representation.

¹ In this paper, the term “generation” is used generally for what is more strictly referred to by the term “tactical generation” or “surface realisation”.
Most general-purpose realisation systems developed to date transform the input into surface form via the application of a set of grammar rules based on particular linguistic theories, e.g. Lexical Functional Grammar (LFG), Head-Driven Phrase Structure Grammar (HPSG), Combinatory Categorial Grammar (CCG), Tree Adjoining Grammar (TAG), etc. These grammar rules are either carefully handcrafted, such as those used in FUF/SURGE (Elhadad, 1991), LKB (Carroll et al., 1999), OpenCCG (White, 2004) and XLE (Crouch et al., 2007), or created semi-automatically (Belz, 2007), or fully automatically extracted from annotated corpora, like the HPSG (Nakanishi et al., 2005), LFG (Cahill and van Genabith, 2006; Hogan et al., 2007) and CCG (White et al., 2007) resources derived from the Penn-II Treebank (PTB) (Marcus et al., 1993).
Over the last decade, probabilistic models have become widely used in the field of natural language generation (NLG), often in the form of a realisation ranker in a two-stage generation architecture. The two-stage methodology is characterised by a separation between generation and selection: rule-based methods are used to generate a space of possible paraphrases, and statistical methods are used to select the most likely realisation from that space. By and large, two kinds of statistical model are used in the rankers to choose output strings:
• N-gram language models over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al., 2007); a minimal sketch of such a ranker is given after this list.
• Log-linear models with different syntactic and semantic features (Velldal and Oepen, 2005; Nakanishi et al., 2005; Cahill et al., 2007).
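To make the selection step concrete, the following sketch shows how a word-level trigram ranker of the kind cited above could pick the best string from a candidate space. It is an illustrative reconstruction rather than the implementation of any of the cited systems; the add-alpha smoothing, the helper names (train_trigram_lm, rank_candidates) and the toy training data and candidate set are assumptions made for this example only.

import math
from collections import defaultdict

def train_trigram_lm(sentences):
    # Collect bigram and trigram counts from tokenised training sentences.
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 2], toks[i - 1])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return tri, bi

def trigram_logprob(candidate, tri, bi, vocab_size, alpha=1.0):
    # Add-alpha smoothed trigram log-probability of one candidate string.
    toks = ["<s>", "<s>"] + candidate + ["</s>"]
    logp = 0.0
    for i in range(2, len(toks)):
        num = tri[(toks[i - 2], toks[i - 1], toks[i])] + alpha
        den = bi[(toks[i - 2], toks[i - 1])] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

def rank_candidates(candidates, tri, bi, vocab_size):
    # Selection stage of the two-stage architecture: return the most likely string.
    return max(candidates, key=lambda c: trigram_logprob(c, tri, bi, vocab_size))

# Toy usage: the generation stage (not shown) would enumerate the candidates;
# here they are hard-coded word-order variants of the same content.
training = [["the", "dog", "barked", "loudly"], ["the", "cat", "slept"]]
tri, bi = train_trigram_lm(training)
vocab_size = len({w for s in training for w in s} | {"</s>"})
candidates = [["the", "dog", "barked", "loudly"],
              ["loudly", "the", "dog", "barked"]]
print(rank_candidates(candidates, tri, bi, vocab_size))

The log-linear rankers in the second family generalise this selection step by replacing the single language-model score with a weighted combination of syntactic and semantic features.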
To date, however, probabilistic models that learn direct mappings from generation input to surface strings, without constructing a grammar, have rarely been explored. An exception is Ratnaparkhi (2000), who presents maximum entropy models that learn attribute ordering and lexical choice for sentence generation from a semantic representation of attribute-value pairs, restricted to an air travel domain.