Dependency-Based N-Gram Models for
General Purpose Sentence Realisation
Yuqing Guo
NCLT, School of Computing
Dublin City University
Dublin 9, Ireland
Josef van Genabith
NCLT, School of Computing
Dublin City University
IBM CAS, Dublin, Ireland
Haifeng Wang
Toshiba (China)
Research & Development Center
Beijing, 100738, China
Abstract
We present dependency-based n-gram
models for general-purpose, wide-
coverage, probabilistic sentence realisa-
tion. Our method linearises unordered
dependencies in input representations
directly rather than via the application
of grammar rules, as in traditional chart-
based generators. The method is simple,
efficient, and achieves competitive accu-
racy and complete coverage on standard
English (Penn-II, 0.7440 BLEU, 0.05
sec/sent) and Chinese (CTB6, 0.7123
BLEU, 0.14 sec/sent) test data.
1 Introduction
Sentence generation, or surface realisation, can be
described as the problem of producing syntactically,
morphologically, and orthographically correct
sentences from a given semantic or syntactic
representation.
Most general-purpose realisation systems de-
veloped to date transform the input into sur-
face form via the application of a set of gram-
mar rules based on particular linguistic theories,
e.g. Lexical Functional Grammar (LFG), Head-
Driven Phrase Structure Grammar (HPSG), Com-
binatory Categorial Grammar (CCG), Tree Ad-
joining Grammar (TAG) etc. These grammar rules
are either carefully handcrafted, as those used in
FUF/SURGE (Elhadad, 1991), LKB (Carroll et al.,
1999), OpenCCG (White, 2004) and XLE (Crouch
et al., 2007), or created semi-automatically (Belz,
2007), or fully automatically extracted from annotated
corpora, like the HPSG (Nakanishi et
al., 2005), LFG (Cahill and van Genabith, 2006;
Hogan et al., 2007) and CCG (White et al.,
2007) resources derived from the Penn-II Treebank
(PTB) (Marcus et al., 1993).
Footnote: In this paper, the term “generation” is used generally for
what is more strictly referred to by the term “tactical generation”
or “surface realisation”.
Over the last decade, probabilistic models have
become widely used in the field of natural lan-
guage generation (NLG), often in the form of a re-
alisation ranker in a two-stage generation architec-
ture. The two-stage methodology is characterised
by a separation between generation and selection,
in which rule-based methods are used to generate a
space of possible paraphrases, and statistical meth-
ods are used to select the most likely realisation
from the space. By and large, two kinds of statistical models
are used in the rankers to choose output strings:
• N-gram language models over different units,
such as word-level bigram/trigram mod-
els (Bangalore and Rambow, 2000; Langk-
ilde, 2000), or factored language models inte-
grated with syntactic tags (White et al., 2007).
• Log-linear models with different syntactic
and semantic features (Velldal and Oepen,
2005; Nakanishi et al., 2005; Cahill et al.,
To date, however, probabilistic models learn-
ing direct mappings from generation input to sur-
face strings, without the effort to construct a gram-
mar, have rarely been explored. An exception is
Ratnaparkhi (2000), who presents maximum en-
tropy models to learn attribute ordering and lexi-
cal choice for sentence generation from a semantic
representation of attribute-value pairs, restricted to
an air travel domain.
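For contrast with such direct-mapping approaches, the two-stage architecture described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited systems: the candidate generator here simply enumerates all word orders as a hypothetical stand-in for a rule-based generator, the toy corpus is invented for illustration, and the ranker is a word-level bigram model with add-one smoothing (the function names train_bigram, log_prob and realise are likewise assumptions).

```python
import math
from collections import Counter
from itertools import permutations


def train_bigram(corpus):
    """Collect context and bigram counts from sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    # Vocabulary of predictable tokens: all words plus the end marker.
    vocab = {w for s in corpus for w in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)


def log_prob(tokens, model):
    """Add-one smoothed bigram log-probability of a token sequence."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def realise(words, model):
    """Two-stage realisation: generate candidate orders, then select by score."""
    # Stage 1 (generation): hypothetical stand-in for a rule-based generator.
    candidates = permutations(words)
    # Stage 2 (selection): pick the candidate the language model prefers.
    return max(candidates, key=lambda cand: log_prob(cand, model))


# Toy training data, invented for illustration only.
corpus = ["the dog chased the cat", "the cat slept", "a dog barked"]
model = train_bigram(corpus)
best = realise(["barked", "dog", "a"], model)
print(" ".join(best))
```

Enumerating all permutations grows factorially with sentence length, which is one reason a practical two-stage system relies on a grammar to constrain the candidate space before ranking.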