Discriminative Training Methods for Hidden Markov Models:
Theory and Experiments with Perceptron Algorithms
Michael Collins
AT&T Labs-Research, Florham Park, New Jersey.
mcollins@research.att.com
Abstract
We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.
1 Introduction
Maximum-entropy (ME) models are justifiably a very popular choice for tagging problems in Natural Language Processing: for example see (Ratnaparkhi 96) for their use on part-of-speech tagging, and (McCallum et al. 2000) for their use on a FAQ segmentation task. ME models have the advantage of being quite flexible in the features that can be incorporated in the model. However, recent theoretical and experimental results in (Lafferty et al. 2001) have highlighted problems with the parameter estimation method for ME models. In response to these problems, they describe alternative parameter estimation methods based on Conditional Markov Random Fields (CRFs). (Lafferty et al. 2001) give experimental results suggesting that CRFs can perform significantly better than ME models.
In this paper we describe parameter estimation algorithms which are natural alternatives to CRFs. The algorithms are based on the perceptron algorithm (Rosenblatt 58), and the voted or averaged versions of the perceptron described in (Freund & Schapire 99). These algorithms have been shown by (Freund & Schapire 99) to be competitive with modern learning algorithms such as support vector machines; however, they have previously been applied mainly to classification tasks, and it is not entirely clear how the algorithms can be carried across to NLP tasks such as tagging or parsing.
This paper describes variants of the perceptron algorithm for tagging problems. The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithm through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger (an 11.9% relative reduction in error for POS tagging, a 5.1% relative reduction in error for NP chunking). Although we concentrate on tagging problems in this paper, the theoretical framework and algorithm described in section 3 of this paper should be applicable to a wide variety of models where Viterbi-style algorithms can be used for decoding: examples are Probabilistic Context-Free Grammars, or ME models for parsing. See (Collins and Duffy 2001; Collins and Duffy 2002; Collins 2002) for other applications of the voted perceptron to NLP problems.[1]
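To make the training idea concrete, the following Python sketch shows the general structure of such a perceptron loop with weight averaging in the style of (Freund & Schapire 99). It is an illustration under assumed helpers (viterbi_decode and features are hypothetical inputs), not the exact formulation given later in the paper.

    from collections import defaultdict

    def train_perceptron(data, viterbi_decode, features, epochs=5):
        """Sketch of perceptron training for tagging.

        data:           list of (words, gold_tags) pairs
        viterbi_decode: returns the highest-scoring tag sequence
                        for words under the current weights
        features:       maps a (words, tags) pair to a dict of
                        feature counts
        """
        weights = defaultdict(float)   # current parameter vector
        totals = defaultdict(float)    # running sum, for averaging
        n_updates = 0

        for _ in range(epochs):
            for words, gold_tags in data:
                pred_tags = viterbi_decode(words, weights)
                if pred_tags != gold_tags:
                    # Additive update: promote features of the gold
                    # sequence, demote features of the prediction.
                    for f, c in features(words, gold_tags).items():
                        weights[f] += c
                    for f, c in features(words, pred_tags).items():
                        weights[f] -= c
                # Accumulate for the averaged perceptron.
                for f, c in weights.items():
                    totals[f] += c
                n_updates += 1

        # Averaged weights (Freund & Schapire 99 style).
        return {f: total / n_updates for f, total in totals.items()}

Returning the running average of the weight vectors, rather than the final vector, is the averaging step of (Freund & Schapire 99); it acts as a form of smoothing over the sequence of parameter settings visited during training.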
2 Parameter Estimation
2.1 HMM Taggers
In this section, as a motivating example, we describe a special case of the algorithm in this paper: the algorithm applied to a trigram tagger. In a trigram HMM tagger, each trigram
[1] The theorems in section 3, and the proofs in section 5, apply directly to the work in these other papers.
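As a concrete companion to the trigram tagger of section 2.1, here is a minimal Python sketch of Viterbi decoding over tag trigrams. The score function (for example, a trigram parameter plus a tag/word parameter) and the tag set are assumed inputs for illustration, not definitions taken from the paper.

    def viterbi_trigram(words, tags, score):
        """Sketch of Viterbi decoding for a trigram tagger.

        words: the input sentence, a list of tokens
        tags:  the tag set
        score: score(t2, t1, t, word) -> local score for tag t at
               the current position, given the two previous tags
               (t2, t1), e.g. a trigram parameter plus a tag/word
               parameter
        """
        if not words:
            return []
        START = "<s>"
        # pi[(u, v)]: best score of any sequence ending in tags (u, v)
        pi = {(START, START): 0.0}
        backptrs = []  # backptrs[k][(u, v)] = best tag two steps back

        for word in words:
            new_pi, bp = {}, {}
            for (t2, t1), prev in pi.items():
                for t in tags:
                    s = prev + score(t2, t1, t, word)
                    if (t1, t) not in new_pi or s > new_pi[(t1, t)]:
                        new_pi[(t1, t)] = s
                        bp[(t1, t)] = t2
            pi = new_pi
            backptrs.append(bp)

        # Take the best final tag pair, then follow back-pointers.
        u, v = max(pi, key=pi.get)
        seq = [u, v]
        for bp in reversed(backptrs[2:]):
            seq.insert(0, bp[(seq[0], seq[1])])
        return seq[-len(words):]

Because the dynamic program tracks pairs of adjacent tags as its states, decoding a sentence of length n with tag set T takes O(n|T|^3) time, which is the usual cost of Viterbi decoding for trigram taggers.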