Maximum F1-Score Discriminative Training for Automatic Mispronunciation
Detection in Computer-Assisted Language Learning
Hao Huang¹, Jianming Wang¹, Halidan Abudureyimu²
1. Department of Information Science and Engineering, Xinjiang University, Urumqi, China
2. Department of Electrical Engineering, Xinjiang University, Urumqi, China
{huanghao,jmwang,halidana}@xju.edu.cn
Abstract
In this paper, we propose and evaluate a novel discriminative training criterion for hidden Markov model (HMM) based automatic mispronunciation detection in computer-assisted pronunciation training. The objective function is formulated as a smooth form of the F1-score on an annotated non-native speech database. The objective function is maximized using extended Baum-Welch form HMM updating equations derived with the weak-sense auxiliary function method. Simultaneous updating of the acoustic model and phone threshold parameters is proposed to ensure objective improvement. Mispronunciation detection experiments show that the method is effective in increasing the F1-score, Precision, Recall and detection accuracy on both the training data and the evaluation data.
Index Terms: automatic mispronunciation detection, F1-score, discriminative training, computer-assisted language learning
1. Introduction
Computer-assisted language learning that makes use of automatic speech recognition technology has gained growing interest over the last two decades. Computer-Assisted Pronunciation Training (CAPT), which aims at helping the learner by automatically pinpointing erroneous pronunciations, is one of the most widely deployed applications. A large body of research has been carried out and various mispronunciation detection techniques have been proposed. While some new paradigms have been explored [1], HMM based acoustic modeling remains the mainstream. Within this framework, a confidence score based on the frame-normalized log posterior probability is a conventional measure of correctness.
In HMM based mispronunciation detection, the acoustic models are often trained with the maximum likelihood (ML) criterion on native speech data to model the pronunciation space of standard speakers. In recent years, discriminative training (DT) has been widely used in speech recognition acoustic model training and has been shown to give significant improvement over traditional ML estimation. The most common DT methods are minimum classification error (MCE) [2] and maximum mutual information (MMI) [3] or its variant, minimum phone error (MPE) [4,5], training. For mispronunciation detection, the authors in [6] proposed a discriminative training algorithm that jointly minimizes mispronunciation detection errors and diagnosis errors, refining the acoustic models under a minimum word error criterion. The above-mentioned methods focus on reducing the empirical recognition error (phone or word error rate) on the training set. In a mispronunciation detection task, however, the evaluation measures can be diverse and entirely different from those used in speech recognition. Commonly used criteria include False Rejections (correct pronunciations detected as incorrect), False Acceptances (errors detected as correct), True Acceptances (correct pronunciations detected as correct) and True Rejections (errors detected as incorrect). Some work uses Precision and Recall as the performance measure. These metrics can be effective; however, an empirical tradeoff often has to be made among the multiple objectives.
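How the four counts above combine into Precision, Recall, F1 and detection accuracy can be made concrete with a short sketch. Note that the choice of "positive" class varies across papers; here rejection (flagging a mispronunciation) is taken as the positive class, which is an assumption, not something the text specifies:

```python
def detection_metrics(true_acc, true_rej, false_acc, false_rej):
    """Standard mispronunciation-detection metrics from the four outcome counts.

    Rejection (detecting a mispronunciation) is treated as the positive class:
      Precision = TR / (TR + FR)  -- of all rejections, how many were real errors
      Recall    = TR / (TR + FA)  -- of all real errors, how many were rejected
    """
    precision = true_rej / (true_rej + false_rej)
    recall = true_rej / (true_rej + false_acc)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (true_acc + true_rej) / (
        true_acc + true_rej + false_acc + false_rej
    )
    return precision, recall, f1, accuracy
```

The one-dimensional F1 (the harmonic mean of Precision and Recall) is what removes the need for a hand-tuned tradeoff between the two.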
The F1-score, a synthetic one-dimensional indicator, is nowadays routinely used as an important metric to estimate the performance of natural language processing (NLP) and information retrieval (IR) systems. Recently, researchers have begun to refine system parameters by directly maximizing the F1-score [7,8]. In mispronunciation detection, the authors in [9] began to use the F1-score as a performance measure. On the other hand, the increasing amount of human-annotated non-native data makes it reasonable and feasible to refine systems directly on large L2 speech corpora, and much work has been done along these lines, such as [1,10]. However, despite its popularity, few methods that directly optimize the HMM acoustic models in terms of the F1-score have been proposed so far. Inspired by these observations, we propose a discriminative training algorithm for HMM based automatic mispronunciation detection which aims at maximizing the empirical F1-score on annotated L2 speech data. A smooth version of the F1-score objective function is proposed, which we denote the maximum F1-score criterion (MFC). Extended Baum-Welch (EBW) form HMM updating functions are derived using the weak-sense auxiliary function method [5]. Mispronunciation detection experiments have shown the effectiveness of the proposed method.
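To see why a smooth form of the F1-score is needed, note that the raw F1 is built from hard accept/reject decisions and is therefore piecewise constant in the model and threshold parameters. One common way to obtain a differentiable surrogate is to replace each hard threshold decision with a sigmoid; the sketch below illustrates this idea only, and is not necessarily the exact smooth form derived in Section 3 (the names threshold and alpha are illustrative):

```python
import math

def sigmoid(x, alpha):
    # Smooth 0/1 decision; approaches a hard step as alpha grows.
    return 1.0 / (1.0 + math.exp(-alpha * x))

def soft_f1(scores, labels, threshold, alpha=5.0):
    """Differentiable surrogate of the F1-score for threshold-based rejection.

    scores: per-segment confidence (e.g. GOP; higher = more native-like)
    labels: 1 if the segment is truly mispronounced, else 0
    A segment is "rejected" when its score falls below threshold; the hard
    indicator is replaced by sigmoid(threshold - score) so that the
    objective is smooth in both the scores and the threshold.
    """
    # Soft count of True Rejections: mispronounced segments that get rejected.
    soft_tr = sum(sigmoid(threshold - s, alpha)
                  for s, l in zip(scores, labels) if l == 1)
    # Soft count of all rejections (true or false).
    soft_rej = sum(sigmoid(threshold - s, alpha) for s in scores)
    n_err = sum(labels)  # number of truly mispronounced segments
    # F1 = 2PR/(P+R) with P = TR/rejections, R = TR/errors
    # simplifies to 2*TR / (rejections + errors).
    return 2.0 * soft_tr / (soft_rej + n_err)
```

As the scores separate the two classes cleanly around the threshold, the surrogate approaches 1; gradients with respect to both the scores (hence the HMM parameters) and the threshold are well defined, which is what makes simultaneous updating of acoustic model and phone threshold parameters possible.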
In Section 2, the goodness of pronunciation (GOP) based measure [11] is briefly reviewed. Section 3 discusses the objective function and its optimization. Section 4 presents the experiments and results. Section 5 concludes.
2. GOP based mispronunciation detection
The task of mispronunciation detection is to verify whether the pronunciation of each phone is correct or not. GOP [11] is the most conventionally used method. In this approach, a confusion network which includes the canonical phone pronunciations and all possible mispronunciations needs to be built. This is normally obtained by forced alignment according to the canonical transcription, after which all the possible pronunciation realizations are added. Given the acoustic observations of R training utterances O_r, r = 1, ..., R, let O_{r,n} be the acoustic observations of the n-th phonetic segment in utterance r, which is composed of N_r segments, and let q_{r,n} denote the canonical label of segment (r, n). The GOP of phone segment (r, n) is then calculated as:
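Following the standard definition in [11], this is the duration-normalized log posterior of the canonical phone, commonly approximated with a maximum over the phone set; the formula below is a reconstruction on that basis, with T_{r,n} denoting the duration of segment (r, n) in frames:

$$
\mathrm{GOP}(r,n) = \frac{1}{T_{r,n}} \log P(q_{r,n} \mid \mathbf{O}_{r,n})
\approx \frac{1}{T_{r,n}} \log \frac{p(\mathbf{O}_{r,n} \mid q_{r,n})}{\max_{q} \, p(\mathbf{O}_{r,n} \mid q)}
$$

A phone segment is typically flagged as mispronounced when its GOP falls below a phone-dependent threshold.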
ISCA Archive (http://www.isca-speech.org/archive): INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012