Maximum F1-Score Discriminative Training for Automatic Mispronunciation
Detection in Computer-Assisted Language Learning
Hao Huang¹, Jianming Wang¹, Halidan Abudureyimu²
1. Department of Information Science and Engineering, Xinjiang University, Urumqi, China
2. Department of Electrical Engineering, Xinjiang University, Urumqi, China
{huanghao,jmwang,halidana}@xju.edu.cn
Abstract
In this paper, we propose and evaluate a novel discriminative training criterion for hidden Markov model (HMM) based automatic mispronunciation detection in computer-assisted pronunciation training. The objective function is formulated as a smooth form of the F1-score on an annotated non-native speech database. The objective function is maximized using extended Baum-Welch form HMM updating equations derived with the weak-sense auxiliary function method. Simultaneous updating of the acoustic model and phone threshold parameters is proposed to ensure objective improvement. Mispronunciation detection experiments show that the method is effective in increasing the F1-score, Precision, Recall and detection accuracy on both the training data and the evaluation data.
Index Terms: automatic mispronunciation detection, F1-score, discriminative training, computer-assisted language learning
1. Introduction
Computer-assisted language learning that makes use of automatic speech recognition technology has gained growing interest over the last two decades. Computer-Assisted Pronunciation Training (CAPT), which aims at helping the learner by automatically pinpointing erroneous pronunciations, is one of the most widely deployed applications. A large body of research has been carried out and various mispronunciation detection techniques have been proposed. While some new paradigms have been explored [1], HMM based acoustic modeling remains the mainstream. Within this framework, a confidence score based on the frame-normalized log posterior probability is a conventional measure of correctness.
In HMM based mispronunciation detection, the acoustic models are often trained with the maximum likelihood (ML) criterion on native speech data to model the pronunciation space of standard speakers. In recent years, discriminative training (DT) has been widely used in speech recognition acoustic model training and has been shown to give significant improvement over traditional ML estimation. The most common DT methods are minimum classification error (MCE) [2] and maximum mutual information (MMI) [3] or its variant, minimum phone error (MPE) [4,5], training. For mispronunciation detection, the authors in [6] proposed a discriminative training algorithm that jointly minimizes mispronunciation detection errors and diagnosis errors, refining the acoustic models under a minimum word error criterion. The above-mentioned methods focus on reducing the empirical recognition error (phone or word error rate) on the training set. In a mispronunciation detection task, however, the evaluation measures can be diverse and entirely different from those used in speech recognition. Commonly used criteria include False Rejections (correct pronunciations detected as incorrect), False Acceptances (errors detected as correct), True Acceptances (correct pronunciations detected as correct) and True Rejections (errors detected as incorrect). Some work uses Precision and Recall as the performance measure. These metrics can be effective; however, an empirical tradeoff often has to be made among the multiple objectives.
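How the four counts above combine into Precision, Recall, F1 and detection accuracy can be made concrete with a short sketch. Note that the choice of "positive" class varies across papers; here rejection (flagging a mispronunciation) is taken as the positive class, which is an assumption, not something the text specifies:

```python
def detection_metrics(true_acc, true_rej, false_acc, false_rej):
    """Standard mispronunciation-detection metrics from the four outcome counts.

    Rejection (detecting a mispronunciation) is treated as the positive class:
      Precision = TR / (TR + FR)  -- of all rejections, how many were real errors
      Recall    = TR / (TR + FA)  -- of all real errors, how many were rejected
    """
    precision = true_rej / (true_rej + false_rej)
    recall = true_rej / (true_rej + false_acc)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (true_acc + true_rej) / (
        true_acc + true_rej + false_acc + false_rej
    )
    return precision, recall, f1, accuracy
```

The one-dimensional F1 (the harmonic mean of Precision and Recall) is what removes the need for a hand-tuned tradeoff between the two.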
The F1-score, a synthetic one-dimensional indicator, is nowadays routinely used as an important metric to estimate the performance of natural language processing (NLP) and information retrieval (IR) systems. Recently, researchers have begun to refine system parameters by directly maximizing the F1-score [7,8]. In mispronunciation detection, the authors in [9] began to use the F1-score as a performance measure. On the other hand, the increasing amount of human-annotated non-native data makes it reasonable and feasible to refine systems directly on large L2 speech corpora, and much work has been done along these lines, such as [1,10]. However, despite its popularity, few methods that directly optimize the HMM acoustic models in terms of the F1-score have been proposed so far. Inspired by these observations, we propose a discriminative training algorithm for HMM based automatic mispronunciation detection which aims at maximizing the empirical F1-score on annotated L2 speech data. A smooth version of the F1-score objective function is proposed, which we denote the maximum F1-score criterion (MFC). Extended Baum-Welch (EBW) form HMM updating functions are derived using the weak-sense auxiliary function method [5]. Mispronunciation detection experiments have shown the effectiveness of the proposed method.
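To see why a smooth form of the F1-score is needed, note that the raw F1 is built from hard accept/reject decisions and is therefore piecewise constant in the model and threshold parameters. One common way to obtain a differentiable surrogate is to replace each hard threshold decision with a sigmoid; the sketch below illustrates this idea only, and is not necessarily the exact smooth form derived in Section 3 (the names threshold and alpha are illustrative):

```python
import math

def sigmoid(x, alpha):
    # Smooth 0/1 decision; approaches a hard step as alpha grows.
    return 1.0 / (1.0 + math.exp(-alpha * x))

def soft_f1(scores, labels, threshold, alpha=5.0):
    """Differentiable surrogate of the F1-score for threshold-based rejection.

    scores: per-segment confidence (e.g. GOP; higher = more native-like)
    labels: 1 if the segment is truly mispronounced, else 0
    A segment is "rejected" when its score falls below threshold; the hard
    indicator is replaced by sigmoid(threshold - score) so that the
    objective is smooth in both the scores and the threshold.
    """
    # Soft count of True Rejections: mispronounced segments that get rejected.
    soft_tr = sum(sigmoid(threshold - s, alpha)
                  for s, l in zip(scores, labels) if l == 1)
    # Soft count of all rejections (true or false).
    soft_rej = sum(sigmoid(threshold - s, alpha) for s in scores)
    n_err = sum(labels)  # number of truly mispronounced segments
    # F1 = 2PR/(P+R) with P = TR/rejections, R = TR/errors
    # simplifies to 2*TR / (rejections + errors).
    return 2.0 * soft_tr / (soft_rej + n_err)
```

As the scores separate the two classes cleanly around the threshold, the surrogate approaches 1; gradients with respect to both the scores (hence the HMM parameters) and the threshold are well defined, which is what makes simultaneous updating of acoustic model and phone threshold parameters possible.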
In Section 2, the goodness of pronunciation (GOP) based measure [11] is briefly reviewed. Section 3 discusses the objective function and its optimization. Section 4 presents the experiments and results. Section 5 concludes.
2. GOP based mispronunciation detection
The task of mispronunciation detection is to verify whether the pronunciation of each phone is correct or not. GOP [11] is the most conventionally used method. In this approach, a confusion network which includes the canonical phone pronunciations and all possible mispronunciations needs to be built. This is normally obtained by forced alignment according to the canonical transcription, after which all the possible pronunciation realizations are added. Given the acoustic observations of R training utterances O_r, r = 1, ..., R, let O_{r,n} be the acoustic observations of the n-th phonetic segment in utterance r, which is composed of N_r segments, and let q_{r,n} denote the canonical label of segment (r, n). The GOP of phone segment (r, n) is then calculated as:
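Following the standard definition in [11], this is the duration-normalized log posterior of the canonical phone, commonly approximated with a maximum over the phone set; the formula below is a reconstruction on that basis, with T_{r,n} denoting the duration of segment (r, n) in frames:

$$
\mathrm{GOP}(r,n) = \frac{1}{T_{r,n}} \log P(q_{r,n} \mid \mathbf{O}_{r,n})
\approx \frac{1}{T_{r,n}} \log \frac{p(\mathbf{O}_{r,n} \mid q_{r,n})}{\max_{q} \, p(\mathbf{O}_{r,n} \mid q)}
$$

A phone segment is typically flagged as mispronounced when its GOP falls below a phone-dependent threshold.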
ISCA Archive (http://www.isca-speech.org/archive): INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012