Morphological normalization of vowel images for articulatory speech
recognition
Jianguo Wei a,b, Jingshu Zhang b, Yan Ji b, Qiang Fang c, Wenhuan Lu a,*
a School of Computer Software, Tianjin University, 135 Yaguan Road, Jin Nan District, Tianjin 300350, China
b Tianjin Key Laboratory of Cognitive Computing and Application, School of Computer Science and Technology, Tianjin University, 135 Yaguan Road, Jin Nan District, Tianjin 300350, China
c Chinese Academy of Social Sciences, Beijing, China
article info
Article history:
Received 15 March 2016
Revised 30 June 2016
Accepted 12 October 2016
Available online 17 October 2016
Keywords:
Vocal tract normalization
Articulatory data
Acoustic data
Thin-Plate Spline
DNN
Articulatory recognition
abstract
Minimizing morphological variance of the vocal tract across speakers is a challenge for articulatory
analysis and modeling. To reduce morphological differences in speech organs among speakers while
retaining each speaker’s speech dynamics, our study proposes a method of normalizing the vocal-tract
shapes of Mandarin and Japanese speakers using the Thin-Plate Spline (TPS) method. We apply the
properties of TPS in a two-dimensional space to normalize vocal-tract shapes, and we use Deep Neural
Network (DNN) based speech recognition for our evaluations. We obtained our template for
normalization by measuring three speakers’ palates and tongue shapes. Our results show a reduction
in variance among subjects. The similar vowel structure of the pre- and post-normalization data
indicates that our framework retains speaker-specific characteristics. Articulatory recognition of
isolated phonemes improved by 25%, and the phone error rate for continuous speech was reduced by
5.84%.
© 2016 Elsevier Inc. All rights reserved.
1. Introduction
In recent years, speech recognition technology has advanced
significantly. Speaker adaptation and system robustness
remain vital to speech recognition systems. Interestingly, much
of the articulatory data used for speech research is also used for
acoustic data [1]. However, articulatory data are not widely applied.
One reason is that acquiring such data is difficult. Another is that
variances in vocal tracts make the data difficult to use in
multi-subject articulatory research [2]. Hence, articulatory data are
not as popular as acoustic data in spite of their importance in the
speech research field. In order to discover the kinematic properties
that characterize speaker differences, it is necessary to normalize
inter-subject articulatory data so that morphological variances
among different speakers are reduced.
It is important to recognize that vocal tracts differ among
subjects and that large nonlinear deformations can occur in them.
It is therefore difficult to study vocal tract shape using affine
transformations of simple rigid objects. Researchers have proposed
many normalization techniques for articulatory space and acoustic
space. For instance, Bechman et al. [3] proposed straightening the
walls of vocal tracts in order to transform the coordinates of X-ray
microbeam data. Hashi et al. [4] also proposed a method of
normalizing vowel postures for an X-ray microbeam database. Both
methods straighten the vocal tract walls in order to normalize vocal
tract length; however, this can cause the relative relationship
between the palate and tongue surface to change significantly after
transformation. In acoustic space, Pitz et al. processed vocal tract
length by using a linear transformation in the frequency domain [5].
Additionally, Saheer et al. normalized the length of the vocal tract
by using a linear transformation method [6]. All of these studies
attempt to normalize vocal tract length (in either articulatory or
acoustic space) without considering the articulatory features of
vocal tract shapes.
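To make the idea of a linear transformation in the frequency domain concrete, the following is a minimal sketch of linear frequency-axis warping applied to a magnitude spectrum. It is a generic illustration only, not the formulation used in [5] or [6]; the function name `linear_warp_spectrum`, the warping factor `alpha`, and the synthetic single-peak spectrum are assumptions introduced here.

```python
import numpy as np


def linear_warp_spectrum(spectrum, alpha):
    """Resample a magnitude spectrum on a linearly scaled frequency axis.

    Hypothetical helper for illustration: output bin k takes its value from
    position alpha * k of the input (linear interpolation between bins).
    Positions beyond the last bin are clamped to the final bin.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    bins = np.arange(spectrum.size)
    # Where in the original spectrum each output bin reads from.
    source_positions = np.clip(alpha * bins, 0, spectrum.size - 1)
    return np.interp(source_positions, bins, spectrum)


if __name__ == "__main__":
    # Synthetic spectrum with a single formant-like peak at bin 40 (assumed data).
    bins = np.arange(128)
    spectrum = np.exp(-0.5 * ((bins - 40) / 3.0) ** 2)

    for alpha in (0.8, 1.0, 1.25):
        warped = linear_warp_spectrum(spectrum, alpha)
        print(f"alpha={alpha}: peak at bin {int(np.argmax(warped))}")
```

A single scalar `alpha` moves every spectral peak by the same ratio, so it can compensate for an overall difference in vocal tract length but cannot express local, shape-dependent differences between speakers.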
Because the vocal tract shape usually reflects local and nonlinear
deformations, it can be treated as a kind of non-rigid shape
deformation. Based on this idea, our study proposes a framework
of normalizing speakers’ EMA (Electromagnetic Midsagittal
Articulographic) data by using a TPS (Thin-Plate Spline warping) method
[7] (a non-linear transformation method applied in shape
http://dx.doi.org/10.1016/j.jvcir.2016.10.005
1047-3203/© 2016 Elsevier Inc. All rights reserved.
This paper has been recommended for acceptance by Zicheng Liu.
* Corresponding author.
E-mail addresses: jianguo@tju.edu.cn (J. Wei), jingshu@tju.edu.cn (J. Zhang),
tjujiyan@tju.edu.cn (Y. Ji), fangqiang@cass.org.cn (Q. Fang), wenhuan@tju.edu.cn
(W. Lu).
J. Vis. Commun. Image R. 41 (2016) 352–360