A Multi-channel/Multi-speaker Articulatory Database in Mandarin for Speech Visualization

Dan Zhang^1,2,3, Xianqian Liu^1,2, Nan Yan^1,2, Lan Wang^1,2, Yun Zhu^1,2 and Hui Chen^4

^1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
^2 The Chinese University of Hong Kong, Hong Kong, China
^3 School of Information Science and Technology, University of Science and Technology of China, Hefei, China
^4 Institute of Software, Chinese Academy of Sciences
{dan.zhang, liu.xq, nan.yan, lan.wang, yun.zhu}@siat.ac.cn, chenhui@iscas.ac.cn
Abstract
Articulatory databases have been applied to speech production research and automatic speech recognition for many years. The goal of this research was to build an articulatory database specializing in Mandarin Chinese production and to investigate its efficacy in speech animation. A Carstens EMA AG501 device was used to capture acoustic and articulatory data simultaneously. In addition, a Microsoft Kinect camera was used to capture supplementary face-tracking data. Finally, we applied several methods to extract acoustic parameters and built a 3D talking-head model to verify the efficacy of the database.
Index Terms: articulatory database, Mandarin, EMA, Kinect
camera
1. Introduction
Recent years have witnessed growing interest in applying articulatory parameters to automatic speech recognition and speech production. Research conducted by ASR groups has focused on inferring articulatory features from acoustic data using vocal models [1-3], from linguistic rules [4-6], or from restricted articulatory rules [7-8].
In addition, speech production research has investigated speech animation models and methods. A research group in Edinburgh first set up a speech production database, the MOCHA database [9]. The database includes articulatory data collected with a Carstens EMA AG100, a laryngograph, and EPG. The MOCHA database is mainly applied in ASR and 3D speech animation. Qin et al. [10] proposed an algorithm that uses the MOCHA database to recover realistic tongue contours from articulatory data based on 2D coordinates. Another database was set up at KTH, focusing on 3D speech animation. An example is ARTUR, developed by Olle Bälter et al. [11], which specializes in speech animation and computer-aided speech learning. Using ARTUR, children can improve their pronunciation during training sessions. Additionally, studies in articulatory phonology suggest that variation in the extent and timing of articulatory gestures can account for many of the segmental deletions and assimilations commonly encountered in casual speech [12]. This provides a theoretical basis for supposing that articulatory parameters could prove more robust to inter- and intra-speaker variability.
Despite progress in speech production research, articulatory databases for Mandarin remain scarce. Mandarin differs from Latin languages in many ways, especially in the construction of syllables and words: Chinese words are mainly monosyllabic, whereas words in Latin languages are usually polysyllabic. With the widespread use of articulatory measurement tools in speech production research, we set out to build a database specializing in Mandarin pronunciation. A Mandarin database describing both the acoustic and kinematic features of Mandarin pronunciation is also necessary for audio-visual language animation. These parameters should include acoustic parameters, such as formant frequencies, as well as kinematic characteristics, such as tongue, lip, and jaw positions. The database is divided into three levels (phoneme, word, and sentence), with each level containing the major vowels and consonants of Mandarin. To verify the usability and efficacy of the corpus, we conducted a series of experiments aimed at producing various kinds of pronunciation-related data and speech animations.
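To make this three-level organization concrete, the following sketch shows one possible record layout for a database entry. The Python dataclass and all field names are illustrative assumptions for exposition, not the actual storage format of the database.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ArticulatoryRecord:
    """Hypothetical record layout for one database item (all fields are illustrative)."""
    level: str            # "phoneme", "word", or "sentence"
    transcript: str       # Pinyin transcription of the prompt
    audio: np.ndarray     # mono waveform samples
    formants: np.ndarray  # (n_frames, 3) F1-F3 tracks in Hz
    ema: np.ndarray       # (n_samples, n_sensors, 3) tongue/lip/jaw coordinates in mm
    kinect: np.ndarray    # (n_frames, 121, 3) Kinect face-mesh points
```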
In the present study, a Carstens AG501 electromagnetic articulograph (EMA) is used to collect kinematic and speech data. EMA records articulatory movements with small sensors placed on the tongue, lips, or elsewhere on the face during speech production, while simultaneously recording the speech stream. Moreover, EMA provides a data-acquisition resolution high enough to capture both vowels and consonants. In our experiment, we limited each acquisition session to less than one hour to reduce speaker discomfort.
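EMA sensor trajectories are commonly low-pass filtered before analysis to remove measurement noise while preserving articulatory movement. The following is a minimal sketch of such preprocessing, assuming the positions have already been exported to a NumPy array; the 250 Hz sampling rate and 20 Hz cutoff are assumptions for illustration, not settings reported in this study.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_ema(positions: np.ndarray, fs: float = 250.0, cutoff: float = 20.0) -> np.ndarray:
    """Zero-phase low-pass filter for EMA sensor trajectories.

    positions: (n_samples, n_channels) array of sensor coordinates in mm.
    fs: assumed sampling rate in Hz; cutoff: assumed articulatory bandwidth in Hz.
    """
    b, a = butter(5, cutoff / (fs / 2.0), btype="low")
    # filtfilt applies the filter forward and backward, so no phase lag is introduced.
    return filtfilt(b, a, positions, axis=0)
```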
In addition, a Microsoft Kinect camera is applied in the data collection process to capture the movements of the facial articulators during speech. The Microsoft Kinect SDK provides face-tracking methods and demos that can be applied directly in our experiment. The Kinect camera acquires the positions of the meshed face-tracking points at a rate of 20 frames per second. In total, 121 tracking points can be located on a virtual human face, which is sufficient for most 3D speech animation applications. The face-tracking model is a useful supplement to the EMA articulatory database.
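Because the Kinect face-tracking stream (20 fps) and the EMA stream run at different rates, the two must be brought onto a common time axis before they can be used jointly. Below is a minimal sketch of linear-interpolation resampling; it assumes both recordings start at the same instant, and the function name and rates are illustrative assumptions rather than part of the recording pipeline described here.

```python
import numpy as np

def resample_to_ema(face_pts: np.ndarray, face_fps: float, ema_times: np.ndarray) -> np.ndarray:
    """Interpolate Kinect face points (n_frames, 121, 3) onto EMA sample times.

    Assumes both streams begin at t = 0; in practice a shared trigger or
    cross-correlation of the audio tracks would be used for alignment.
    """
    face_times = np.arange(face_pts.shape[0]) / face_fps
    flat = face_pts.reshape(face_pts.shape[0], -1)  # (n_frames, 121 * 3)
    # Interpolate each of the 363 coordinate channels independently.
    out = np.stack([np.interp(ema_times, face_times, flat[:, k])
                    for k in range(flat.shape[1])], axis=1)
    return out.reshape(len(ema_times), 121, 3)
```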
2. Corpus and Subject
2.1 Design of Recording Sessions
Our speech corpora are balanced in several ways to meet the need to investigate various kinds of pronunciation and the causes of mispronunciation. The balance requirements include:
1. The numbers of male and female speakers are approximately equal.