Characteristics-based effective applause detection for meeting speech
Yan-Xiong Li a,*, Qian-Hua He a, Sam Kwong b, Tao Li a, Ji-Chen Yang a
a School of Electronic and Information Engineering, South China University of Technology, 381 Wushan Road, Guangzhou 510640, Guangdong Province, China
b Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
article info
Article history:
Received 22 May 2008
Received in revised form 8 December 2008
Accepted 2 March 2009
Available online 10 March 2009
Keywords:
Applause characteristics
Applause detection
Meeting speech
Spontaneous speech recognition
abstract
Applause frequently occurs in multi-participant meeting speech. In fact, detecting applause is quite important for meeting speech recognition, semantic inference, highlight extraction, etc. In this paper, we first study the characteristic differences between applause and speech, such as duration, pitch, spectrogram and occurrence locations. Then, an effective algorithm based on these characteristics is proposed for detecting applause in a meeting speech stream. In the algorithm, the non-silence signal segments are first extracted by using voice activity detection. Afterwards, applause segments are detected from the non-silence signal segments based on the characteristic differences between applause and speech, without using any complex statistical models such as hidden Markov models. The proposed algorithm can accurately determine the boundaries of applause in a meeting speech stream, and it is also computationally efficient. In addition, it can extract applause sub-segments from mixed segments. Experimental evaluations show that the proposed algorithm achieves satisfactory results in detecting applause in meeting speech: the precision rate, recall rate, and F1-measure are 94.34%, 98.04%, and 96.15%, respectively. Compared with the traditional algorithm under the same experimental conditions, a 3.62% improvement in F1-measure is achieved, and about 35.78% of the computational time is saved.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Recently, many researchers have become interested in the processing of meeting speech [1–4] due to the rapid increase of multimedia materials. In a multi-participant meeting, attendees can use applause (a loud sound generated by repeatedly striking the palms of the hands together) to welcome or show appreciation for a speaker's speech. Therefore, applause can be regarded as a communication medium between the speaker and the audience, and it can clearly indicate certain outcomes of the speaker's speech. The motivation of this paper is based on the following reasons. First, applause frequently occurs in meetings. Second, detecting applause is helpful for many applications, such as meeting speech recognition, speech emotion recognition, semantic inference, and highlight extraction.
There exists some recent work on detecting or processing applause for various applications. In Ref. [5], applause and other sounds were modeled with hidden Markov models (HMMs), and a grammar network was proposed to connect the various models so as to fully explore the transitions among these different kinds of sounds. Features such as short-time energy (STE), zero-crossing rate (ZCR), band energy ratios (BERs), and mel-frequency cepstral coefficients (MFCCs) were extracted as feature vectors for training the HMMs.
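As a rough illustration of such frame-level features (and not the actual implementation used in Ref. [5]), the following minimal Python sketch computes STE and ZCR per frame; the frame length, frame shift, and sampling rate are illustrative assumptions.

    # Illustrative sketch only: frame-level short-time energy (STE) and
    # zero-crossing rate (ZCR), two of the features listed above.
    # Assumes a 16 kHz signal at least one frame long; 25 ms frames, 10 ms shift.
    import numpy as np

    def frame_signal(x, frame_len=400, frame_shift=160):
        # Split a 1-D signal into overlapping frames (rows of the returned matrix).
        n_frames = 1 + (len(x) - frame_len) // frame_shift
        idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
        return x[idx]

    def short_time_energy(frames):
        # STE: mean squared amplitude of each frame.
        return np.mean(frames.astype(float) ** 2, axis=1)

    def zero_crossing_rate(frames):
        # ZCR: fraction of sign changes between consecutive samples in a frame.
        signs = np.sign(frames)
        return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.standard_normal(16000)       # 1 s of noise as stand-in audio
        frames = frame_signal(x)
        ste = short_time_energy(frames)
        zcr = zero_crossing_rate(frames)
        print(ste.shape, zcr.shape)          # one value per frame

Per-frame values such as these would typically be stacked with BERs and MFCCs to form the observation vectors used for HMM training.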
In Ref. [6], MPEG-7 features (dimension-reduced spectral vectors obtained using a linear transformation of
doi:10.1016/j.sigpro.2009.03.001
Signal Processing 89 (2009) 1625–1633
* Corresponding author. Tel.: +86 020 87113544, +86 15915766896; fax: +86 020 87112470.
E-mail addresses: li.yanxiong@mail.scut.edu.cn, yanxiongli@163.com (Y.-X. Li), eeqhhe@scut.edu.cn (Q.-H. He), CSSAMK@cityu.edu.hk (S. Kwong), litao@scut.edu.cn (T. Li), NisonYoung@yahoo.cn (J.-C. Yang).