1 INTRODUCTION
Conversational speech contains many pauses. Speech endpoint
detection is the process of deciding which segments of a
signal containing background noise are speech and which are
noise, and of locating the beginning and end of each speech
segment accurately [1-4]. Research shows that more than half
of the errors of a speech recognition system come from
endpoint detection, so the success or failure of the system
depends to a large extent on the accuracy of endpoint
detection.
Speech endpoint detection has been studied for decades and
many methods have been proposed, but the traditional energy
and zero-crossing rate methods are no longer robust under
low signal-to-noise ratios. In recent years, driven by the
strong demand for practical speech communication quality and
speech recognition technology, many new methods have
appeared. They mainly use new features to improve noise
robustness, for example frequency band variance, HMM models,
frequency domain energy, information entropy, differential
energy and differential zero-crossing rate, TF parameters,
autocorrelation similarity distance, higher-order
statistics, the short-time energy-zero-product, and
discrimination information [5-8].
Although speech endpoint detection systems achieve high
performance in laboratory environments, their performance
deteriorates dramatically under the influence of background
noise and the transmission channel in practical
environments. For instance, the frequency band variance
method encounters impulse interference in practical
applications, and the affected regions can have large
short-time feature values, so the threshold is difficult to
determine. Although the HMM method has high accuracy, it
needs a pre-trained model. The information entropy method
can effectively distinguish voiced speech from noise, but it
has difficulty distinguishing unvoiced sounds from noise.
The short-time energy-zero-product method is very simple,
but it uses a fixed threshold, which may lead to poor noise
robustness. Discrimination information is regarded as a
measure of the similarity between signal and noise; its
performance is not very good under low signal-to-noise ratio
conditions, but it performs well when the noise changes
severely. Therefore, we propose a new method based on the
short-time energy-zero-product and discrimination
information, which gives precise and rapid endpoint
detection in severely changing noise environments.

This work is supported by the National Natural Science
Foundation under Grant 61403042 and by the Education
Department of Liaoning Province of China under Grant
L2013423.
2 ALGORITHM DESCRIPTION
2.1 Description of Short-Time Energy-Zero-Product
The product of the short-time energy and the corresponding
short-time zero-crossing rate is called the short-time
energy-zero-product. For each frame, the short-time energy
$E_n$, the short-time zero-crossing rate $Z_n$, and the
short-time energy-zero-product $EZ_n$ are defined
respectively as [9]:

$$E_n = \sum_{k=0}^{N-1} s_w^2(k) \qquad (1)$$

$$Z_n = \sum_{k=1}^{N} \left| \operatorname{sgn}[s_w(k)] - \operatorname{sgn}[s_w(k-1)] \right| \qquad (2)$$

$$EZ_n = E_n \ast Z_n \qquad (3)$$
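As an illustration of Eqs. (1)-(3), the following is a minimal sketch in Python rather than the authors' implementation. The helper names `sgn`, `frame_features`, and `split_frames` are hypothetical. The sketch assumes each frame has already been windowed, treats a zero sample as positive in the sign function, and sums zero crossings only over adjacent sample pairs inside the frame; the paper does not state how frame boundaries or zero samples are handled.

```python
def sgn(x):
    """Sign function; zero is treated as positive (an assumption)."""
    return 1 if x >= 0 else -1


def frame_features(frame):
    """Return (energy, zero-crossing rate, energy-zero-product) of one frame.

    `frame` is a sequence of already-windowed samples s_w(k).
    """
    # Eq. (1): sum of squared samples in the frame.
    energy = sum(s * s for s in frame)
    # Eq. (2): sum of |sgn[s_w(k)] - sgn[s_w(k-1)]| over adjacent pairs
    # inside the frame (boundary handling is an assumption).
    zcr = sum(abs(sgn(frame[k]) - sgn(frame[k - 1]))
              for k in range(1, len(frame)))
    # Eq. (3): product of the two features.
    return energy, zcr, energy * zcr


def split_frames(signal, frame_len, hop):
    """Cut a signal into (possibly overlapping) frames of equal length."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


if __name__ == "__main__":
    # Toy signal: a constant low-level segment followed by an alternating one.
    signal = [0.1] * 8 + [1.0, -1.0] * 4
    for n, frame in enumerate(split_frames(signal, frame_len=4, hop=4)):
        print(n, frame_features(frame))
```

In this toy signal the constant frames have no sign changes, so their product is zero, while the alternating frames have both higher energy and more sign changes, so their product is much larger. A detector would compare this product with a threshold; how that threshold is chosen and updated is outside this sketch.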
Research on Speech Endpoint Detection under Low Signal-to-Noise Ratios
HAN Zhiyan, WANG Jian
College of Engineering, Bohai University, Jinzhou 121000
E-mail: hanzyme@126.com
Abstract: A novel speech endpoint detection algorithm is proposed to improve accuracy in low signal-to-noise
ratio (SNR) conditions. The core of the algorithm is the complementarity between the short-time energy-zero-product
and discrimination information: the short-time energy-zero-product is used to make the first judgment, and
discrimination information based on sub-band energy distribution probabilities is then used to recheck frames at
transitions between noise and speech, so as to avoid detection errors caused by sharp changes in noise amplitude
and by ending speech frames polluted by noise. Moreover, we propose a novel algorithm for dynamically updating the
noise energy threshold, which tracks changes in noise energy better. Simulation results show that the new method
gives precise and rapid endpoint detection in severely changing noise environments, and it lays a good foundation
for subsequent speech research.
Key Words: Speech Signal, Endpoint Detection, Short-Time Energy-Zero-Product, Discrimination Information
978-1-4799-7016-2/15/$31.00 © 2015 IEEE