Sparse Coding for Sound Event Classification
Mingming Zhang¹˒², Weifeng Li¹˒², Longbiao Wang³, Jianguo Wei⁴, Zhiyong Wu¹˒², Qingmin Liao¹˒²
¹Shenzhen Key Lab. of Information Sci&Tech/Shenzhen Engineering Lab. of IS&DRM
²Department of Electronic Engineering/Graduate School at Shenzhen, Tsinghua University, China
³Nagaoka University of Technology, Japan
⁴School of Computer Science and Technology, Tianjin University, China
Abstract—Sound event classification algorithms are generally based on speech recognition methods, comprising two stages: feature extraction and model training. To improve classification performance, researchers typically search for more effective sound features or classifiers, which is difficult. In recent years, sparse coding has provided a class of effective algorithms for capturing high-level representations of input data. In this paper, we present a sound event classification method based on sparse coding and a supervised learning model: the sparse coding coefficients are used as sound event features to train the classification model. Experimental results demonstrate a clear improvement in sound event classification.
I. INTRODUCTION
Non-speech sound event classification is used in many important applications, such as music genre classification [1-4], security surveillance [5], environment detection [6-7], and health care. Sound event detection and classification systems generally use methods derived from speech recognition, which contain two steps: first, sound event features such as MFCCs (Mel-Frequency Cepstral Coefficients) or PLP (Perceptual Linear Prediction) are extracted from labeled training sounds; second, a classifier such as an SVM (Support Vector Machine), GMM (Gaussian Mixture Model), or HMM (Hidden Markov Model) is trained on the extracted features. A great deal of related work has been done over the last twenty years. [8] relied on the wavelet transform for detection and on an unsupervised order estimation of GMMs. The basic idea of [9] was to embed probabilistic distances into a classical SVM to classify sound events. [10] presented an efficient, robust sound classification algorithm based on hidden Markov models, while [11] proposed a novel feature extraction method using spectrogram image features. The main difference among the aforementioned methods lies in the particular combination of general features and classifiers.
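As a rough sketch of this two-step pipeline, the following example pairs a toy frame-level spectral feature extractor (a greatly simplified stand-in for MFCC/PLP) with a nearest-centroid classifier (a minimal stand-in for the GMM/SVM/HMM back-ends). All names and parameter choices here are illustrative, not taken from the cited systems.

```python
import numpy as np

def extract_features(signal, frame_len=256, hop=128, n_bands=13):
    """Toy clip-level spectral feature (a stand-in for MFCC/PLP)."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))       # per-frame magnitude spectrum
    bands = np.array_split(spec, n_bands, axis=1)    # crude frequency-band grouping
    feats = np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-8)
    return feats.mean(axis=0)                        # average over frames

class CentroidClassifier:
    """Minimal classifier standing in for the GMM/SVM/HMM back-ends."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]
```

With any real system, the feature extractor and classifier would each be replaced by the stronger components surveyed above; the two-step structure is the point.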
Sparse coding, first introduced by Olshausen [12], is an algorithm that tries to find a high-level representation of an input signal. It learns a dictionary of "basis functions" such that the input signal can be represented by a linear combination of the basis functions with a sparse coefficient vector. In recent years, sparse coding has received more and more attention in many research fields, especially in image processing tasks such as image noise reduction, image restoration, image classification, and face recognition [13-14].
In audio signal processing, sparse coding can be used for speaker identification [15], speech recognition [16-17], speech enhancement [18], and so on. Compared with image processing, however, sparse coding has received less attention in audio signal processing, especially for sound event classification. [19] proposed a joint sparsity classification method that exploits the inner correlation between observations for acoustic signal classification. [20] presented an algorithm for computing shift-invariant sparse coding (SISC) solutions and applied it to audio classification. [21] employed sparse coding of auditory temporal modulations for music genre classification. Sparse coding represents each example with a few non-zero coefficients and thereby obtains a high-level representation of the example; the sparse coefficients can therefore be used as new sound event features for classification with supervised learning.
In this paper, we propose to learn a high-level representation of the input sound event features via sparse coding, and then to train a supervised classifier for our classification task.
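A minimal sketch of this idea, assuming scikit-learn's DictionaryLearning and LinearSVC and synthetic stand-in feature vectors (the data, dictionary size, and regularization values below are illustrative only, not the paper's settings):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-in data: rows are per-clip feature vectors for two sound classes.
X0 = rng.standard_normal((40, 16)) + 2.0   # class-0 clips
X1 = rng.standard_normal((40, 16)) - 2.0   # class-1 clips
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

# Step 1: learn a dictionary and encode each clip as sparse coefficients.
coder = DictionaryLearning(n_components=12, alpha=1.0, max_iter=200,
                           transform_algorithm='lasso_lars', random_state=0)
codes = coder.fit(X).transform(X)          # sparse codes = new high-level features

# Step 2: train a supervised classifier on the sparse codes.
clf = LinearSVC(dual=False).fit(codes, y)
print(clf.score(codes, y))
```

The sparse codes replace the raw features as the classifier's input, which is the core of the proposed method.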
This paper is organized as follows: Section 2 presents the general sparse coding algorithm; Section 3 describes the proposed method; Section 4 gives detailed experimental results and their analysis; finally, we draw our conclusions in Section 5.
II. SPARSE CODING
In this section, we give a brief description of the sparse coding algorithm, covering both coefficient learning and dictionary learning.
Given a signal sample $x \in \mathbb{R}^{m \times 1}$ and a dictionary $D \in \mathbb{R}^{m \times n}$, the signal $x$ can be described by a linear combination of some atoms of the dictionary $D$ as follows:

$$x = Ds \tag{1}$$

The sparse representation $s \in \mathbb{R}^{n \times 1}$ of $x$ can be estimated by solving

$$\min_{D,\,s} \; \|x - Ds\|_2^2 \quad \text{s.t.} \;\; \phi(s) < \sigma, \;\; \sum_i D_{ij}^2 \le c, \;\; \forall j = 1, \ldots, n \tag{2}$$

where $D = [d_1 \; d_2 \; \cdots \; d_n]$ is the dictionary with column vector $d_j$ as the $j$-th atom, and $s$ is the coefficient vector. The above sparse coding problem can therefore be seen as a constrained optimization problem:

$$\min_{D,\,s} \; \|x - Ds\|_2^2 + \beta \sum_j \phi(s_j) \quad \text{s.t.} \;\; \sum_i D_{ij}^2 \le c, \;\; \forall j = 1, \ldots, n \tag{3}$$
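With the dictionary $D$ fixed, the coefficient-learning step of Eq. (3) can be solved by iterative shrinkage-thresholding (ISTA) when $\phi(\cdot)$ is taken as the L1 penalty, a common convex choice; the paper leaves $\phi$ general, so this is one concrete instantiation, sketched in NumPy:

```python
import numpy as np

def ista(x, D, beta, n_iter=500):
    """Estimate the sparse code s minimizing ||x - D s||_2^2 + beta * ||s||_1
    (Eq. (3) with D fixed and phi taken as the L1 penalty), via ISTA."""
    L = 2 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ s - x)           # gradient of the quadratic term
        z = s - grad / L                       # gradient descent step
        s = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)  # soft threshold
    return s
```

The soft-thresholding step is what drives most coefficients exactly to zero, producing the sparse codes used as features; the dictionary-update step would alternate with this one during dictionary learning.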