IET Signal Processing
Research Article

Dictionary learning based on M-PCA-N for audio signal sparse representation

ISSN 1751-9675
Received on 15th March 2016; Revised on 7th September 2017; Accepted on 12th September 2017; E-First on 6th October 2017
doi: 10.1049/iet-spr.2015.0277
www.ietdl.org
Jichen Yang¹,², Qianhua He², Yanxiong Li², Leian Liu¹, Jianhong Li³, Xiaohui Feng²

¹School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Zhongkai Road No. 1, Haizhu District, Guangzhou, People's Republic of China
²School of Electronic and Information Engineering, South China University of Technology, Wushan Road No. 381, Tianhe District, Guangzhou, People's Republic of China
³Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou Higher Education Mega Center, Panyu District, Guangzhou, People's Republic of China

E-mail: eeyxli@scut.edu.cn
Abstract: The currently popular dictionary learning algorithms for sparse representation of signals are K-means singular value decomposition (K-SVD) and its extensions. They use only a rank-1 approximation to update one atom at a time and are therefore unable to cope with large dictionaries efficiently. To tackle these two problems, this study proposes M-Principal Component Analysis-N (M-PCA-N), an algorithm for dictionary learning and sparse representation. First, M-Principal Component Analysis (M-PCA) uses information from the top M ranks of the SVD decomposition to update M atoms at a time. Then, to further exploit the information in the remaining ranks, M-PCA-N is proposed on the basis of M-PCA by transforming the information from the following N non-principal ranks onto the top M principal ranks. The mathematical formulation shows that M-PCA may be seen as a generalisation of K-SVD. Experimental results on the BBC Sound Effects Library show that M-PCA-N not only lowers the mean squared error (MSE) between the original signal and its approximation in audio signal sparse representation, but also achieves higher audio signal classification precision than K-SVD.
1 Introduction
Sparse representations of signals have received continuous attention in recent years [1–4]. In signal processing, the most appropriate linear combination of atoms from an overcomplete dictionary is used to obtain an approximation of the signal; in this way, an approximation whose characteristics are close to those of natural signals such as acoustic signals can be achieved. Sparse representation allows a large amount of information to be represented by a small number of exemplary signals and is widely used in signal processing areas such as communication [5, 6], compressive sampling [1, 7], ground penetrating radar signal classification [8], audio inpainting [9], image sparse representation [10–13], signal recovery [14–16], denoising [17, 18], deblurring [19], compression [20], source separation, mapping, classification [21], synthetic aperture radar target recognition [22, 23], image classification [24, 25], face recognition [26, 27], voice conversion [28, 29] and so on.
Suppose a dictionary matrix $D \in \mathbb{R}^{n \times K}$ with $K$ atoms $\{d_j\}_{j=1}^{K}$. For a signal $Y \in \mathbb{R}^{n}$, the goal of sparse representation may either be exact, $Y = DX$, or approximate, $Y \simeq DX$, satisfying $\| Y - DX \|_F^2 \le \varepsilon$, where the vector $X \in \mathbb{R}^{K}$ contains the representation coefficients of the signal $Y$ [11]. This can also be written as

$$\min_{D, X} \| Y - DX \|_F^2 \quad \text{subject to} \quad \forall i,\ \| x_i \|_0 \le T_0 \tag{1}$$

where $T_0$ is the target sparsity.
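To make the objective in (1) concrete, the following is a minimal numpy sketch (not from the paper; all variable names are ours) that evaluates the squared Frobenius approximation error and checks the per-column sparsity constraint for a toy dictionary and coefficient matrix:

```python
import numpy as np

def approximation_error(Y, D, X):
    """Squared Frobenius error ||Y - DX||_F^2 between signals and their approximation."""
    R = Y - D @ X
    return np.sum(R ** 2)

def satisfies_sparsity(X, T0):
    """Check ||x_i||_0 <= T0 for every coefficient column x_i."""
    return bool(np.all(np.count_nonzero(X, axis=0) <= T0))

# Toy example: random unit-norm dictionary and exactly T0-sparse coefficients.
rng = np.random.default_rng(0)
n, K, L, T0 = 64, 128, 10, 5                 # signal dim, atoms, examples, sparsity
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
X = np.zeros((K, L))
for i in range(L):
    support = rng.choice(K, T0, replace=False)
    X[support, i] = rng.standard_normal(T0)
Y = D @ X                                    # signals built from the sparse codes
print(approximation_error(Y, D, X), satisfies_sparsity(X, T0))
```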
Dictionary fitting is not trivial, and considerable work has been carried out in this area over recent decades. Around 1999, the method of optimal directions (MOD) was presented by Engan et al. [30–32] to update the dictionary. However, MOD requires the computation of a matrix inverse in the dictionary update, which is almost impracticable for large dictionaries [11]. In 2006, K-SVD was presented to learn D from Y [11] in image signal processing by using singular value decomposition (SVD) [33]. This is an iterative method that alternates between sparse coding of the examples based on the current dictionary and an update of the dictionary atoms so as to better fit the data; the sparse representation coefficients are thus updated together with the dictionary columns. Extended K-SVD (EK-SVD) [12] was presented in 2008 on the basis of K-SVD with a dictionary-optimisation process: EK-SVD starts with a large number of dictionary elements and systematically prunes the under-utilised or similar-looking elements to produce a well-trained dictionary with no redundant elements [12].
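The rank-1 atom update that K-SVD performs can be sketched as follows. This is a minimal illustration of the standard single-atom update, not the authors' implementation; the function name and argument layout are ours:

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Rank-1 (K-SVD style) update of atom k and its coefficient row.

    Y : (n, L) training signals, D : (n, K) dictionary, X : (K, L) sparse codes.
    Only the examples currently using atom k are considered.
    """
    omega = np.flatnonzero(X[k, :])            # examples that use atom k
    if omega.size == 0:
        return D, X                            # atom unused; leave it unchanged
    # Representation error without atom k's contribution, restricted to omega.
    E_k = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
    # Best rank-1 approximation of the restricted error matrix.
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                          # new atom = first left singular vector
    X[k, omega] = s[0] * Vt[0, :]              # updated coefficients for these examples
    return D, X
```

Only the first singular vector pair is used here; the information carried by the remaining ranks of the SVD is discarded, which is exactly the limitation that motivates M-PCA and M-PCA-N.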
In summary, dictionary learning consists of two stages: sparse coding and dictionary updating. Sparse coding is the process of computing the representation coefficients X based on the given signals Y and the dictionary D [11]. Dictionary updating is the process of updating the atoms of the dictionary using the training data; a sketch of how the two stages alternate is given below.
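The following is a minimal sketch of this alternation, assuming placeholder callables `sparse_coding` (e.g. OMP, sketched further below) and `atom_update` (e.g. the rank-1 update above); it is not the paper's algorithm, only the generic two-stage loop:

```python
import numpy as np

def learn_dictionary(Y, D0, T0, sparse_coding, atom_update, n_iter=20):
    """Alternate sparse coding of Y with the current D and per-atom updates of D.

    sparse_coding(Y, D, T0) -> X   and   atom_update(Y, D, X, k) -> (D, X)
    are placeholders for the two stages described in the text.
    """
    D = D0.copy()
    for _ in range(n_iter):
        X = sparse_coding(Y, D, T0)            # stage 1: fix D, solve for X
        for k in range(D.shape[1]):            # stage 2: fix supports, update atoms
            D, X = atom_update(Y, D, X, k)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # keep atoms unit-norm
    return D, X
```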
The earliest work in the field of sparse coding is [34], in which Mallat and Zhang proposed matching pursuit (MP) in 1993. MP is an iterative algorithm that, at every step of the decomposition, projects the residual onto the dictionary atom that best matches it. The selected atoms are not orthogonal to one another, so in later steps of the decomposition the residual vector can still have components in the space spanned by the previously selected atoms. To solve this problem, Pati et al. [35] proposed orthogonal MP (OMP), a modification of MP, also in 1993; it maintains full backward orthogonality of the residual at every step and thereby improves convergence. On the basis of OMP, several extended algorithms such as look-ahead OMP [36, 37], generalised OMP [38] and stagewise OMP [39] have been proposed. MP remains the simplest method for sparse coding and OMP is the most popular because of its effectiveness.
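For reference, a minimal numpy sketch of OMP for a single signal is shown below (assuming unit-norm atoms and T0 >= 1; variable names are ours, not the paper's):

```python
import numpy as np

def omp(y, D, T0):
    """Orthogonal matching pursuit: greedily select the atom most correlated with
    the residual, then re-fit all selected coefficients by least squares so the
    residual stays orthogonal to the span of the chosen atoms."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(T0):
        correlations = D.T @ residual
        k = int(np.argmax(np.abs(correlations)))   # best-matching atom
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs      # orthogonal to selected atoms
    x[support] = coeffs
    return x
```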
Conventional methods of dictionary updating, whether K-SVD or EK-SVD, use only a rank-1 approximation to update one atom at a time, neglecting the information from the other ranks. They are thus unable to cope with large dictionaries efficiently. Moreover, these algorithms are all used for image signal processing, whereas there is more variation in audio signal processing compared to image signal