A Fast Proximal Method for Convolutional Sparse Coding
Rakesh Chalasani Jose C. Principe Naveen Ramakrishnan
Abstract—Sparse coding, an unsupervised feature learning
technique, is often used as a basic building block to construct
deep networks. Convolutional sparse coding has been proposed in the
literature to overcome the scalability issues of sparse coding
techniques on large images. In this paper, we propose an efficient
algorithm, based on the fast iterative shrinkage thresholding
algorithm (FISTA), for learning sparse convolutional features.
Through numerical experiments, we show that the proposed
convolutional extension of FISTA can not only lead to faster
convergence compared to existing methods but can also easily
generalize to other cost functions.
Index Terms—Convolution, Sparse Coding, Feature Extraction, Unsupervised Learning.
I. INTRODUCTION
Choosing the appropriate data representation (i.e. feature
space) that has desirable properties for a given task (e.g.
classification, clustering) is a central issue in statistical
machine learning. This issue becomes all the more important
for computer vision tasks (e.g. object detection) due to
the high dimensionality of the input features and the high
variability across instances. Hand-crafted features like SIFT [1],
HoG [2], etc., have been successful in overcoming these issues. Recent
developments in deep architectures, where multiple layers
of feature extractors (e.g. RBM, sparse coding blocks) are
stacked, attempt to learn the features automatically in an
unsupervised manner using large-scale unlabeled data [3],
[4]. Some of the popular deep networks use sparse encoders
as building blocks [5], [6], [7] and hence, in this work we
focus on these sparse coding blocks.
Generally, sparse coding is based on the idea that an
observation, $y \in \mathbb{R}^p$, can be encoded using an over-complete
dictionary of filters, $C \in \mathbb{R}^{p \times k}$ ($k > p$), and a sparse vector
$x \in \mathbb{R}^k$. More formally, this can be written as
$$\hat{x} = \arg\min_x \frac{1}{2}\|y - Cx\|_2^2 + \lambda\|x\|_1 \qquad (1)$$
The $\ell_1$-norm on $x$ ensures that the latent vector is sparse.
Several efficient solvers, such as coordinate descent (CoD) [8],
the fast iterative shrinkage thresholding algorithm (FISTA) [9],
and the feature-sign algorithm [10], can be readily applied to
solve the above optimization problem. The dictionary C can
also be learned from the data [10], [11].
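As an illustration of such a solver, the following is a minimal FISTA sketch for (1) in Python/NumPy; the function names, fixed iteration count, and step-size choice are our own illustrative assumptions, not details from the paper.

```python
# A minimal sketch of solving (1) with FISTA, assuming a fixed dictionary C.
# The step size 1/L uses the Lipschitz constant of the gradient of the
# smooth term, L = ||C||_2^2 (squared spectral norm of C).
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: element-wise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(y, C, lam, n_iters=100):
    # Solve min_x 0.5*||y - Cx||_2^2 + lam*||x||_1.
    L = np.linalg.norm(C, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(C.shape[1])
    z, t = x.copy(), 1.0                   # extrapolation point and momentum
    for _ in range(n_iters):
        grad = C.T @ (C @ z - y)           # gradient of the smooth term at z
        x_new = soft_threshold(z - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)  # Nesterov extrapolation
        x, t = x_new, t_new
    return x
```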
However, in most applications of sparse coding, many overlapping
patches across the image are processed separately.
Rakesh Chalasani and Jose C. Principe are with the Department of Electrical
and Computer Engineering, University of Florida, Gainesville, FL, USA.
Naveen Ramakrishnan is with Robert Bosch LLC, Research and Technology
Center North America, Pittsburgh, PA, USA. (email: rakeshch@ufl.edu,
principe@cnel.ufl.edu, Naveen.Ramakrishnan@us.bosch.com)
This work is partially supported by ONR grant #N000141010375. The
first author performed part of this work while at Bosch RTC, Pittsburgh
during summer 2012.
This is often too slow in practice, making it difficult to scale
to large images. Moreover, sparse coding alone is not capable
of encoding translations in the observations. Learning the
dictionary in this context produces several shifted versions
of the same filter, such that each patch can be reconstructed
individually [12]. During inference, when performed on all
the overlapping patches, this can lead to a very redundant
representation. To overcome these limitations, convolutional
sparse coding has been proposed [5], [12]. Here, sparse coding
is applied over the entire image and the dictionary is a
convolutional filter bank with $M$ kernels, such that
$$x = \arg\min_x \frac{1}{2}\Big\|I - \sum_{m=1}^{M} C_m * x_m\Big\|_2^2 + \lambda \sum_{m=1}^{M} \|x_m\|_1 \qquad (2)$$
where $I$ is an image of size $(w \times h)$, $C_m$ is a filter kernel
of size $(s \times s)$ in the dictionary, $x_m$ is a sparse matrix of size
$(w + s - 1) \times (h + s - 1)$, $\lambda$ is the sparsity parameter and
$*$ represents a 2D convolution operator$^1$.
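To make the dimensions in (2) concrete, the cost can be evaluated with SciPy's 2D convolution in 'valid' mode, so that each reconstruction term matches the image size; this is a hedged sketch with illustrative variable names, not code from the paper.

```python
# A minimal sketch of evaluating the cost in (2): M kernels of size s x s,
# feature maps of size (w+s-1) x (h+s-1), so that the 'valid' convolution of
# each map with its kernel has the same (w, h) size as the image I.
import numpy as np
from scipy.signal import convolve2d

def conv_sc_cost(I, C, X, lam):
    # I: (w, h) image; C: list of M (s, s) kernels; X: list of M feature maps.
    recon = sum(convolve2d(x_m, c_m, mode='valid') for c_m, x_m in zip(C, X))
    residual = I - recon
    return 0.5 * np.sum(residual ** 2) + lam * sum(np.abs(x_m).sum() for x_m in X)
```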
In this work, we propose an algorithm that solves the optimization
problem in (2) efficiently and scales to large images. It is an
extension of the fast iterative shrinkage thresholding algorithm
(FISTA) [9], solved using proximal gradient updates. In addition,
we also extend our method to include a feed-forward predictor
(predictive sparse decomposition [6]) in the cost, for joint
optimization during inference and learning.
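To convey the flavor of such a proximal update before the details in section II, the sketch below performs a single ISTA-style step on (2), without the momentum term that FISTA adds; the fixed step size eta is an assumption, and this simplified step is not the paper's exact algorithm. The key point is that the gradient with respect to each feature map is a correlation of the residual with the corresponding kernel, i.e. a 'full' convolution with the flipped kernel.

```python
# A simplified proximal-gradient (ISTA-style) step on (2); FISTA would add
# Nesterov momentum around this update. Step size `eta` is assumed fixed here.
import numpy as np
from scipy.signal import convolve2d

def ista_step(I, C, X, lam, eta):
    # Residual of the current reconstruction ('valid' mode, size (w, h)).
    residual = I - sum(convolve2d(x, c, mode='valid') for c, x in zip(C, X))
    X_new = []
    for c, x in zip(C, X):
        # Gradient w.r.t. x_m is -1 times the 'full' convolution of the
        # residual with the flipped kernel (the adjoint of 'valid' convolution).
        grad = -convolve2d(residual, c[::-1, ::-1], mode='full')
        v = x - eta * grad
        # Proximal operator of eta*lam*||.||_1 (soft-thresholding).
        X_new.append(np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0))
    return X_new
```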
Convolutional sparse coding was also previously studied
in [5], [12]. Zeiler et al. [5] proposed a method to solve
the optimization problem in (2) by introducing an additional
auxiliary variable, which demands solving a large linear
system (whose size is proportional to the size of the image)
at every iteration; although the complexity can be reduced
by using conjugate gradient, their approach does not scale
to large images. On the other hand, the authors in [12]
proposed a convolutional extension to CoD whose computation
per iteration is small compared to the method in [5], but
the number of iterations required for convergence
becomes large for large images. In the following sections, we
compare and contrast convolutional CoD with our method to
show the performance improvements achieved.
The rest of the paper is organized as follows: section II
describes convolutional FISTA and procedures to learn the
parameters of the model. Also, the extension of the method
for predictive sparse decomposition (PSD) is discussed.
Several experiments are described in section III to show the
performance of the proposed method and compare it with existing methods.
$^1$ All the variables henceforth represent matrices, unless otherwise stated.
Also, the convolution operator is applied in 'full' or 'valid' mode, depending
on the context.