6 Y. Bengio
good news is that for sparse coding, MAP inference is a convex optimization problem for which
several fast approximations have been proposed (Mairal et al., 2009; Gregor and LeCun, 2010a). It
is interesting to note the results obtained by Coates and Ng (2011), which suggest that sparse coding
is a better encoder but not a better learning algorithm than RBMs and sparse auto-encoders (neither
of which has explaining away). Note also that sparse coding can be generalized into the spike-and-slab
sparse coding algorithm (Goodfellow et al., 2012), in which MAP inference is replaced by variational
inference, and which was used to win the NIPS 2011 transfer learning challenge (Goodfellow et al.,
2011).
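One of the fast approximations of MAP inference alluded to above is iterative shrinkage-thresholding (ISTA), a proximal gradient method for the convex sparse coding objective. A minimal NumPy sketch follows; the dictionary, penalty weight, and iteration count are illustrative choices, not values from the text:

```python
import numpy as np

def ista(x, D, lam=0.1, step=None, n_iter=500):
    """MAP inference for sparse coding:
        argmin_h 0.5 * ||x - D h||^2 + lam * ||h||_1
    solved by ISTA (proximal gradient descent); the problem is convex,
    so gradient-based inference converges to the global optimum."""
    if step is None:
        # Safe step size: inverse of the Lipschitz constant ||D||_2^2.
        step = 1.0 / np.linalg.norm(D, 2) ** 2
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ h - x)                # gradient of the reconstruction term
        h = h - step * grad                     # gradient step
        # Soft-thresholding: proximal operator of the L1 sparsity penalty.
        h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)
    return h

# Toy demo with a random unit-norm dictionary and a 3-sparse ground truth.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
h_true = np.zeros(50)
h_true[[3, 17, 41]] = [1.0, -0.5, 0.8]
x = D @ h_true
h = ista(x, D, lam=0.05)
print(np.count_nonzero(np.abs(h) > 1e-3))      # only a few units stay active
```

The soft-thresholding step is what produces exact zeros in the code, giving the sparsity that a purely parametric encoder would have to approximate.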
Another interesting variant on sparse coding is the predictive sparse decomposition (PSD) algorithm (Kavukcuoglu
et al., 2008) and its variants, which combine properties of sparse coding and of auto-encoders. Sparse
coding can be seen as having only a parametric “generative” decoder (which maps latent variable
values to visible variable values) and a non-parametric encoder (which finds the latent variable values
that minimize the reconstruction error minus the log-prior on the latent variables). PSD adds a para-
metric encoder (just an affine transformation followed by a non-linearity) and learns it jointly with
the generative model, such that the output of the parametric encoder is close to the latent variable
values that reconstruct the input well.
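The combined objective just described can be sketched in NumPy as below. The symbol names (D for the decoder dictionary, W and b for the affine encoder, and the weights lam and alpha) are illustrative assumptions, not notation taken from the original papers:

```python
import numpy as np

def psd_loss(x, h, D, W, b, lam=0.1, alpha=1.0):
    """PSD-style objective for one example (a sketch, with assumed weightings):
    reconstruction + sparsity + a penalty tying the code h to the prediction
    of the fast parametric encoder f(x) = tanh(W x + b)."""
    f_x = np.tanh(W @ x + b)                        # affine map + non-linearity
    recon = 0.5 * np.sum((x - D @ h) ** 2)          # generative (decoder) term
    sparse = lam * np.sum(np.abs(h))                # sparsity prior on the code
    predict = 0.5 * alpha * np.sum((h - f_x) ** 2)  # encoder predicts the code
    return recon + sparse + predict

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
D = rng.standard_normal((16, 8))
W = rng.standard_normal((8, 16))
b = np.zeros(8)
h = np.tanh(W @ x + b)        # initialize the code at the encoder's prediction
print(psd_loss(x, h, D, W, b))
```

Training would alternate between minimizing this loss over h (inference, as in sparse coding) and gradient steps on D, W, and b; at test time the iterative inference can be skipped entirely and f(x) used directly as the code, which is the practical appeal of PSD.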
3 Scaling Computations
From a computational point of view, how do we scale the recent successes of deep learning to much
larger models and huge datasets, such that the models are actually richer and capture a very large
amount of information?
3.1 Scaling Computations: The Challenge
The beginnings of deep learning in 2006 focused on the MNIST digit image classification
problem (Hinton et al., 2006; Bengio et al., 2007), breaking the supremacy of SVMs (1.4% error)
on this dataset.^8 The latest records are still held by deep networks: Ciresan et al. (2012) currently
claim the title of state-of-the-art for the unconstrained version of the task (e.g., using a convolutional
architecture and stochastically deformed data), with 0.27% error.
In the last few years, deep learning has moved from digits to object recognition in natural
images, and the latest breakthrough has been achieved on the ImageNet dataset,^9 bringing
the state-of-the-art error rate (out of 5 guesses) down from 26.1% to 15.3% (Krizhevsky et al., 2012).
To achieve the above scaling from 28×28 grey-level MNIST images to 256×256 RGB images,
researchers have taken advantage of convolutional architectures (meaning that hidden units do not
need to be connected to all units at the previous layer but only to those in the same spatial area,
and that pooling units reduce the spatial resolution as we move from lower to higher layers). They
have also taken advantage of GPU technology to speed up computation by one or two orders of
magnitude (Raina et al., 2009; Bergstra et al., 2010, 2011; Krizhevsky et al., 2012).
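The two architectural ingredients just described, locally connected weight-shared units and resolution-reducing pooling, can be sketched in NumPy as follows; the kernel, image size, and pooling window are illustrative:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Local connectivity with weight sharing: each hidden unit sees only a
    small spatial patch, and the same kernel is applied at every position."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Pooling: reduces spatial resolution (here by a factor of `size`)
    as we move from lower to higher layers."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size                  # crop to a multiple of size
    f = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return f.max(axis=(1, 3))

img = np.random.default_rng(0).standard_normal((28, 28))   # MNIST-sized input
feat = conv2d_valid(img, np.ones((3, 3)) / 9.0)            # 26x26 feature map
pooled = max_pool(feat)                                    # 13x13 after 2x2 pooling
print(pooled.shape)  # (13, 13)
```

Because the kernel has only 9 parameters regardless of image size, the parameter count is decoupled from input resolution, which is exactly what makes the jump from 28×28 to 256×256 images tractable.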
We can expect computational power to continue to increase, mostly through increased parallelism
such as seen in GPUs, multicore machines, and clusters. In addition, computer memory has become
much more affordable, making it possible (at least on CPUs) to handle potentially huge models (in terms of
capacity).
However, whereas the task of recognizing handwritten digits is solved to the point of achieving
roughly human-level performance, this is far from true for tasks such as general object recognition,
scene understanding, speech recognition, or natural language understanding. What is needed to nail
those tasks and scale to even more ambitious ones?
^8 For the knowledge-free version of the task, where no image-specific prior (such as image deformations
or convolutions) is used, the current state-of-the-art is around 0.8% and involves deep learning (Rifai et al.,
2011b; Hinton et al., 2012b).
^9 The 1000-class ImageNet benchmark, whose results are detailed here:
http://www.image-net.org/challenges/LSVRC/2012/results.html