been used in speech recognition (e.g., Chan et al. (2015); Geras et al. (2015); Li et al. (2014)) and
reinforcement learning (Parisotto et al. (2016); Rusu et al. (2016)). Romero et al. (2015) showed that
distillation methods can be used to train small students that are more accurate than the teacher
by making the student models deeper, but thinner, than the teacher model.
2.2 MIMIC LEARNING VIA L2 REGRESSION ON LOGITS
We train shallow mimic nets using data labeled by an ensemble of deep teacher nets trained on the
original 1-hot CIFAR-10 training data. The deep teacher models are trained in the usual way using
softmax outputs and cross-entropy cost function. Following Ba and Caruana (2014), the student
mimic models are not trained with cross-entropy on the ten $p$ values, where $p_k = e^{z_k} / \sum_j e^{z_j}$, output
by the softmax layer of the deep teacher model, but instead are trained on the un-normalized log
probability values $z$ (the logits) before the softmax activation. Training on the logarithms of the predicted
probabilities (the logits) helps provide the dark knowledge that regularizes students by placing emphasis
on the relationships learned by the teacher model across all of the outputs.
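As a rough illustration of this distinction (our sketch, not code from Ba and Caruana (2014)), the NumPy snippet below computes the teacher's softmax probabilities from a hypothetical logit vector; the mimic targets are the logits themselves, which preserve the relative scores across all ten classes that the softmax squashes toward zero.

```python
import numpy as np

# Hypothetical teacher logits z for one CIFAR-10 image (10 classes).
z = np.array([6.2, 1.3, -0.4, 4.8, -1.1, 0.2, -0.7, 0.9, -2.0, -1.5])

# Teacher softmax output: p_k = exp(z_k) / sum_j exp(z_j)
# (shifted by max(z) for numerical stability).
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(p.round(4))  # probabilities concentrate on a few classes
print(z)           # the logits keep the relative scores of every class
```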
As in Ba and Caruana (2014), the student is trained as a regression problem given training data
$\{(x^{(1)}, z^{(1)}), \ldots, (x^{(T)}, z^{(T)})\}$:
$$ L(W) \;=\; \frac{1}{T} \sum_{t} \left\| g(x^{(t)}; W) - z^{(t)} \right\|_2^2 , \qquad (1) $$
where $W$ represents all of the weights in the network, and $g(x^{(t)}; W)$ is the model prediction on the $t$-th training data sample.
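A minimal sketch of this objective, assuming PyTorch and a generic student module (the function and variable names are ours, not the paper's):

```python
import torch

def mimic_loss(student: torch.nn.Module,
               x: torch.Tensor,
               teacher_logits: torch.Tensor) -> torch.Tensor:
    """L2 regression of student logits onto teacher logits, as in Eq. (1)."""
    pred = student(x)                               # g(x^(t); W), shape (batch, 10)
    sq_err = (pred - teacher_logits).pow(2).sum(1)  # ||g(x^(t); W) - z^(t)||_2^2
    return sq_err.mean()                            # (1/T) sum over the batch
```

The student's weights $W$ are then updated by backpropagation; the learning rate and momentum used for this are tuned as described in Section 2.4.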
2.3 USING A LINEAR BOTTLENECK TO SPEED UP TRAINING
A shallow net has to have more hidden units in each layer to match the number of parameters in
a deep net. Ba and Caruana (2014) found that training these wide, shallow mimic models with
backpropagation was slow, and introduced a linear bottleneck layer between the input and non-linear
layers to speed learning. The bottleneck layer speeds learning by reducing the number of parameters
that must be learned, but does not make the model deeper because the linear terms can be absorbed
back into the non-linear weight matrix after learning. See Ba and Caruana (2014) for details. To match
their experiments we use linear bottlenecks when training student models with 0 or 1 convolutional
layers, but do not find the bottlenecks necessary when training student models with more than
one convolutional layer.
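The parameter saving behind the bottleneck, and the fact that it can be folded away after training, can be seen in the small NumPy sketch below (the layer sizes are illustrative, not the ones used in our experiments):

```python
import numpy as np

d_in, k, h = 3072, 128, 1200  # input dim, bottleneck width, wide non-linear layer (illustrative)

# Factored weights learned with the bottleneck: x -> Bx -> sigma(W(Bx) + b).
# Trainable parameters: d_in*k + k*h instead of d_in*h for a direct connection.
B = 0.01 * np.random.randn(k, d_in)   # linear bottleneck (no activation)
W = 0.01 * np.random.randn(h, k)      # weights into the non-linear layer
b = np.zeros(h)

# After training, the two linear maps collapse into a single h x d_in matrix,
# so the deployed model is no deeper than a plain shallow net.
W_absorbed = W @ B

x = np.random.randn(d_in)
assert np.allclose(W @ (B @ x) + b, W_absorbed @ x + b)
```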
2.4 BAYESIAN HYPERPARAMETER OPTIMIZATION
The goal of this work is to determine empirically if shallow nets can be trained to be as accurate as
deep convolutional models using a similar number of parameters in the deep and shallow models. If
we succeed in training a shallow model to be as accurate as a deep convolutional model, this provides
an existence proof that shallow models can represent and learn the complex functions learned by
deep convolutional models. If, however, we are unable to train shallow models to be as accurate as
deep convolutional nets, we might fail only because we did not train the shallow nets well enough.
In all our experiments we employ Bayesian hyperparameter optimization using Gaussian process
regression to ensure that we thoroughly and objectively explore the hyperparameters that govern
learning. The implementation we use is Spearmint (Snoek et al., 2012). The hyperparameters we
optimize with Bayesian optimization include the initial learning rate, momentum, scaling of the initial
random weights, scaling of the inputs, and terms that determine the width of each of the network’s
layers (i.e. number of convolutional filters and neurons). More details of the hyperparameter
optimization can be found in Sections 2.5, 2.7, 2.8 and in the Appendix.
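Our experiments use Spearmint itself; purely as an illustrative stand-in, the sketch below runs a GP-based search with scikit-optimize's `gp_minimize` over a comparable space, with a dummy objective in place of actually training a student net (all names and ranges here are assumptions for illustration):

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical search space mirroring the hyperparameters listed above.
space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="initial_lr"),
    Real(0.5, 0.99, name="momentum"),
    Real(0.01, 1.0, prior="log-uniform", name="init_weight_scale"),
    Real(0.1, 10.0, prior="log-uniform", name="input_scale"),
    Integer(32, 256, name="conv_filters"),
    Integer(1000, 30000, name="hidden_units"),
]

def objective(params):
    # Placeholder: in practice, train a student with these settings and
    # return its validation error. A dummy value keeps the sketch runnable.
    initial_lr, momentum, w_scale, in_scale, filters, hidden = params
    return (initial_lr - 0.01) ** 2 + (momentum - 0.9) ** 2

# Gaussian process regression over past trials proposes the next setting to try.
result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best setting:", result.x, "objective:", result.fun)
```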
2.5 TRAINING DATA AND DATA AUGMENTATION
The CIFAR-10 data set (Krizhevsky, 2009) consists of natural images from 10 different object
classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The data set is a labeled
subset of the 80 million tiny images dataset (Torralba et al., 2008) and is divided into 50,000 train and