been used in speech recognition (e.g., Chan et al. (2015); Geras et al. (2015); Li et al. (2014)) and
reinforcement learning (Parisotto et al. (2016); Rusu et al. (2016)). Romero et al. (2015) showed that
distillation methods can be used to train small students that are more accurate than the teacher
by making the student models deeper, but thinner, than the teacher model.
2.2 MIMIC LEARNING VIA L2 REGRESSION ON LOGITS
We train shallow mimic nets using data labeled by an ensemble of deep teacher nets trained on the
original 1-hot CIFAR-10 training data. The deep teacher models are trained in the usual way using
softmax outputs and cross-entropy cost function. Following Ba and Caruana (2014), the student
mimic models are not trained with cross-entropy on the ten $p$ values, where $p_k = e^{z_k} / \sum_j e^{z_j}$, output
by the softmax layer of the deep teacher model, but instead are trained on the un-normalized log
probability values $z$ (the logits) before the softmax activation. Training on the logarithms of the predicted
probabilities (the logits) helps provide the dark knowledge that regularizes students by placing emphasis
on the relationships learned by the teacher model across all of the outputs.
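As a rough illustration of this distinction (our sketch, not code from Ba and Caruana (2014)), the NumPy snippet below computes the teacher's softmax probabilities from a hypothetical logit vector; the mimic targets are the logits themselves, which preserve the relative scores across all ten classes that the softmax squashes toward zero.

```python
import numpy as np

# Hypothetical teacher logits z for one CIFAR-10 image (10 classes).
z = np.array([6.2, 1.3, -0.4, 4.8, -1.1, 0.2, -0.7, 0.9, -2.0, -1.5])

# Teacher softmax output: p_k = exp(z_k) / sum_j exp(z_j)
# (shifted by max(z) for numerical stability).
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(p.round(4))  # probabilities concentrate on a few classes
print(z)           # the logits keep the relative scores of every class
```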
As in Ba and Caruana (2014), the student is trained as a regression problem given training data
$\{(x^{(1)}, z^{(1)}), \ldots, (x^{(T)}, z^{(T)})\}$:
$$ L(W) \;=\; \frac{1}{T} \sum_{t} \left\| g(x^{(t)}; W) - z^{(t)} \right\|_2^2 , \qquad (1) $$
where $W$ represents all of the weights in the network, and $g(x^{(t)}; W)$ is the model prediction on the $t$-th training data sample.
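A minimal sketch of this objective, assuming PyTorch and a generic student module (the function and variable names are ours, not the paper's):

```python
import torch

def mimic_loss(student: torch.nn.Module,
               x: torch.Tensor,
               teacher_logits: torch.Tensor) -> torch.Tensor:
    """L2 regression of student logits onto teacher logits, as in Eq. (1)."""
    pred = student(x)                               # g(x^(t); W), shape (batch, 10)
    sq_err = (pred - teacher_logits).pow(2).sum(1)  # ||g(x^(t); W) - z^(t)||_2^2
    return sq_err.mean()                            # (1/T) sum over the batch
```

The student's weights $W$ are then updated by backpropagation; the learning rate and momentum used for this are tuned as described in Section 2.4.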
2.3 USING A LINEAR BOTTLENECK TO SPEED UP TRAINING
A shallow net has to have more hidden units in each layer to match the number of parameters in
a deep net. Ba and Caruana (2014) found that training these wide, shallow mimic models with
backpropagation was slow, and introduced a linear bottleneck layer between the input and non-linear
layers to speed learning. The bottleneck layer speeds learning by reducing the number of parameters
that must be learned, but does not make the model deeper because the linear terms can be absorbed
back into the non-linear weight matrix after learning. See Ba and Caruana (2014) for details. To match
their experiments we use linear bottlenecks when training student models with 0 or 1 convolutional
layers, but do not find the bottlenecks necessary when training student models with more than
one convolutional layer.
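The parameter saving behind the bottleneck, and the fact that it can be folded away after training, can be seen in the small NumPy sketch below (the layer sizes are illustrative, not the ones used in our experiments):

```python
import numpy as np

d_in, k, h = 3072, 128, 1200  # input dim, bottleneck width, wide non-linear layer (illustrative)

# Factored weights learned with the bottleneck: x -> Bx -> sigma(W(Bx) + b).
# Trainable parameters: d_in*k + k*h instead of d_in*h for a direct connection.
B = 0.01 * np.random.randn(k, d_in)   # linear bottleneck (no activation)
W = 0.01 * np.random.randn(h, k)      # weights into the non-linear layer
b = np.zeros(h)

# After training, the two linear maps collapse into a single h x d_in matrix,
# so the deployed model is no deeper than a plain shallow net.
W_absorbed = W @ B

x = np.random.randn(d_in)
assert np.allclose(W @ (B @ x) + b, W_absorbed @ x + b)
```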
2.4 BAYESIAN HYPERPARAMETER OPTIMIZATION
The goal of this work is to determine empirically if shallow nets can be trained to be as accurate as
deep convolutional models using a similar number of parameters in the deep and shallow models. If
we succeed in training a shallow model to be as accurate as a deep convolutional model, this provides
an existence proof that shallow models can represent and learn the complex functions learned by
deep convolutional models. If, however, we are unable to train shallow models to be as accurate as
deep convolutional nets, we might fail only because we did not train the shallow nets well enough.
In all our experiments we employ Bayesian hyperparameter optimization using Gaussian process
regression to ensure that we thoroughly and objectively explore the hyperparameters that govern
learning. The implementation we use is Spearmint (Snoek et al., 2012). The hyperparameters we
optimize with Bayesian optimization include the initial learning rate, momentum, scaling of the initial
random weights, scaling of the inputs, and terms that determine the width of each of the network’s
layers (i.e. number of convolutional filters and neurons). More details of the hyperparameter
optimization can be found in Sections 2.5, 2.7, 2.8 and in the Appendix.
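Our experiments use Spearmint itself; purely as an illustrative stand-in, the sketch below runs a GP-based search with scikit-optimize's `gp_minimize` over a comparable space, with a dummy objective in place of actually training a student net (all names and ranges here are assumptions for illustration):

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical search space mirroring the hyperparameters listed above.
space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="initial_lr"),
    Real(0.5, 0.99, name="momentum"),
    Real(0.01, 1.0, prior="log-uniform", name="init_weight_scale"),
    Real(0.1, 10.0, prior="log-uniform", name="input_scale"),
    Integer(32, 256, name="conv_filters"),
    Integer(1000, 30000, name="hidden_units"),
]

def objective(params):
    # Placeholder: in practice, train a student with these settings and
    # return its validation error. A dummy value keeps the sketch runnable.
    initial_lr, momentum, w_scale, in_scale, filters, hidden = params
    return (initial_lr - 0.01) ** 2 + (momentum - 0.9) ** 2

# Gaussian process regression over past trials proposes the next setting to try.
result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best setting:", result.x, "objective:", result.fun)
```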
2.5 TRAINING DATA AND DATA AUGMENTATION
The CIFAR-10 data set (Krizhevsky, 2009) consists of natural images from 10 different object
classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The data set is a labeled
subset of the 80 million tiny images dataset (Torralba et al., 2008) and is divided into 50,000 train and