Top Papers by Geoffrey Hinton, the Father of AI
Geoffrey Hinton, often called the "father of neural networks" and a pioneer of deep learning, earned a doctorate in artificial intelligence from the University of Edinburgh and is a distinguished professor at the University of Toronto. In 2012 he received Canada's Killam Prize, the country's top science award, sometimes described as the "Canadian Nobel Prize". In 2013 Hinton joined Google to lead an AI team; his work brought neural networks into a wave of research and application, turned deep learning from a fringe topic into a core technology relied on by Google and other internet giants, and applied the back-propagation algorithm to neural networks and deep learning.
LARGE SCALE DISTRIBUTED NEURAL NETWORK
TRAINING THROUGH ONLINE DISTILLATION
Rohan Anil
Google
rohananil@google.com
Gabriel Pereyra∗
Google DeepMind
pereyra@google.com
Alexandre Passos
Google Brain
apassos@google.com
Robert Ormandi
Google
ormandi@google.com
George E. Dahl
Google Brain
gdahl@google.com
Geoffrey E. Hinton
Google Brain
geoffhinton@google.com
ABSTRACT
Techniques such as ensembling and distillation promise model quality improve-
ments when paired with almost any base model. However, due to increased test-
time cost (for ensembles) and increased complexity of the training pipeline (for
distillation), these techniques are challenging to use in industrial settings. In this
paper we explore a variant of distillation which is relatively straightforward to use
as it does not require a complicated multi-stage setup or many new hyperparam-
eters. Our first claim is that online distillation enables us to use extra parallelism
to fit very large datasets about twice as fast. Crucially, we can still speed up train-
ing even after we have already reached the point at which additional parallelism
provides no benefit for synchronous or asynchronous stochastic gradient descent.
Two neural networks trained on disjoint subsets of the data can share knowledge
by encouraging each model to agree with the predictions the other model would
have made. These predictions can come from a stale version of the other model so
they can be safely computed using weights that only rarely get transmitted. Our
second claim is that online distillation is a cost-effective way to make the exact
predictions of a model dramatically more reproducible. We support our claims
using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the
largest to-date dataset used for neural language modeling, containing 6 × 10^11 tokens and based on the Common Crawl repository of web data.
1 INTRODUCTION
For large-scale, commercially valuable neural net training problems, practitioners would be will-
ing to devote many more machines to training if it sped up training time dramatically or improved
the quality of the final model. Currently, distributed stochastic gradient descent (SGD), in both its
synchronous and asynchronous forms (Chen et al., 2016), is the dominant algorithm for large-scale
neural network training across multiple interconnected machines. Unfortunately, as the number of
machines increases, there are diminishing improvements to the time needed to train a high quality
model, to a point where adding workers does not further improve training time. A combination of
infrastructure limitations and optimization barriers constrain the scalability of distributed minibatch
SGD. The overhead of communicating weight updates and the long tail of the machine and network
latency distributions slow down execution and produce thorny engineering challenges. For the syn-
chronous algorithm, there are rapidly diminishing returns from increasing the effective batch size
(LeCun et al., 2012; Keskar et al., 2017). For the asynchronous algorithm, gradient interference
from inconsistent weights can cause updates to thrash and even, in some cases, result in worse final
accuracy or completely stall learning progress. The precise scalability limit for distributed SGD will
depend on implementation details of the algorithm, specifics of the infrastructure, and the capabili-
ties of the hardware, but in our experience it can be very difficult to scale effectively much beyond
∗ Work completed while G. Pereyra was a Google Brain resident.
a hundred GPU workers in realistic setups. No algorithm for training neural nets will be infinitely
scalable, but even scaling a bit beyond the limits of distributed SGD would be extremely valuable.
Once we have reached the limits of adding workers to distributed SGD, we could instead use extra
machines to train another copy of the model and create an ensemble to improve accuracy (or trade
this accuracy for training time by training the members of the ensemble for fewer steps). As an added
benefit, the ensemble will make more stable and reproducible predictions, which can be useful in
practical applications. However, ensembling increases the cost at test time, potentially violating
latency or other cost constraints. Alternatively, to get nearly the same benefits of the ensemble
without increasing test time costs, we can distill (Hinton et al., 2015; Bucila et al., 2006) an n-way
ensemble of models into a single still-servable model using a two-phase process: first we use nM
machines to train an n-way ensemble with distributed SGD and then use M machines to train the
servable student network to mimic the n-way ensemble. By adding another phase to the training
process and using more machines, distillation in general increases training time and complexity in
return for a quality improvement close to the larger teacher ensemble model.
We believe that the additional training costs, in terms of both time and pipeline complexity, dis-
courage practitioners from using ensemble distillation, even though it almost always would improve
results. In this work, we describe a simpler online variant of distillation we call codistillation. Codis-
tillation trains n copies of a model in parallel by adding a term to the loss function of the ith model
to match the average prediction of the other models.
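As a rough illustration of this loss term (a sketch of our own, not code from the paper), the i-th copy can be penalized with the cross entropy between its predictions and the averaged, detached predictions of the other copies; the weighting factor alpha and the use of PyTorch are assumptions made for the example.

import torch
import torch.nn.functional as F

def codistillation_loss(logits_i, other_logits, labels, alpha=1.0):
    # Standard supervised cross entropy on the true labels.
    ce = F.cross_entropy(logits_i, labels)
    # Average predictive distribution of the other model copies, detached so
    # that no gradients flow into the (possibly stale) teacher copies.
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(l, dim=-1) for l in other_logits]).mean(dim=0)
    # Distillation term: cross entropy against the averaged soft targets.
    distill = -(teacher_probs * F.log_softmax(logits_i, dim=-1)).sum(-1).mean()
    return ce + alpha * distill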
Through large-scale experiments we show that, compared to distributed SGD, codistillation im-
proves accuracy and speeds up training by allowing the productive use of more computational re-
sources even beyond the point where adding more workers provides no additional speedup for SGD.
Specifically, codistillation provides the benefits of distilling an ensemble of models without increas-
ing training time. Codistillation is also quite simple to use in practice compared to a multi-phase
distillation training procedure. Multi-phase distillation tends to encourage human intervention be-
tween the training phases to decide when to stop training the ensemble and start distilling it into a
single model. We also show that codistillation does not lose the reproducibility benefits of ensem-
bles of neural networks, reducing churn in the predictions of different retrains of the same model.
Reducing prediction churn can be essential when testing and launching new versions of a model in a
non-disruptive way in an existing service, although it is not as well-studied in the academic machine
learning community.
Given the obvious relationship to distillation, very similar algorithms to codistillation have been in-
dependently described by multiple researchers. For example, Zhang et al. (2017) describes another
simultaneous distillation algorithm but does not investigate the benefits in the distributed training
case and only presents it as a potential quality improvement over regular distillation. We view the
experimental validation of codistillation at scale as the primary contribution of our work. Another
contribution of this work is our exploration of different design choices and implementation consid-
erations for codistillation which we believe has produced recommendations of substantial practical
utility.
In general, we believe the quality gains of codistillation over well-tuned offline distillation will be
minor in practice and the more interesting research direction is exploring codistillation as a dis-
tributed training algorithm that uses an additional form of communication that is far more delay
tolerant.
1.1 RELATED WORK
In addition to the closely related work in Hinton et al. (2015) and Zhang et al. (2017) mentioned
above, there are many different tactics for scaling up neural network training. Early work in training
large distributed neural networks focused on schemes for partitioning networks over multiple cores,
often referred to as model parallelism (Dean et al., 2012). As memory has increased on graphics
processing units (GPUs), the majority of distributed training has shifted towards data parallelism,
where the model is replicated across multiple machines and data are distributed to the different
replicas, with updates being merged by parameter servers or a single allreduce step as in Goyal et al.
(2017). Even without a high quality allreduce primitive, variants of centralized synchronous SGD
with backup workers can scale to a large number of machines (Chen et al., 2016).
Methods like ensembling and distillation are mostly orthogonal to lower level distributed training
infrastructure. However, mixture of experts models have particularly natural model parallelism that
can be integrated with data parallelism and a synchronous training scheme. Gross et al. (2017) and
Shazeer et al. (2017) are notable examples of recent work in this area.
As researchers try to scale neural network training to ever larger datasets and models, the optimiza-
tion algorithm itself can be altered. For synchronous SGD there are rapidly diminishing returns
(LeCun et al., 2012; Keskar et al., 2017) as the number of workers, and thus the effective batch size,
increases and we might hope that algorithms like KFAC (Ba et al., 2017) would make better use of
large batches. Although a promising direction for research, in this work we focus on what should
hopefully be an optimization algorithm agnostic way to improve scalability and reproducibility.
2 CODISTILLATION
Distillation is a meta-algorithm which allows any algorithm to incorporate some of the model quality
benefits of ensembles. The idea of distillation is to first train a teacher model, which traditionally
is an ensemble or another high-capacity model, and then, once this teacher model is trained, train
a student model with an additional term in the loss function which encourages its predictions to be
similar to the predictions of the teacher model.
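A minimal sketch of this two-phase setup, assuming the teacher model is already trained and frozen; the temperature and alpha hyperparameters and the PyTorch phrasing are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def student_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, alpha=0.5):
    # Usual cross entropy on the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    # Teacher predictions act as fixed soft targets (teacher is already trained).
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    soft = -(soft_targets *
             F.log_softmax(student_logits / temperature, dim=-1)).sum(-1).mean()
    # Blend the two terms; only the student receives gradients.
    return (1.0 - alpha) * hard + alpha * soft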
There are many variants of distillation, for different types of teacher model, different types of loss
function, and different choices for what dataset the student model trains on. For example, the student
model could be trained on a large unlabeled dataset, on a held-out data set, or even on the original
training set.
Perhaps surprisingly, distillation has benefits even if the teacher model and the student model are
two instances of the same neural network (see section 3 for empirical evidence), as long as they are
sufficiently different (say, by having different initializations and seeing the examples in a different
order). Furthermore, the teacher model predictions are still beneficial to the student model even
before convergence. Finally, the distinction between teacher and student is unnecessary and two or
more models all distilling from each other can also be useful.
In this paper, we use codistillation to refer to distillation performed:
1. using the same architecture for all the models;
2. using the same dataset to train all the models; and
3. using the distillation loss during training before any model has fully converged.
For simplicity, we usually consider the case where all models have a distillation term in their loss
function, but the key characteristic of codistillation is the simultaneous training of a model and its
teacher.
Algorithm 1 presents the codistillation algorithm. The distillation loss term ψ can be the squared
error between the logits of the models, the KL divergence between the predictive distributions, or
some other measure of agreement between the model predictions. In this work we use the cross
entropy error treating the teacher predictive distribution as soft targets. In the beginning of training,
the distillation term in the loss is not very useful or may even be counterproductive, so to main-
tain model diversity longer and to avoid a complicated loss function schedule we only enable the
distillation term in the loss function once training has gotten off the ground.
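The following is a sketch of what a single codistillation update might look like under these choices, using cross entropy against the teacher's detached predictive distribution for ψ; the burn_in_steps threshold, the alpha weight, and the optimizer handling are our own assumptions, not Algorithm 1 verbatim.

import torch
import torch.nn.functional as F

def codistillation_step(model, teacher, optimizer, batch, step,
                        burn_in_steps=10000, alpha=1.0):
    inputs, labels = batch
    logits = model(inputs)
    loss = F.cross_entropy(logits, labels)
    # Only enable the distillation term once training has gotten off the ground.
    if step >= burn_in_steps:
        with torch.no_grad():  # teacher predictions are constants for this step
            teacher_probs = F.softmax(teacher(inputs), dim=-1)
        # psi: cross entropy treating the teacher distribution as soft targets.
        psi = -(teacher_probs * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        loss = loss + alpha * psi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)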
2.1 CODISTILLATION AS A DISTRIBUTED NEURAL NETWORK TRAINING ALGORITHM
In order to scale beyond the limits of distributed stochastic gradient descent we will need an algo-
rithm that is far more communication efficient. As seen in Algorithm 1, to update the parameters of
one network using codistillation one only needs the predictions of the other networks, which can be
computed locally from copies of the other networks' weights.
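The sketch below shows one way a worker could keep a rarely refreshed local copy of another network's weights and use it to compute teacher predictions locally; the refresh interval and the fetch_remote_state callable are hypothetical placeholders, not the paper's actual infrastructure.

import copy

def maybe_refresh_teacher(teacher, fetch_remote_state, step, refresh_every=50000):
    # Overwrite the stale local teacher copy only occasionally; communication
    # can be this infrequent because only predictions, not gradients, rely on it.
    if step % refresh_every == 0:
        teacher.load_state_dict(fetch_remote_state())
    return teacher

# Usage sketch: the teacher starts as a copy of the local model and is then
# refreshed from checkpoints of the other model only every refresh_every steps.
# teacher = copy.deepcopy(model)
# for step, batch in enumerate(data_loader):
#     teacher = maybe_refresh_teacher(teacher, fetch_remote_state, step)
#     codistillation_step(model, teacher, optimizer, batch, step)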
There are several reasons to believe that stale predictions might be much less of a problem than stale
gradients for training: