Top Papers by Geoffrey Hinton, the Father of AI
Geoffrey Hinton, often called the "father of neural networks" and a pioneer of deep learning, earned a doctorate in artificial intelligence from the University of Edinburgh and is a distinguished professor at the University of Toronto. In 2012 he received Canada's Killam Prize, the country's top science award, sometimes described as the "Canadian Nobel Prize". In 2013 Hinton joined Google to lead an AI team; his work brought neural networks into a wave of research and application, turned deep learning from a fringe topic into a core technology relied on by Google and other internet giants, and applied the back-propagation algorithm to neural networks and deep learning.
LARGE SCALE DISTRIBUTED NEURAL NETWORK
TRAINING THROUGH ONLINE DISTILLATION
Rohan Anil
Google
rohananil@google.com
Gabriel Pereyra∗
Google DeepMind
pereyra@google.com
Alexandre Passos
Google Brain
apassos@google.com
Robert Ormandi
Google
ormandi@google.com
George E. Dahl
Google Brain
gdahl@google.com
Geoffrey E. Hinton
Google Brain
geoffhinton@google.com
ABSTRACT
Techniques such as ensembling and distillation promise model quality improve-
ments when paired with almost any base model. However, due to increased test-
time cost (for ensembles) and increased complexity of the training pipeline (for
distillation), these techniques are challenging to use in industrial settings. In this
paper we explore a variant of distillation which is relatively straightforward to use
as it does not require a complicated multi-stage setup or many new hyperparam-
eters. Our first claim is that online distillation enables us to use extra parallelism
to fit very large datasets about twice as fast. Crucially, we can still speed up train-
ing even after we have already reached the point at which additional parallelism
provides no benefit for synchronous or asynchronous stochastic gradient descent.
Two neural networks trained on disjoint subsets of the data can share knowledge
by encouraging each model to agree with the predictions the other model would
have made. These predictions can come from a stale version of the other model so
they can be safely computed using weights that only rarely get transmitted. Our
second claim is that online distillation is a cost-effective way to make the exact
predictions of a model dramatically more reproducible. We support our claims
using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the
largest to-date dataset used for neural language modeling, containing 6 × 10^11 tokens and based on the Common Crawl repository of web data.
1 INTRODUCTION
For large-scale, commercially valuable neural net training problems, practitioners would be will-
ing to devote many more machines to training if it sped up training time dramatically or improved
the quality of the final model. Currently, distributed stochastic gradient descent (SGD), in both its
synchronous and asynchronous forms (Chen et al., 2016), is the dominant algorithm for large-scale
neural network training across multiple interconnected machines. Unfortunately, as the number of
machines increases, there are diminishing improvements to the time needed to train a high quality
model, to a point where adding workers does not further improve training time. A combination of
infrastructure limitations and optimization barriers constrain the scalability of distributed minibatch
SGD. The overhead of communicating weight updates and the long tail of the machine and network
latency distributions slow down execution and produce thorny engineering challenges. For the syn-
chronous algorithm, there are rapidly diminishing returns from increasing the effective batch size
(LeCun et al., 2012; Keskar et al., 2017). For the asynchronous algorithm, gradient interference
from inconsistent weights can cause updates to thrash and even, in some cases, result in worse final
accuracy or completely stall learning progress. The precise scalability limit for distributed SGD will
depend on implementation details of the algorithm, specifics of the infrastructure, and the capabili-
ties of the hardware, but in our experience it can be very difficult to scale effectively much beyond
∗ Work completed while G. Pereyra was a Google Brain resident.
a hundred GPU workers in realistic setups. No algorithm for training neural nets will be infinitely
scalable, but even scaling a bit beyond the limits of distributed SGD would be extremely valuable.
Once we have reached the limits of adding workers to distributed SGD, we could instead use extra
machines to train another copy of the model and create an ensemble to improve accuracy (or trade
this accuracy for training time by training the members of the ensemble for fewer steps). As an added
benefit, the ensemble will make more stable and reproducible predictions, which can be useful in
practical applications. However, ensembling increases the cost at test time, potentially violating
latency or other cost constraints. Alternatively, to get nearly the same benefits of the ensemble
without increasing test time costs, we can distill (Hinton et al., 2015; Bucila et al., 2006) an n-way
ensemble of models into a single still-servable model using a two-phase process: first we use nM
machines to train an n-way ensemble with distributed SGD and then use M machines to train the
servable student network to mimic the n-way ensemble. By adding another phase to the training
process and using more machines, distillation in general increases training time and complexity in
return for a quality improvement close to the larger teacher ensemble model.
We believe that the additional training costs, in terms of both time and pipeline complexity, dis-
courage practitioners from using ensemble distillation, even though it almost always would improve
results. In this work, we describe a simpler online variant of distillation we call codistillation. Codis-
tillation trains n copies of a model in parallel by adding a term to the loss function of the ith model
to match the average prediction of the other models.
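As a rough illustration of this loss term (a sketch of our own, not code from the paper), the i-th copy can be penalized with the cross entropy between its predictions and the averaged, detached predictions of the other copies; the weighting factor alpha and the use of PyTorch are assumptions made for the example.

import torch
import torch.nn.functional as F

def codistillation_loss(logits_i, other_logits, labels, alpha=1.0):
    # Standard supervised cross entropy on the true labels.
    ce = F.cross_entropy(logits_i, labels)
    # Average predictive distribution of the other model copies, detached so
    # that no gradients flow into the (possibly stale) teacher copies.
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(l, dim=-1) for l in other_logits]).mean(dim=0)
    # Distillation term: cross entropy against the averaged soft targets.
    distill = -(teacher_probs * F.log_softmax(logits_i, dim=-1)).sum(-1).mean()
    return ce + alpha * distill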
Through large-scale experiments we show that, compared to distributed SGD, codistillation im-
proves accuracy and speeds up training by allowing the productive use of more computational re-
sources even beyond the point where adding more workers provides no additional speedup for SGD.
Specifically, codistillation provides the benefits of distilling an ensemble of models without increas-
ing training time. Codistillation is also quite simple to use in practice compared to a multi-phase
distillation training procedure. Multi-phase distillation tends to encourage human intervention be-
tween the training phases to decide when to stop training the ensemble and start distilling it into a
single model. We also show that codistillation does not lose the reproducibility benefits of ensem-
bles of neural networks, reducing churn in the predictions of different retrains of the same model.
Reducing prediction churn can be essential when testing and launching new versions of a model in a
non-disruptive way in an existing service, although it is not as well-studied in the academic machine
learning community.
Given the obvious relationship to distillation, very similar algorithms to codistillation have been in-
dependently described by multiple researchers. For example, Zhang et al. (2017) describes another
simultaneous distillation algorithm but does not investigate the benefits in the distributed training
case and only presents it as a potential quality improvement over regular distillation. We view the
experimental validation of codistillation at scale as the primary contribution of our work. Another
contribution of this work is our exploration of different design choices and implementation consid-
erations for codistillation which we believe has produced recommendations of substantial practical
utility.
In general, we believe the quality gains of codistillation over well-tuned offline distillation will be
minor in practice and the more interesting research direction is exploring codistillation as a dis-
tributed training algorithm that uses an additional form of communication that is far more delay
tolerant.
1.1 RELATED WORK
In addition to the closely related work in Hinton et al. (2015) and Zhang et al. (2017) mentioned
above, there are many different tactics for scaling up neural network training. Early work in training
large distributed neural networks focused on schemes for partitioning networks over multiple cores,
often referred to as model parallelism (Dean et al., 2012). As memory has increased on graphics
processing units (GPUs), the majority of distributed training has shifted towards data parallelism,
where the model is replicated across multiple machines and data are distributed to the different
replicas, with updates being merged by parameter servers or a single allreduce step as in Goyal et al.
(2017). Even without a high quality allreduce primitive, variants of centralized synchronous SGD
with backup workers can scale to a large number of machines (Chen et al., 2016).
Methods like ensembling and distillation are mostly orthogonal to lower level distributed training
infrastructure. However, mixture of experts models have particularly natural model parallelism that
can be integrated with data parallelism and a synchronous training scheme. Gross et al. (2017) and
Shazeer et al. (2017) are notable examples of recent work in this area.
As researchers try to scale neural network training to ever larger datasets and models, the optimiza-
tion algorithm itself can be altered. For synchronous SGD there are rapidly diminishing returns
(LeCun et al., 2012; Keskar et al., 2017) as the number of workers, and thus the effective batch size,
increases and we might hope that algorithms like KFAC (Ba et al., 2017) would make better use of
large batches. Although a promising direction for research, in this work we focus on what should
hopefully be an optimization algorithm agnostic way to improve scalability and reproducibility.
2 CODISTILLATION
Distillation is a meta-algorithm which allows any algorithm to incorporate some of the model quality
benefits of ensembles. The idea of distillation is to first train a teacher model, which traditionally
is an ensemble or another high-capacity model, and then, once this teacher model is trained, train
a student model with an additional term in the loss function which encourages its predictions to be
similar to the predictions of the teacher model.
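A minimal sketch of this two-phase setup, assuming the teacher model is already trained and frozen; the temperature and alpha hyperparameters and the PyTorch phrasing are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def student_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, alpha=0.5):
    # Usual cross entropy on the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    # Teacher predictions act as fixed soft targets (teacher is already trained).
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    soft = -(soft_targets *
             F.log_softmax(student_logits / temperature, dim=-1)).sum(-1).mean()
    # Blend the two terms; only the student receives gradients.
    return (1.0 - alpha) * hard + alpha * soft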
There are many variants of distillation, for different types of teacher model, different types of loss
function, and different choices for what dataset the student model trains on. For example, the student
model could be trained on a large unlabeled dataset, on a held-out data set, or even on the original
training set.
Perhaps surprisingly, distillation has benefits even if the teacher model and the student model are
two instances of the same neural network (see section 3 for empirical evidence), as long as they are
sufficiently different (say, by having different initializations and seeing the examples in a different
order). Furthermore, the teacher model predictions are still beneficial to the student model even
before convergence. Finally, the distinction between teacher and student is unnecessary and two or
more models all distilling from each other can also be useful.
In this paper, we use codistillation to refer to distillation performed:
1. using the same architecture for all the models;
2. using the same dataset to train all the models; and
3. using the distillation loss during training before any model has fully converged.
For simplicity, we usually consider the case where all models have a distillation term in their loss
function, but the key characteristic of codistillation is the simultaneous training of a model and its
teacher.
Algorithm 1 presents the codistillation algorithm. The distillation loss term ψ can be the squared
error between the logits of the models, the KL divergence between the predictive distributions, or
some other measure of agreement between the model predictions. In this work we use the cross
entropy error treating the teacher predictive distribution as soft targets. In the beginning of training,
the distillation term in the loss is not very useful or may even be counterproductive, so to main-
tain model diversity longer and to avoid a complicated loss function schedule we only enable the
distillation term in the loss function once training has gotten off the ground.
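The following is a sketch of what a single codistillation update might look like under these choices, using cross entropy against the teacher's detached predictive distribution for ψ; the burn_in_steps threshold, the alpha weight, and the optimizer handling are our own assumptions, not Algorithm 1 verbatim.

import torch
import torch.nn.functional as F

def codistillation_step(model, teacher, optimizer, batch, step,
                        burn_in_steps=10000, alpha=1.0):
    inputs, labels = batch
    logits = model(inputs)
    loss = F.cross_entropy(logits, labels)
    # Only enable the distillation term once training has gotten off the ground.
    if step >= burn_in_steps:
        with torch.no_grad():  # teacher predictions are constants for this step
            teacher_probs = F.softmax(teacher(inputs), dim=-1)
        # psi: cross entropy treating the teacher distribution as soft targets.
        psi = -(teacher_probs * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        loss = loss + alpha * psi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)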
2.1 CODISTILLATION AS A DISTRIBUTED NEURAL NETWORK TRAINING ALGORITHM
In order to scale beyond the limits of distributed stochastic gradient descent we will need an algo-
rithm that is far more communication efficient. As seen in Algorithm 1, to update the parameters of
one network using codistillation one only needs the predictions of the other networks, which can be
computed locally from copies of the other networks' weights.
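The sketch below shows one way a worker could keep a rarely refreshed local copy of another network's weights and use it to compute teacher predictions locally; the refresh interval and the fetch_remote_state callable are hypothetical placeholders, not the paper's actual infrastructure.

import copy

def maybe_refresh_teacher(teacher, fetch_remote_state, step, refresh_every=50000):
    # Overwrite the stale local teacher copy only occasionally; communication
    # can be this infrequent because only predictions, not gradients, rely on it.
    if step % refresh_every == 0:
        teacher.load_state_dict(fetch_remote_state())
    return teacher

# Usage sketch: the teacher starts as a copy of the local model and is then
# refreshed from checkpoints of the other model only every refresh_every steps.
# teacher = copy.deepcopy(model)
# for step, batch in enumerate(data_loader):
#     teacher = maybe_refresh_teacher(teacher, fetch_remote_state, step)
#     codistillation_step(model, teacher, optimizer, batch, step)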
There are several reasons to believe that stale predictions might be much less of a problem than stale
gradients for training: