Slow Learners are Fast
John Langford, Alexander J. Smola, Martin Zinkevich
Machine Learning, Yahoo! Labs and Australian National University
4401 Great America Pky, Santa Clara, 95051 CA
{jl, maz, smola}@yahoo-inc.com
Abstract
Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design, which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning.
1 Introduction
Online learning has become the paradigm of choice for tackling very large scale estimation problems. The convergence properties are well understood and have been analyzed in a number of different frameworks, such as asymptotics [12], game theory [8], or stochastic programming [13]. Moreover, learning-theory guarantees show that O(1) passes over a dataset suffice to obtain optimal estimates [3, 2]. This suggests that online algorithms are an excellent tool for learning.
This view, however, is slightly deceptive for several reasons: current online algorithms process one instance at a time. That is, they receive the instance, make some prediction, incur a loss, and update an associated parameter. In other words, the algorithms are entirely sequential in nature. While this is acceptable on single-core processors, it is highly undesirable given that the number of processing elements available to an algorithm is growing exponentially (e.g. modern desktop machines have up to 8 cores, graphics cards up to 1024 cores). It is therefore very wasteful if only one of these cores is actually used for estimation.
A second problem arises from the fact that network and disk I/O have not been able to keep up with the increase in processor speed. A typical network interface has a throughput of 100MB/s and disk arrays have comparable parameters. This means that current algorithms reach their limit at problems of size 1TB whenever the algorithm is I/O bound (this amounts to a training time of 3 hours), or even smaller problems whenever the model parametrization makes the algorithm CPU bound.
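For concreteness, the 3-hour figure follows from a back-of-the-envelope calculation (our arithmetic, assuming 1TB of data is read once at a sustained 100MB/s):
\[
\frac{10^{12}\ \text{bytes}}{10^{8}\ \text{bytes/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{h} \approx 3\ \text{hours}.
\]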
Finally, distributed and cloud computing are unsuitable for today's online learning algorithms. This creates a pressing need to design algorithms which break the sequential bottleneck. We propose two variants. To our knowledge, this is the first paper which provides theoretical guarantees combined with empirical evidence for such an algorithm. Previous work, e.g. [6], proved rather inconclusive in terms of both theoretical and empirical guarantees.
In a nutshell, we propose the following two variants: several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector which is updated asynchronously. This allows us to accelerate computationally intensive problems whenever gradient computations are relatively expensive. A second variant assumes that we have linear function classes where parts of the function can be computed independently on several cores. Subsequently the results are combined and the combination is then used for a descent step.
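As a rough illustration of the first variant (a minimal sketch, not the authors' implementation; the least-squares problem, step size, and thread layout are placeholder choices of ours), several worker threads can share one parameter vector and apply stochastic gradient steps to it without synchronizing reads and writes, so each gradient may be computed from slightly stale parameters:

import threading

import numpy as np

rng = np.random.default_rng(0)

# Placeholder problem: least-squares regression with y = X w* + noise.
d, n = 10, 20000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)   # shared parameter vector
eta = 0.01        # constant step size (placeholder)

def worker(rows):
    """Run SGD on a subset of examples against the shared vector w."""
    global w
    for i in rows:
        w_stale = w.copy()                      # read possibly stale parameters
        grad = (X[i] @ w_stale - y[i]) * X[i]   # gradient of 0.5*(x_i . w - y_i)^2
        w = w - eta * grad                      # asynchronous write-back, no lock

threads = [threading.Thread(target=worker, args=(range(k, n, 4),)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to w*:", np.linalg.norm(w - w_true))

Because the read and the write-back are not atomic, a worker's gradient may be applied several updates after the parameters it was computed from; this interleaving is exactly the delay that the analysis has to handle.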
A common feature of both algorithms is that the update occurs with some delay: in the first case, other cores may have updated the parameter vector in the meantime; in the second case, other cores may have already computed parts of the function for subsequent examples before an update occurs.
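Schematically, and in our own notation rather than a formula taken from the paper, both settings can be viewed as stochastic gradient descent whose gradient is \(\tau\) steps stale:
\[
w_{t+1} = w_t - \eta_t \, \partial_w \ell(x_{t-\tau}, y_{t-\tau}, w_{t-\tau}),
\]
so the step applied at time \(t\) uses an example and parameters from \(\tau\) rounds earlier, and \(\tau = 0\) recovers standard sequential online learning.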