Sideways: Depth-Parallel Training of Video Models

Mateusz Malinowski (mateuszm@google.com), Grzegorz Świrszcz (swirszcz@google.com), João Carreira (joaoluis@google.com), Viorica Pătrăucean (viorica@google.com)
DeepMind, London, U.K.

Abstract

We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input streams such as videos to develop a more efficient training scheme? Here, we explore an alternative to backpropagation; we overwrite network activations whenever new ones, i.e., from new frames, become available. Such a more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading to theoretically more noisy weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also potentially exhibit better generalization compared to standard synchronized backpropagation.

1. Introduction

The key ingredient of deep learning is stochastic gradient descent (SGD) [7, 42, 53], which has many variants, including SGD with Momentum [47], Adam [26], and Adagrad [14]. SGD approximates gradients using mini-batches sampled from full datasets. Efficiency considerations primarily motivated its development: many datasets do not fit in memory, and computing full gradients over them takes far longer than performing mini-batch steps [7, 16, 53]. However, SGD is not only more efficient but also produces better models. E.g., giant-sized models trained using SGD are naturally regularized and may generalize better [18, 43], and local minima do not seem to be a problem [11]. Explaining these phenomena is still an open theoretical problem, but it is clear that SGD is doing more than merely optimizing a given loss function [52].

Figure 1 (three panels at t, t+15, t+30): Three frames of a fish swimming, sampled 15 frames apart, or about every half a second. Note how little variation there is in the patch within the red square. Can we leverage such redundancies and the smoothness in local neighborhoods of this type of data for more efficient training? Our results suggest we can, and there could be generalization benefits in doing so.

In this paper, we propose a further departure from gradient descent, also motivated by efficiency considerations, for training models that operate on sequences of video frames. Gradients of neural networks are computed using the backpropagation (BP) algorithm. However, BP operates in a synchronized, blocking fashion: first, activations for a mini-batch are computed and stored during the forward pass, and next, these activations are re-used to compute Jacobian matrices in the backward pass. Such blocking means that the two passes must be done sequentially, which leads to high latency and low throughput. This is particularly sub-optimal when parallel processing resources are available, and especially prominent when we cannot parallelize across batch or temporal dimensions, e.g., in online learning or with causal models.

The central hypothesis studied in this paper is whether we can backpropagate gradients based on activations from different timesteps, hence removing the locking between the layers. Intuitively, one reason this may work is that high frame rate videos are temporally smooth, leading to similar representations of neighboring frames, as illustrated in Figure 1.

We experiment with two types of tasks that have different requirements in terms of latency: per-sequence action recognition and per-frame autoencoding. In both cases, our models do not use any per-frame blocking during the forward or backward passes. We call the resulting gradient update procedure Sideways, owing to the shape of the data flow, shown in Figure 2.

In experiments on action recognition, on UCF101 [46] and HMDB51 [29], we have found that training with Sideways not only does not diverge but often leads to improved performance over BP models, providing a surprising regularization effect. Such training dynamics open a new line of inquiry into the true nature of the success of SGD, as they show that a precise alignment between activations and gradients is also not critical. Additionally, we show that Sideways provides a nearly linear speedup in training with depth parallelism on multiple GPUs compared to a BP model using the same resources. We believe that this result also opens up possibilities for training models at higher frame rates in online settings, e.g., where parallelization across mini-batches is not an option.

We use the per-frame autoencoding task to investigate the effect of the blocking mechanism of BP models in tasks where the input stream cannot be buffered or where immediate responses are required. This is particularly problematic for BP if the input stream evolves quickly, i.e., the input changes faster than the model can process each per-step input. In this case, the blocking mechanism of BP results in discarding the new inputs received while the model is blocked processing the previous input. This is considerably less problematic in Sideways due to its lock-free mechanism. We run experiments on synthetically generated videos from the CATER dataset [15], where we observe that Sideways outperforms the BP baseline.

2. Related Work

Our work connects with different strands of research around backpropagation, parallelization, and video modelling. We list here a few of the most relevant examples.

Alternatives to backpropagation. Prior work has shown that various modifications of the 'mathematically correct' backpropagation can actually lead to satisfactory training. For instance, some relaxations of backpropagation implemented with a fixed random matrix yield a surprisingly good performance on MNIST [31]. There is also a recent growing interest in building more biologically-plausible or model-parallel approaches to train networks. This includes Feedback Alignment [31], Direct Feedback Alignment [37], Target Propagation [5], Kickback [2], Online AM [10], Features Replay [21], Decoupled Features Replay [3], and Synthetic Gradients [23], where various decouplings between the forward and backward passes are proposed. A good comparative overview of these frameworks is presented in [12]. Another recent innovative idea is to meta-learn local rules for gradient updates [34], or to use either self-supervised techniques [39] or local losses to perform gradient-isolated updates locally [32, 38]. Asynchronous distributed SGD approaches like Hogwild [41] also do not strictly fit into clean backprop, as they allow multiple workers to partially overwrite each other's weight updates, but provide some theoretical guarantees as long as these overwrites are sparse.
However, most of these prior works are applied to visually simpler domains, some require buffering activations over many training steps, or investigate local communication only. In contrast, here we take advantage of the smoothness of temporal data. Moreover, we investigate a global, top-down, and yet asynchronous communication between the layers of a neural network during its training, without buffering activations over longer periods and without auxiliary networks or losses. This view is consistent with some mathematical models of cortex [6, 28, 30, 48]. We also address forward and backward locking for temporal models. Finally, most of the works above can potentially be used together with our Sideways training, which we leave as a possible future direction.

Large models. Parallelism has grown in importance due to the success of gigantic neural networks with billions of parameters [49], potentially having high-resolution inputs [40], that cannot fit into individual GPUs. Approaches such as GPipe [20] or DDG [22] show that efficient pipelining strategies can be used to decouple the forward and backward passes by buffering activations at different layers, which then enables the parallel execution of different layers of the network. Similarly, multiple modules of the network can be processed simultaneously on activations belonging to different mini-batches [22]. Such pipelining reduces the training time for image models, but at the cost of an increased memory footprint.

Efficient video processing. Conditional computation [4] or hard-attention approaches [33, 35] can increase efficiency when dealing with large data streams. These are, however, generic approaches that do not exploit the temporal smoothness of sequential data such as video clips [50]. For video, sampling key frames has been shown to be a quite powerful mechanism for classification [27, 51], but may not be appropriate if a more detailed temporal representation of the input sequence is needed [15]. Recently, a deep decoupled video model [8] has been proposed that achieves high throughput and speed at inference time, while preserving the accuracy of sequential models. However, [8] uses regular backprop, and hence does not benefit fully from parallelization, i.e., backprop still blocks the computations and requires buffering activations during the forward pass. In this paper, we build upon [8], which uses parallel inference, but go further and make both inference and learning depth-parallel. Note that, if we only consider inference, Sideways reduces to [8].

3. Sideways

In this section, we define the formulation of our problem and formalize both algorithms: BP and Sideways.

3.1. Notation and Definitions

We consider the following general setting:

• a finite input time-series $x = (x_t)_{t=1}^{K}$, $x_t \in \mathbb{R}^d$, e.g., a video clip with $d = \text{height} \times \text{width} \times 3$,
• a finite output time-series $y = (y_t)_{t=1}^{K}$, $y_t \in \mathbb{R}^{d_y}$, e.g., an action label; in the action recognition task, in our work, we use the same label over the whole video clip, i.e., $y_t = y_{t+1}$ for all $t$,
• a frame-based neural network $M_\theta : \mathbb{R}^d \rightarrow \mathbb{R}^{d_y}$ that transforms the input signal $x_t$ into logits $h_D^t = M_\theta(x_t)$, and is defined by a composition of modules $M_\theta(x_t) = H_D(\cdot, \theta_D) \circ H_{D-1}(\cdot, \theta_{D-1}) \circ \ldots \circ H_1(x_t, \theta_1)$, where:
  – each module, or layer, $H_i(\cdot, \cdot)$ is a function $H_i : \mathbb{R}^{d_{i-1}} \times \mathbb{R}^{p_i} \rightarrow \mathbb{R}^{d_i}$, $i = 1, \ldots, D$,
  – $\theta_i \in \mathbb{R}^{p_i}$, $i = 1, \ldots, D$ are the (trainable) parameters, and we use $\theta$ for all the parameters,
  – $\circ$ is composition, i.e., $G \circ F(x) = G(F(x))$,
• and a loss function $L : \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \rightarrow \mathbb{R}$, e.g., $L(h, y) = \|h - y\|^2$, or $L(h, y) = -\sum_i p((h)_i) \log q(y_i)$.

We extend the notation above to $h_i^t = H_i(\cdot, \theta_i) \circ H_{i-1}(\cdot, \theta_{i-1}) \circ \ldots \circ H_1(x_t, \theta_1)$.

To avoid the common confusion coming from using the same letters to denote both the formal arguments of a function and the actual values of the variables, we use bold font for the latter, e.g., $x$ to denote a formal argument and $\mathbf{x}$ for its actual value. We also use the following notation for the derivatives of the functions $H_i$. Let $J_h H(\mathbf{h}, \boldsymbol{\theta}) = \frac{\partial H(h, \theta)}{\partial h}\big|_{h=\mathbf{h},\, \theta=\boldsymbol{\theta}}$ be the Jacobian matrix of $H(h, \theta)$ with respect to the variable $h$, evaluated at $h = \mathbf{h}$, $\theta = \boldsymbol{\theta}$. Similarly, let $J_\theta H(\mathbf{h}, \boldsymbol{\theta}) = \frac{\partial H(h, \theta)}{\partial \theta}\big|_{h=\mathbf{h},\, \theta=\boldsymbol{\theta}}$ denote the Jacobian matrix of $H(h, \theta)$ with respect to the variable $\theta$. We will use the same notation for the gradient $\nabla$.

Finally, to train neural networks, we base our computations on the empirical risk minimization framework, i.e., $R(M_\theta) = \mathbb{E}_{x,y}[L(M_\theta(x), y)] \approx \sum_{x,y \sim \mathcal{D}} \frac{1}{K} \sum_{t=1}^{K} L(h_D^t, y_t)$, where $\mathcal{D}$ is a training set.
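To make the setting concrete, the following minimal sketch (plain NumPy; the affine-plus-ReLU modules, the toy dimensions, and names such as `forward_all` are illustrative assumptions, not the paper's actual architecture) represents $M_\theta$ as a list of $D$ per-frame modules $H_i$ and records the intermediate activations $h_1^t, \ldots, h_D^t$ for each frame of a clip.

```python
import numpy as np

class Module:
    """One layer H_i(h, theta_i); here a toy affine map followed by ReLU."""
    def __init__(self, d_in, d_out, rng):
        self.W = rng.standard_normal((d_in, d_out)) * 0.01   # part of theta_i
        self.b = np.zeros(d_out)                             # part of theta_i

    def forward(self, h):
        return np.maximum(h @ self.W + self.b, 0.0)

def forward_all(modules, x_t):
    """Return [h_1^t, ..., h_D^t] for a single frame x_t; the last entry is M_theta(x_t)."""
    activations, h = [], x_t
    for module in modules:                     # M_theta = H_D o ... o H_1
        h = module.forward(h)
        activations.append(h)
    return activations

rng = np.random.default_rng(0)
d, d_y, D, K = 32, 10, 4, 8                    # toy sizes: input dim, logits dim, depth, clip length
dims = [d, 64, 64, 64, d_y]
modules = [Module(dims[i], dims[i + 1], rng) for i in range(D)]
clip = rng.standard_normal((K, d))             # frames x_1, ..., x_K
per_frame_activations = [forward_all(modules, x_t) for x_t in clip]
```

Both training schemes discussed below consume exactly these per-layer activations; they differ only in which frame's activation each gradient is paired with.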
3.2. Update Cycle

For simplicity, we assume in our modelling a constant time for a layer (or some set of layers organized into a module) to fully process its inputs, in either the forward or the backward pass, and call this a computation step. We define the computation cycle as the sequence of computation steps in which a given data frame is used to update all the layers, and the cycle length as the number of computation steps in the computation cycle. Hence, the cycle length depends only on the depth of the network $D$ and is equal to $2D - 1$ computation steps. Figure 2 illustrates a single computation cycle with nine computation steps for both models.

3.3. The BP algorithm ('regular' backpropagation)

The BP algorithm refers to regular training of neural networks. Here, due to the synchronization between the passes, computations are blocked each time a data frame is processed. This is illustrated in Figure 2 (left). Whenever the first frame is processed, here indicated by the blue square, the computations are blocked in both the forward and backward passes over the whole computation cycle.

With our notation, the standard backpropagation formula becomes

$\nabla^t_{\theta_i} L = \nabla_{\theta_i} L(M_\theta(x_t), y_t)\big|_{\theta=\boldsymbol{\theta}} = \nabla_{h_D} L(h_D^t, y_t) \cdot J_{h_{D-1}} H_D(h_{D-1}^t, \theta_D) \cdot J_{h_{D-2}} H_{D-1}(h_{D-2}^t, \theta_{D-1}) \cdot \ldots \cdot J_{h_i} H_{i+1}(h_i^t, \theta_{i+1}) \cdot J_{\theta_i} H_i(h_{i-1}^t, \theta_i)$

with the update rule $\theta_i := \theta_i - \alpha \frac{1}{K} \sum_{t=1}^{K} \nabla^t_{\theta_i} L$, where $\alpha$ is the learning rate and $K$ is the length of the input sequence. We can compactly describe the algorithm above with the following recursive rules

$\nabla^t_{\theta_i} L = \nabla^t_{h_i} L \cdot J_{\theta_i} H_i(h_{i-1}^t, \theta_i)$    (1)

$\nabla^t_{h_{i-1}} L = \nabla^t_{h_i} L \cdot J_{h_{i-1}} H_i(h_{i-1}^t, \theta_i)$    (2)

where $h_0^t = x_t$. However, note that in standard implementations, Jacobian matrices are not computed explicitly; instead, efficient vector-matrix multiplications are used to backpropagate errors from the loss layer towards the input [1].
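As a reference point, here is a minimal sketch of this synchronized per-frame BP update, following the recursive rules (1)-(2). It is a hedged illustration only: plain NumPy with toy affine-plus-ReLU modules and a squared-error loss shared by the whole clip, not the paper's actual architecture or optimizer. Note how every layer's forward activation must be buffered until the backward pass for the same frame has consumed it; this is exactly the blocking that Sideways removes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, alpha = 4, 8, 1e-2
dims = [32, 64, 64, 64, 10]                        # d_0 = d, ..., d_D = d_y (toy sizes)
params = [(rng.standard_normal((dims[i], dims[i + 1])) * 0.01, np.zeros(dims[i + 1]))
          for i in range(D)]                       # theta_i = (W_i, b_i)

def layer_fwd(h, theta):
    """H_i(h, theta_i): affine map + ReLU; also return the cached pre-activation."""
    W, b = theta
    pre = h @ W + b
    return np.maximum(pre, 0.0), pre

def layer_bwd(g_out, h_in, pre, theta):
    """Vector-Jacobian products of eqs. (1)-(2): grads w.r.t. theta_i and h_{i-1}."""
    W, _ = theta
    g_pre = g_out * (pre > 0)                      # backprop through ReLU
    return (np.outer(h_in, g_pre), g_pre), g_pre @ W.T

clip = rng.standard_normal((K, dims[0]))           # frames x_1, ..., x_K
target = rng.standard_normal(dims[-1])             # one target shared by the whole clip

grads = [(np.zeros_like(W), np.zeros_like(b)) for W, b in params]
for x_t in clip:
    # Forward pass: store h_0^t, ..., h_D^t (and pre-activations) -- the blocking buffer.
    acts, pres, h = [x_t], [], x_t
    for theta in params:
        h, pre = layer_fwd(h, theta)
        acts.append(h)
        pres.append(pre)
    # Backward pass: reuse the *same frame's* stored activations, eqs. (1)-(2).
    g = 2.0 * (acts[-1] - target)                  # grad of L(h, y) = ||h - y||^2 w.r.t. h_D^t
    for i in reversed(range(D)):
        (gW, gb), g = layer_bwd(g, acts[i], pres[i], params[i])
        grads[i] = (grads[i][0] + gW / K, grads[i][1] + gb / K)

# Synchronized update: theta_i := theta_i - alpha * (1/K) * sum_t grad_t
params = [(W - alpha * gW, b - alpha * gb) for (W, b), (gW, gb) in zip(params, grads)]
```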
3.4. Sideways algorithm

We aim at pipelining computations over the whole computation cycle, during both training and inference. Sideways removes synchronization by continuously processing information, either in the forward or the backward pass. This is illustrated in Figure 2 (right). Once a data frame is available, it is immediately processed and sent to the next layer, 'freeing' the current layer so it can process the next data frame. Hence, in the first computation step of the computation cycle, a data frame $x_t$ is processed by the first Sideways module, freeing resources and 'sending' $h_1^t$ to the second Sideways module at computation step $t+1$. At computation step $t+1$, the first module can now take the next data frame $x_{t+1}$ for processing, and, simultaneously, the second module processes $h_1^t$; this step results in two representations, $h_2^t$ and $h_1^{t+1}$. Please note that our notation $h_2^t$ does not indicate the current computation step but instead that the representation originated from $x_t$. We continue the same process further during training. This is illustrated in Figure 2, where we use color-encoding to track where the information being processed has originated from. Dotted arrows represent the forward pass.

For simplicity, we assume that the computation of the loss takes no time and does not require an extra computation cycle. In this setting, the activation arriving at the module computing the loss at timestep $t$ is $h_D^{t-D+1}$, an activation spawned by the frame $x_{t-D+1}$. Once this final representation $h_D^{t-D+1}$ is computed at computation step $t$, we calculate its 'correct' gradient $\nabla^t_{h_D} L(h_D^{t-D+1}, y_t)$, and we backpropagate this information down towards the lower layers of the neural network. This computational process is illustrated in Figure 2.
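To make the data flow concrete, the following single-process simulation of the Sideways schedule is a hedged sketch only: plain NumPy, toy affine-plus-ReLU modules, a squared-error loss with one target per clip, and immediate per-frame parameter updates, none of which is claimed to match the paper's exact architecture or optimizer. At every computation step, each module consumes the activation emitted by its predecessor at the previous step and, simultaneously, the gradient message emitted by its successor at the previous step; stored activations are overwritten whenever a newer frame arrives, so a gradient is generally paired with an activation originating from a later frame than the one that produced the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, alpha = 4, 32, 1e-2
dims = [32, 64, 64, 64, 10]                          # toy layer widths
params = [(rng.standard_normal((dims[i], dims[i + 1])) * 0.01, np.zeros(dims[i + 1]))
          for i in range(D)]                         # theta_i = (W_i, b_i)

clip = rng.standard_normal((K, dims[0]))             # frames x_1, ..., x_K
target = rng.standard_normal(dims[-1])               # one target shared by the clip

in_act = [None] * D                                  # latest input activation seen by module i
pre_act = [None] * D                                 # its cached pre-activation (also overwritten)
fwd_msg = [None] * (D + 1)                           # fwd_msg[i]: activation sent from module i to i+1
bwd_msg = [None] * (D + 1)                           # bwd_msg[i]: gradient sent from module i+1 to i

for step in range(K + 2 * D):                        # a few extra steps to drain the pipeline
    fwd_msg[0] = clip[step] if step < K else None    # frame x_t enters the first module at step t
    new_fwd, new_bwd = [None] * (D + 1), [None] * (D + 1)

    for i in range(D):                               # conceptually, all modules run in parallel
        W, b = params[i]
        # Forward: consume whatever arrived from below last step, overwriting the stored
        # activation even if an older frame's gradient has not come back down yet.
        if fwd_msg[i] is not None:
            in_act[i] = fwd_msg[i]
            pre_act[i] = in_act[i] @ W + b
            new_fwd[i + 1] = np.maximum(pre_act[i], 0.0)
        # Backward: consume whatever gradient arrived from above last step, pairing it
        # with the *latest* stored activation, which may originate from a newer frame.
        if bwd_msg[i + 1] is not None and in_act[i] is not None:
            g_pre = bwd_msg[i + 1] * (pre_act[i] > 0)
            new_bwd[i] = g_pre @ W.T
            params[i] = (W - alpha * np.outer(in_act[i], g_pre), b - alpha * g_pre)

    # Loss module (assumed to take no time): emit a gradient whenever a new h_D arrives.
    if new_fwd[D] is not None:
        new_bwd[D] = 2.0 * (new_fwd[D] - target)     # grad of ||h_D - y||^2 w.r.t. h_D

    fwd_msg, bwd_msg = new_fwd, new_bwd
```

In a real depth-parallel implementation each module would run on its own device and the forward/backward messages would be exchanged between devices once per computation step; the sequential loop above only emulates that schedule.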