Sideways: Depth-Parallel Training of Video Models

Mateusz Malinowski (mateuszm@google.com), Grzegorz Świrszcz (swirszcz@google.com), João Carreira (joaoluis@google.com), Viorica Pătrăucean (viorica@google.com)
DeepMind, London, U.K.

Abstract

We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input streams such as videos to develop a more efficient training scheme? Here, we explore an alternative to backpropagation; we overwrite network activations whenever new ones, i.e., from new frames, become available. Such a more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading to theoretically more noisy weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also potentially exhibit better generalization compared to standard synchronized backpropagation.

1. Introduction

The key ingredient of deep learning is stochastic gradient descent (SGD) [7, 42, 53], which has many variants, including SGD with Momentum [47], Adam [26], and Adagrad [14]. SGD approximates gradients using mini-batches sampled from full datasets. Efficiency considerations primarily motivated its development: many datasets do not fit in memory, and computing full gradients over them takes far longer than performing mini-batch steps [7, 16, 53]. However, SGD is not only more efficient but also produces better models. E.g., giant-sized models trained using SGD are naturally regularized and may generalize better [18, 43], and local minima do not seem to be a problem [11]. Explaining these phenomena is still an open theoretical problem, but it is clear that SGD is doing more than merely optimizing a given loss function [52].

Figure 1 (three panels at t, t+15, t+30): Three frames of a fish swimming, sampled 15 frames apart, or about every half a second. Note how little variation there is in the patch within the red square. Can we leverage such redundancies and the smoothness in local neighborhoods of this type of data for more efficient training? Our results suggest we can, and there could be generalization benefits in doing so.

In this paper, we propose a further departure from gradient descent, also motivated by efficiency considerations, for training models that operate on sequences of video frames. Gradients of neural networks are computed using the backpropagation (BP) algorithm. However, BP operates in a synchronized, blocking fashion: first, activations for a mini-batch are computed and stored during the forward pass, and next, these activations are re-used to compute Jacobian matrices in the backward pass. Such blocking means that the two passes must be done sequentially, which leads to high latency and low throughput. This is particularly sub-optimal when parallel processing resources are available, and especially prominent when we cannot parallelize across batch or temporal dimensions, e.g., in online learning or with causal models.

The central hypothesis studied in this paper is whether we can backpropagate gradients based on activations from different timesteps, hence removing the locking between the layers. Intuitively, one reason this may work is that high frame rate videos are temporally smooth, leading to similar representations of neighboring frames, as illustrated in Figure 1.

We experiment with two types of tasks that have different requirements in terms of latency: per-sequence action recognition and per-frame autoencoding. In both cases, our models do not use any per-frame blocking during the forward or backward passes. We call the resulting gradient update procedure Sideways, owing to the shape of the data flow, shown in Figure 2.

In experiments on action recognition, on UCF101 [46] and HMDB51 [29], we have found that training with Sideways not only does not diverge but often leads to improved performance over BP models, providing a surprising regularization effect. Such training dynamics open a new line of inquiry into the true nature of the success of SGD, as they show that a precise alignment between activations and gradients is also not critical. Additionally, we show that Sideways provides a nearly linear speedup in training with depth parallelism on multiple GPUs compared to a BP model using the same resources. We believe that this result also opens up possibilities for training models at higher frame rates in online settings, e.g., where parallelization across mini-batches is not an option.

We use the per-frame autoencoding task to investigate the effect of the blocking mechanism of BP models in tasks where the input stream cannot be buffered or where immediate responses are required. This is particularly problematic for BP if the input stream evolves quickly, i.e., the input changes faster than the model can process each per-step input. In this case, the blocking mechanism of BP results in discarding the new inputs received while the model is blocked processing the previous input. This is considerably less problematic in Sideways due to its lock-free mechanism. We run experiments on synthetically generated videos from the CATER dataset [15], where we observe that Sideways outperforms the BP baseline.

2. Related Work

Our work connects with different strands of research around backpropagation, parallelization, and video modelling. We list here a few of the most relevant examples.

Alternatives to backpropagation. Prior work has shown that various modifications of the 'mathematically correct' backpropagation can actually lead to satisfactory training. For instance, some relaxations of backpropagation implemented with a fixed random matrix yield a surprisingly good performance on MNIST [31]. There is also a recent growing interest in building more biologically-plausible or model-parallel approaches to train networks. This includes Feedback Alignment [31], Direct Feedback Alignment [37], Target Propagation [5], Kickback [2], Online AM [10], Features Replay [21], Decoupled Features Replay [3], and Synthetic Gradients [23], where various decouplings between the forward and backward passes are proposed. A good comparative overview of these frameworks is presented in [12]. Another recent innovative idea is to meta-learn local rules for gradient updates [34], or to use either self-supervised techniques [39] or local losses to perform gradient-isolated updates locally [32, 38]. Asynchronous distributed SGD approaches like Hogwild [41] also do not strictly fit into clean backprop, as they allow multiple workers to partially overwrite each other's weight updates, but provide some theoretical guarantees as long as these overwrites are sparse.
However, most of these prior works are applied to visually simpler domains, some require buffering activations over many training steps, or investigate local communication only. In contrast, here we take advantage of the smoothness of temporal data. Moreover, we investigate a global, top-down, and yet asynchronous communication between the layers of a neural network during its training, without buffering activations over longer periods and without auxiliary networks or losses. This view is consistent with some mathematical models of cortex [6, 28, 30, 48]. We also address forward and backward locking for temporal models. Finally, most of the works above can potentially be used together with our Sideways training, which we leave as a possible future direction.

Large models. Parallelism has grown in importance due to the success of gigantic neural networks with billions of parameters [49], potentially having high-resolution inputs [40], that cannot fit into individual GPUs. Approaches such as GPipe [20] or DDG [22] show that efficient pipelining strategies can be used to decouple the forward and backward passes by buffering activations at different layers, which then enables the parallel execution of different layers of the network. Similarly, multiple modules of the network can be processed simultaneously on activations belonging to different mini-batches [22]. Such pipelining reduces the training time for image models, but at the cost of an increased memory footprint.

Efficient video processing. Conditional computation [4] or hard-attention approaches [33, 35] can increase efficiency when dealing with large data streams. These are, however, generic approaches that do not exploit the temporal smoothness of sequential data such as video clips [50]. For video, sampling key frames has been shown to be a quite powerful mechanism for classification [27, 51], but may not be appropriate if a more detailed temporal representation of the input sequence is needed [15]. Recently, a deep decoupled video model [8] has been proposed that achieves high throughput and speed at inference time, while preserving the accuracy of sequential models. However, [8] uses regular backprop, and hence does not benefit fully from parallelization, i.e., backprop still blocks the computations and requires buffering activations during the forward pass. In this paper, we build upon [8], which uses parallel inference, but go further and make both inference and learning depth-parallel. Note that, if we only consider inference, Sideways reduces to [8].

3. Sideways

In this section, we define the formulation of our problem and formalize both algorithms: BP and Sideways.

3.1. Notation and Definitions

We consider the following general setting:

• a finite input time-series $x = (x_t)_{t=1}^{K}$, $x_t \in \mathbb{R}^d$, e.g., a video clip with $d = \text{height} \times \text{width} \times 3$,
• a finite output time-series $y = (y_t)_{t=1}^{K}$, $y_t \in \mathbb{R}^{d_y}$, e.g., an action label; in the action recognition task, in our work, we use the same label over the whole video clip, i.e., $y_t = y_{t+1}$ for all $t$,
• a frame-based neural network $M_\theta : \mathbb{R}^d \rightarrow \mathbb{R}^{d_y}$ that transforms the input signal $x_t$ into logits $h_D^t = M_\theta(x_t)$, and is defined by a composition of modules $M_\theta(x_t) = H_D(\cdot, \theta_D) \circ H_{D-1}(\cdot, \theta_{D-1}) \circ \ldots \circ H_1(x_t, \theta_1)$, where:
  – each module, or layer, $H_i(\cdot, \cdot)$ is a function $H_i : \mathbb{R}^{d_{i-1}} \times \mathbb{R}^{p_i} \rightarrow \mathbb{R}^{d_i}$, $i = 1, \ldots, D$,
  – $\theta_i \in \mathbb{R}^{p_i}$, $i = 1, \ldots, D$ are the (trainable) parameters, and we use $\theta$ for all the parameters,
  – $\circ$ is composition, i.e., $G \circ F(x) = G(F(x))$,
• and a loss function $L : \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \rightarrow \mathbb{R}$, e.g., $L(h, y) = \|h - y\|^2$, or $L(h, y) = -\sum_i p((h)_i) \log q(y_i)$.

We extend the notation above to $h_i^t = H_i(\cdot, \theta_i) \circ H_{i-1}(\cdot, \theta_{i-1}) \circ \ldots \circ H_1(x_t, \theta_1)$.

To avoid the common confusion coming from using the same letters to denote both the formal arguments of a function and the actual values of the variables, we use bold font for the latter, e.g., $x$ to denote a formal argument and $\mathbf{x}$ for its actual value. We also use the following notation for the derivatives of the functions $H_i$. Let $J_h H(\mathbf{h}, \boldsymbol{\theta}) = \frac{\partial H(h, \theta)}{\partial h}\big|_{h=\mathbf{h},\, \theta=\boldsymbol{\theta}}$ be the Jacobian matrix of $H(h, \theta)$ with respect to the variable $h$, evaluated at $h = \mathbf{h}$, $\theta = \boldsymbol{\theta}$. Similarly, let $J_\theta H(\mathbf{h}, \boldsymbol{\theta}) = \frac{\partial H(h, \theta)}{\partial \theta}\big|_{h=\mathbf{h},\, \theta=\boldsymbol{\theta}}$ denote the Jacobian matrix of $H(h, \theta)$ with respect to the variable $\theta$. We will use the same notation for the gradient $\nabla$.

Finally, to train neural networks, we base our computations on the empirical risk minimization framework, i.e., $R(M_\theta) = \mathbb{E}_{x,y}[L(M_\theta(x), y)] \approx \sum_{x,y \sim \mathcal{D}} \frac{1}{K} \sum_{t=1}^{K} L(h_D^t, y_t)$, where $\mathcal{D}$ is a training set.
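To make the setting concrete, the following minimal sketch (plain NumPy; the affine-plus-ReLU modules, the toy dimensions, and names such as `forward_all` are illustrative assumptions, not the paper's actual architecture) represents $M_\theta$ as a list of $D$ per-frame modules $H_i$ and records the intermediate activations $h_1^t, \ldots, h_D^t$ for each frame of a clip.

```python
import numpy as np

class Module:
    """One layer H_i(h, theta_i); here a toy affine map followed by ReLU."""
    def __init__(self, d_in, d_out, rng):
        self.W = rng.standard_normal((d_in, d_out)) * 0.01   # part of theta_i
        self.b = np.zeros(d_out)                             # part of theta_i

    def forward(self, h):
        return np.maximum(h @ self.W + self.b, 0.0)

def forward_all(modules, x_t):
    """Return [h_1^t, ..., h_D^t] for a single frame x_t; the last entry is M_theta(x_t)."""
    activations, h = [], x_t
    for module in modules:                     # M_theta = H_D o ... o H_1
        h = module.forward(h)
        activations.append(h)
    return activations

rng = np.random.default_rng(0)
d, d_y, D, K = 32, 10, 4, 8                    # toy sizes: input dim, logits dim, depth, clip length
dims = [d, 64, 64, 64, d_y]
modules = [Module(dims[i], dims[i + 1], rng) for i in range(D)]
clip = rng.standard_normal((K, d))             # frames x_1, ..., x_K
per_frame_activations = [forward_all(modules, x_t) for x_t in clip]
```

Both training schemes discussed below consume exactly these per-layer activations; they differ only in which frame's activation each gradient is paired with.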
3.2. Update Cycle

For simplicity, we assume in our modelling a constant time for a layer (or some set of layers organized into a module) to fully process its inputs, in either the forward or the backward pass, and call this a computation step. We define the computation cycle as the sequence of computation steps in which a given data frame is used to update all the layers, and the cycle length as the number of computation steps in the computation cycle. Hence, the cycle length depends only on the depth of the network $D$ and is equal to $2D - 1$ computation steps. Figure 2 illustrates a single computation cycle with nine computation steps for both models.

3.3. The BP algorithm ('regular' backpropagation)

The BP algorithm refers to regular training of neural networks. Here, due to the synchronization between the passes, computations are blocked each time a data frame is processed. This is illustrated in Figure 2 (left). Whenever the first frame is processed, here indicated by the blue square, the computations are blocked in both the forward and backward passes over the whole computation cycle.

With our notation, the standard backpropagation formula becomes

$\nabla^t_{\theta_i} L = \nabla_{\theta_i} L(M_\theta(x_t), y_t)\big|_{\theta=\boldsymbol{\theta}} = \nabla_{h_D} L(h_D^t, y_t) \cdot J_{h_{D-1}} H_D(h_{D-1}^t, \theta_D) \cdot J_{h_{D-2}} H_{D-1}(h_{D-2}^t, \theta_{D-1}) \cdot \ldots \cdot J_{h_i} H_{i+1}(h_i^t, \theta_{i+1}) \cdot J_{\theta_i} H_i(h_{i-1}^t, \theta_i)$

with the update rule $\theta_i := \theta_i - \alpha \frac{1}{K} \sum_{t=1}^{K} \nabla^t_{\theta_i} L$, where $\alpha$ is the learning rate and $K$ is the length of the input sequence. We can compactly describe the algorithm above with the following recursive rules

$\nabla^t_{\theta_i} L = \nabla^t_{h_i} L \cdot J_{\theta_i} H_i(h_{i-1}^t, \theta_i)$    (1)

$\nabla^t_{h_{i-1}} L = \nabla^t_{h_i} L \cdot J_{h_{i-1}} H_i(h_{i-1}^t, \theta_i)$    (2)

where $h_0^t = x_t$. However, note that in standard implementations, Jacobian matrices are not computed explicitly; instead, efficient vector-matrix multiplications are used to backpropagate errors from the loss layer towards the input [1].
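As a reference point, here is a minimal sketch of this synchronized per-frame BP update, following the recursive rules (1)-(2). It is a hedged illustration only: plain NumPy with toy affine-plus-ReLU modules and a squared-error loss shared by the whole clip, not the paper's actual architecture or optimizer. Note how every layer's forward activation must be buffered until the backward pass for the same frame has consumed it; this is exactly the blocking that Sideways removes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, alpha = 4, 8, 1e-2
dims = [32, 64, 64, 64, 10]                        # d_0 = d, ..., d_D = d_y (toy sizes)
params = [(rng.standard_normal((dims[i], dims[i + 1])) * 0.01, np.zeros(dims[i + 1]))
          for i in range(D)]                       # theta_i = (W_i, b_i)

def layer_fwd(h, theta):
    """H_i(h, theta_i): affine map + ReLU; also return the cached pre-activation."""
    W, b = theta
    pre = h @ W + b
    return np.maximum(pre, 0.0), pre

def layer_bwd(g_out, h_in, pre, theta):
    """Vector-Jacobian products of eqs. (1)-(2): grads w.r.t. theta_i and h_{i-1}."""
    W, _ = theta
    g_pre = g_out * (pre > 0)                      # backprop through ReLU
    return (np.outer(h_in, g_pre), g_pre), g_pre @ W.T

clip = rng.standard_normal((K, dims[0]))           # frames x_1, ..., x_K
target = rng.standard_normal(dims[-1])             # one target shared by the whole clip

grads = [(np.zeros_like(W), np.zeros_like(b)) for W, b in params]
for x_t in clip:
    # Forward pass: store h_0^t, ..., h_D^t (and pre-activations) -- the blocking buffer.
    acts, pres, h = [x_t], [], x_t
    for theta in params:
        h, pre = layer_fwd(h, theta)
        acts.append(h)
        pres.append(pre)
    # Backward pass: reuse the *same frame's* stored activations, eqs. (1)-(2).
    g = 2.0 * (acts[-1] - target)                  # grad of L(h, y) = ||h - y||^2 w.r.t. h_D^t
    for i in reversed(range(D)):
        (gW, gb), g = layer_bwd(g, acts[i], pres[i], params[i])
        grads[i] = (grads[i][0] + gW / K, grads[i][1] + gb / K)

# Synchronized update: theta_i := theta_i - alpha * (1/K) * sum_t grad_t
params = [(W - alpha * gW, b - alpha * gb) for (W, b), (gW, gb) in zip(params, grads)]
```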
3.4. Sideways algorithm

We aim at pipelining computations over the whole computation cycle, during both training and inference. Sideways removes synchronization by continuously processing information, either in the forward or the backward pass. This is illustrated in Figure 2 (right). Once a data frame is available, it is immediately processed and sent to the next layer, 'freeing' the current layer so it can process the next data frame. Hence, in the first computation step of the computation cycle, a data frame $x_t$ is processed by the first Sideways module, freeing resources and 'sending' $h_1^t$ to the second Sideways module at computation step $t+1$. At computation step $t+1$, the first module can now take the next data frame $x_{t+1}$ for processing, and, simultaneously, the second module processes $h_1^t$; this step results in two representations, $h_2^t$ and $h_1^{t+1}$. Please note that our notation $h_2^t$ does not indicate the current computation step but instead that the representation originated from $x_t$. We continue the same process further during training. This is illustrated in Figure 2, where we use color-encoding to track where the information being processed has originated from. Dotted arrows represent the forward pass.

For simplicity, we assume that the computation of the loss takes no time and does not require an extra computation cycle. In this setting, the activation arriving at the module computing the loss at timestep $t$ is $h_D^{t-D+1}$, an activation spawned by the frame $x_{t-D+1}$. Once this final representation $h_D^{t-D+1}$ is computed at computation step $t$, we calculate its 'correct' gradient $\nabla^t_{h_D} L(h_D^{t-D+1}, y_t)$, and we backpropagate this information down towards the lower layers of the neural network. This computational process is illustrated in Figure 2.
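To make the data flow concrete, the following single-process simulation of the Sideways schedule is a hedged sketch only: plain NumPy, toy affine-plus-ReLU modules, a squared-error loss with one target per clip, and immediate per-frame parameter updates, none of which is claimed to match the paper's exact architecture or optimizer. At every computation step, each module consumes the activation emitted by its predecessor at the previous step and, simultaneously, the gradient message emitted by its successor at the previous step; stored activations are overwritten whenever a newer frame arrives, so a gradient is generally paired with an activation originating from a later frame than the one that produced the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, alpha = 4, 32, 1e-2
dims = [32, 64, 64, 64, 10]                          # toy layer widths
params = [(rng.standard_normal((dims[i], dims[i + 1])) * 0.01, np.zeros(dims[i + 1]))
          for i in range(D)]                         # theta_i = (W_i, b_i)

clip = rng.standard_normal((K, dims[0]))             # frames x_1, ..., x_K
target = rng.standard_normal(dims[-1])               # one target shared by the clip

in_act = [None] * D                                  # latest input activation seen by module i
pre_act = [None] * D                                 # its cached pre-activation (also overwritten)
fwd_msg = [None] * (D + 1)                           # fwd_msg[i]: activation sent from module i to i+1
bwd_msg = [None] * (D + 1)                           # bwd_msg[i]: gradient sent from module i+1 to i

for step in range(K + 2 * D):                        # a few extra steps to drain the pipeline
    fwd_msg[0] = clip[step] if step < K else None    # frame x_t enters the first module at step t
    new_fwd, new_bwd = [None] * (D + 1), [None] * (D + 1)

    for i in range(D):                               # conceptually, all modules run in parallel
        W, b = params[i]
        # Forward: consume whatever arrived from below last step, overwriting the stored
        # activation even if an older frame's gradient has not come back down yet.
        if fwd_msg[i] is not None:
            in_act[i] = fwd_msg[i]
            pre_act[i] = in_act[i] @ W + b
            new_fwd[i + 1] = np.maximum(pre_act[i], 0.0)
        # Backward: consume whatever gradient arrived from above last step, pairing it
        # with the *latest* stored activation, which may originate from a newer frame.
        if bwd_msg[i + 1] is not None and in_act[i] is not None:
            g_pre = bwd_msg[i + 1] * (pre_act[i] > 0)
            new_bwd[i] = g_pre @ W.T
            params[i] = (W - alpha * np.outer(in_act[i], g_pre), b - alpha * g_pre)

    # Loss module (assumed to take no time): emit a gradient whenever a new h_D arrives.
    if new_fwd[D] is not None:
        new_bwd[D] = 2.0 * (new_fwd[D] - target)     # grad of ||h_D - y||^2 w.r.t. h_D

    fwd_msg, bwd_msg = new_fwd, new_bwd
```

In a real depth-parallel implementation each module would run on its own device and the forward/backward messages would be exchanged between devices once per computation step; the sequential loop above only emulates that schedule.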