Scale-space flow for end-to-end optimized video compression

Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, George Toderici
Google Research, Perception Team
{eirikur, dminnen, nickj, jballe, sjhwang, gtoderici}@google.com

Abstract

Despite considerable progress on end-to-end optimized deep networks for image compression, video coding remains a challenging task. Recently proposed methods for learned video compression use optical flow and bilinear warping for motion compensation and show competitive rate–distortion performance relative to hand-engineered codecs like H.264 and HEVC. However, these learning-based methods rely on complex architectures and training schemes, including the use of pre-trained optical flow networks, sequential training of sub-networks, adaptive rate control, and buffering intermediate reconstructions to disk during training. In this paper, we show that a generalized warping operator that better handles common failure cases, e.g. disocclusions and fast motion, can provide competitive compression results with a greatly simplified model and training procedure. Specifically, we propose scale-space flow, an intuitive generalization of optical flow that adds a scale parameter to allow the network to better model uncertainty. Our experiments show that a low-latency video compression model (no B-frames) using scale-space flow for motion compensation can outperform analogous state-of-the-art learned video compression models while being trained using a much simpler procedure and without any pre-trained optical flow networks.

1. Introduction

Recently, there has been significant progress in the area of end-to-end optimized image compression, which went from barely matching JPEG [33] to methods such as [8, 26, 5] that can outperform the best hand-engineered codecs when evaluated in terms of multi-scale structural similarity (MS-SSIM) [36], PSNR, and subjective quality assessments from user studies. While this is very encouraging, over 60% of downstream internet traffic currently consists of streaming video data [1], which means that in order to maximize impact on bandwidth reduction, researchers should focus on video compression.

Figure 1. Our proposed scale-space warping module. From the source image x, we construct a fixed-resolution scale-space volume X. In contrast to bilinear warping, where the warped output is sampled directly from the 2-D source image using a 2-channel displacement field $(f_x, f_y)$, we trilinearly sample from the 3-D scale-space volume using a 3-channel displacement + scale field $(g_x, g_y, g_z)$. The scale value gives a continuous, differentiable knob that can adaptively blur the source image when warping if the warp is not a good prediction of the target image.

Since the area of neural video compression is in early stages, it is not yet clear which network architectures are most effective for different application scenarios. We can roughly categorize the existing research methods into the following three categories:

1) 3D autoencoders are a natural extension of the work done for learned image compression, but [27] demonstrated that representing video using spatiotemporal transformations alone does not lead to better performance compared to standard methods. However, when combined with temporally conditioned entropy models [19], such methods can perform on par with standard methods in terms of MS-SSIM.

2) Frame interpolation methods use neural networks to temporally interpolate between frames in a video and then encode the residuals [38, 17].
This approach is commonly used in standard video coding (called "bidirectional prediction" or "B-frame coding") [37], but has the disadvantage that it is generally not suitable for low-latency streaming since such methods need information "from the future" to decode each B-frame. However, in standard codecs, the use of B-frames typically provides the best rate–distortion (RD) performance when low-latency decoding is not required.

3) Motion compensation via optical flow is based on estimating and compressing optical flow, which is applied with bilinear warping to a previously decoded frame to obtain a prediction of the frame currently being encoded [24, 30]. The residual error is then separately compressed to reduce total distortion and minimize temporal error accumulation. Recently published methods in this setting achieve compression that outperforms H.264 in terms of PSNR and that outperforms HEVC in terms of MS-SSIM [24, 30]. However, these methods rely on complex architectures and training schemes, such as pre-trained optical flow networks [24], sequential training of sub-networks [30, 24], adaptive rate control during encoding [30], and buffering intermediate reconstructions to disk during training [24].

Our research focuses on the third class of approaches, since it provides a good balance between rate–distortion performance and applicability to low-latency video streaming. However, we argue that using pre-trained optical flow networks [24] and bilinear warping [24, 30] may not be ideal for motion compensation:

1. General flow estimation needs to solve the aperture problem, which is not an issue for compression, so the model needlessly solves a harder problem than required. Moreover, optical flow networks aim to minimize motion vector error, while compression seeks to minimize a compromise between bitrate (the entropy of the latent representation of the flow and residual) and distortion (reconstruction error).

2. The need to rely on existing optical flow network architectures thus potentially adds unnecessary constraints and complexity to the design of compression networks.

3. The best optical flow models require a supervised training stage for state-of-the-art performance, which relies on annotated flow data, complicates the training procedure, and limits the domains of applicability.

4. Unlike standard video codecs that use motion compensation vectors, optical flow is dense, meaning that every pixel is warped. Since there is no concept of "not using" a flow prediction, unnecessarily large residuals are expected in the case of disocclusions.

To address these concerns, we propose generalizing optical flow and bilinear warping to scale-space flow and scale-space warping (see Figure 1), where a scale field is added as a third dimension to the typical 2-channel flow field. This per-location scale parameter allows the warping operation to better handle difficult cases and to more gracefully degrade when no flow-based prediction is possible. The scale dimension allows the model to learn to adaptively blur the source content before warping based on how well it predicts the next frame. Intuitively, this should lead to a smaller intermediate residual error and, in turn, to a more compressible residual since the model won't need to spend as many bits to "undo errors" introduced by the warping step.

Figure 2. Overview of our end-to-end optimized, low-latency compression system: 1) the scale-space flow is jointly estimated and encoded to a quantized latent, $[w_i]$; 2) the previous reconstruction, $\hat{x}_{i-1}$, is warped using the decoded scale-space flow field, $g_i$, yielding the prediction, $\bar{x}_i$; 3) the remaining residual, $r_i = x_i - \bar{x}_i$, is encoded to a quantized latent, $[v_i]$, and is decoded to $\hat{r}_i$, which is added to the warped prediction to get the final reconstruction, $\hat{x}_i = \bar{x}_i + \hat{r}_i$. All of the encoder and decoder networks are simple four-layer CNNs trained concurrently after random initialization.

Furthermore, we show that a scale-space warping operation integrated into a simple low-latency compression pipeline (depicted in Figure 2) can yield rate–distortion results outperforming recent state-of-the-art learning-based methods. Specifically, for equal PSNR, our method provides an average Bjøntegaard Delta (BD) rate reduction [12] of 13.4% compared to [24] and a savings of 42.9% over [38], while we see a 30.3% savings over [19] for equal MS-SSIM (see Section 5 for a detailed evaluation). Compared to prior approaches for flow-based motion compensation [24, 30], our system is significantly simpler since we do not need to separately estimate flow or use pre-trained networks. We also do not need to use advanced training or encoding strategies such as buffering reconstructions [24] or spatially adaptive rate control [30].

Our ablation studies show that compared to bilinear warping, the proposed scale-space warping significantly improves the rate–distortion performance with gains of more than 1 dB at some bitrates (see Section 5 for details).

In summary, our contributions are the following:

1. We propose scale-space flow and warping, an intuitive generalization of flow + bilinear warping that reduces the need for complex residuals in failure cases.

2. Using a simple architecture and training procedure, we are able to train our model end-to-end without utilizing a pre-trained optical flow network.

3. Our experiments show that scale-space flow outperforms recent state-of-the-art models such as [24, 19], while our ablation study shows that the same system trained for flow and bilinear warping performs significantly worse.

2. Related Work

Image Compression
Research on learning-based image compression [7, 10, 32, 4, 29, 8, 25, 5] has shown significant progress in terms of rate–distortion performance compared to standard codecs such as JPEG [34], JPEG2000 [21] and BPG [11]. Recent state-of-the-art models [40, 15, 26] use hyperprior-based architectures [8] with improvements including autoregressive context models [26] and multi-rate training [15]. We consider these models to be foundational building blocks for learned compression and use the hyperprior architecture as part of our video compression model.

Standard Video Compression
There is a long history of progress for hand-engineered video compression algorithms used to create video format standards. Compression rates have progressively improved, e.g., from H.263 [16], to H.264 [31] and more recently to HEVC [3].
These codecs provide a strong baseline for assessing the quality of learned video compression models, and HEVC in particular remains a strong competitor that often outperforms state-of-the-art learning-based methods.

Learned Video Compression
As mentioned above, recent work on learned video compression roughly falls into three categories, of which motion compensation via optical flow is most related to our work. The architecture we adopt can be viewed as a greatly simplified version of the method in [24], which uses a pre-trained flow network [28] combined with a flow compression module. In contrast, we directly learn the motion estimation module from scratch (see Scale Space Flow Encoder in Figure 2), which jointly estimates and encodes the motion from the current input frame and the previous reconstruction.

The training process of [24] happens in sequential steps: the I-frame model is trained first and then the P-frame model, which only sees one frame at a time, is optimized. To ensure the P-frame model can handle its own output as input, reconstructions from the P-frame model are buffered to disk during training and fed back to the model. This complicates the training process and means that the P-frame model is trained using "stale" reconstructions from an older version of the model. In contrast, we concurrently train the I-frame and P-frame models from scratch, unrolling the P-frame model over multiple frames during training, which greatly simplifies the training procedure.

Scale-space for flow estimation
The use of scale-space techniques has a long history in optical flow estimation, both with classical techniques (e.g. [6, 18, 13]) as well as the use of multi-scale pyramids in deep flow estimation networks [28, 14]. However, these works make use of the scale-space only for flow estimation, while the final result is still a standard 2-channel displacement field. In contrast, our estimated 3-channel scale-space flow directly integrates into our proposed scale-space warping operation (see Figure 1), irrespective of whether a scale-space or multi-scale pyramid is used to estimate it.

Uncertainty estimates for optical flow
The scale parameter of our proposed scale-space flow (see Figure 1) can be interpreted as an "uncertainty parameter" in the sense that it is natural to use a high scale value in regions where it is not feasible to obtain a good prediction via warping. While prior work on supervised optical flow studied how to integrate uncertainty into the predictions of flow estimation networks (see [20] for an overview), such methods operate in the supervised setting: i.e., they predict the uncertainty in the prediction of ground truth flow. In contrast, this work focuses on generalizing the flow + warping operations so that the warped result forms a good prediction irrespective of the relationship between the displacement field and ground truth flow.

3. Method

3.1. Scale-space flow

Our proposed scale-space flow (see Figure 1 for an overview) generalizes flow and bilinear warping to also incorporate Gaussian blurring. Given an image x with a spatial shape of H × W and a flow field $f = (f_x, f_y)$, the bilinear warping of x by f is denoted as

$$x' := \text{Bilinear-Warp}(x, f) \quad \text{s.t.} \quad x'[x, y] = x[x + f_x[x, y],\, y + f_y[x, y]] \tag{1}$$

where x[x, y] denotes sampling the image x at (continuous) coordinates (x, y) using bilinear interpolation.
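To make the sampling convention concrete, below is a minimal NumPy/SciPy sketch of Eq. (1). The function name bilinear_warp and the row/column axis convention are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def bilinear_warp(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp an H x W image by a flow field of shape (2, H, W).

    flow[0] holds row displacements and flow[1] column displacements;
    order=1 selects bilinear interpolation, as in Eq. (1).
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows + flow[0], cols + flow[1]])
    # mode="nearest" clamps out-of-bounds samples to the image border.
    return map_coordinates(image, coords, order=1, mode="nearest")
```

An all-zero flow field returns the image unchanged, and since the sampling is differentiable almost everywhere in the flow values, the analogous operation in a deep learning framework can be trained end-to-end.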
We refer to the flow channels $f_x, f_y \in \mathbb{R}^{H \times W}$ as the x- and y-displacement fields of the flow f.

For scale-space warping, we construct a fixed-resolution scale-space volume $X = [x,\, x * G(\sigma_0),\, x * G(2\sigma_0),\, \dots,\, x * G(2^{M-1}\sigma_0)]$, where $x * G(\sigma)$ denotes the convolution of x with a Gaussian kernel with scale σ. X represents a stack of progressively blurred versions of x with dimensions H × W × (M + 1), which we can sample at continuous coordinates (x, y, z) via trilinear interpolation.

We can now define a scale-space flow field as a 3-channel field $g := (g_x, g_y, g_z)$, and the corresponding scale-space warp of the image x as

$$x' := \text{Scale-Space-Warp}(x, g) \quad \text{s.t.} \quad x'[x, y] = X[x + g_x[x, y],\, y + g_y[x, y],\, g_z[x, y]] \tag{2}$$

We refer to the newly introduced third flow channel $g_z \in \mathbb{R}_+^{H \times W}$ as the scale field of the scale-space flow g.

Figure 3. Visualization of the internals of our model, showing the previous reconstruction $\hat{x}_{i-1}$, the displacement field $(g_x, g_y)$, the scale field $g_z$, the scale-space warped prediction $\bar{x}_i$, the decoded residual $\hat{r}_i$, and the final reconstruction $\hat{x}_i$. The network learns to predict spatial flow even for a crowded scene. Note how the scale parameter increases around the boundaries of the people, where warping is least likely to provide an accurate reconstruction. Similarly, in the bottom left corner of the image, the motion of the hands is not modeled well by warping, so the network predicts a larger scale value that results in a blurrier intermediate reconstruction that ultimately helps minimize the global RD loss.

We note that Scale-Space-Warp is strictly more general than bilinear warping: setting $g_z = 0$ everywhere samples the unblurred slice of X, recovering Eq. (1).
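As an illustration, the sketch below builds the scale-space volume and implements Eq. (2) via trilinear sampling. The names scale_space_volume, scale_space_warp, sigma0, and num_scales are ours, and the base scale sigma0 = 1.5 is an arbitrary assumption; the text above does not fix these values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def scale_space_volume(image: np.ndarray, sigma0: float = 1.5,
                       num_scales: int = 5) -> np.ndarray:
    """Stack of progressively blurred copies of an H x W image.

    Slice 0 is the sharp image; slice m >= 1 is image * G(2**(m-1) * sigma0),
    giving shape (num_scales + 1, H, W) (scale axis first for convenience).
    """
    slices = [image] + [gaussian_filter(image, 2**m * sigma0)
                        for m in range(num_scales)]
    return np.stack(slices)

def scale_space_warp(image: np.ndarray, g: np.ndarray, sigma0: float = 1.5,
                     num_scales: int = 5) -> np.ndarray:
    """Warp by a field g of shape (3, H, W): row and column displacements
    plus a continuous scale coordinate g[2] in [0, num_scales]."""
    vol = scale_space_volume(image, sigma0, num_scales)
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Trilinear sampling: linear along the scale axis and both spatial axes.
    coords = np.stack([g[2], rows + g[0], cols + g[1]])
    return map_coordinates(vol, coords, order=1, mode="nearest")
```

Setting g[2] to zero everywhere reads from the unblurred slice and reproduces bilinear_warp above, which is the sense in which scale-space warping strictly generalizes bilinear warping.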
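To show where this warp sits in the system of Figure 2, here is a hypothetical sketch of one P-frame coding step. flow_codec and residual_codec are stand-in callables that encode their input to a quantized latent and return the decoded result; the paper realizes them as jointly trained autoencoder networks, which are not reproduced here.

```python
import numpy as np

def code_p_frame(x_i: np.ndarray, x_hat_prev: np.ndarray,
                 flow_codec, residual_codec) -> np.ndarray:
    """One step of the low-latency loop of Figure 2 (reconstruction only)."""
    # 1) Jointly estimate and encode the scale-space flow from the current
    #    frame and the previous reconstruction; g_i is the decoded field.
    g_i = flow_codec(x_i, x_hat_prev)            # shape (3, H, W)
    # 2) Warp the previous reconstruction to predict the current frame.
    x_bar_i = scale_space_warp(x_hat_prev, g_i)
    # 3) Encode/decode the residual; x_hat_i = x_bar_i + r_hat_i.
    r_hat_i = residual_codec(x_i - x_bar_i)
    return x_bar_i + r_hat_i
```

Unrolling this step over several frames during training, with each reconstruction fed back as x_hat_prev, is what lets the model learn to handle its own coding errors without buffering reconstructions to disk.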