Scale-space flow for end-to-end optimized video compression

Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, George Toderici
Google Research, Perception Team
{eirikur, dminnen, nickj, jballe, sjhwang, gtoderici}@google.com

Abstract

Despite considerable progress on end-to-end optimized deep networks for image compression, video coding remains a challenging task. Recently proposed methods for learned video compression use optical flow and bilinear warping for motion compensation and show competitive rate–distortion performance relative to hand-engineered codecs like H.264 and HEVC. However, these learning-based methods rely on complex architectures and training schemes, including the use of pre-trained optical flow networks, sequential training of sub-networks, adaptive rate control, and buffering intermediate reconstructions to disk during training. In this paper, we show that a generalized warping operator that better handles common failure cases, e.g. disocclusions and fast motion, can provide competitive compression results with a greatly simplified model and training procedure. Specifically, we propose scale-space flow, an intuitive generalization of optical flow that adds a scale parameter to allow the network to better model uncertainty. Our experiments show that a low-latency video compression model (no B-frames) using scale-space flow for motion compensation can outperform analogous state-of-the-art learned video compression models while being trained using a much simpler procedure and without any pre-trained optical flow networks.

1. Introduction

Recently, there has been significant progress in the area of end-to-end optimized image compression, which went from barely matching JPEG [33] to methods such as [8, 26, 5] that can outperform the best hand-engineered codecs when evaluated in terms of multi-scale structural similarity (MS-SSIM) [36], PSNR, and subjective quality assessments from user studies. While this is very encouraging, over 60% of downstream internet traffic currently consists of streaming video data [1], which means that in order to maximize impact on bandwidth reduction, researchers should focus on video compression.

Figure 1. Our proposed scale-space warping module. From the source image x, we construct a fixed-resolution scale-space volume X. In contrast to bilinear warping, where the warped output is sampled directly from the 2-D source image using a 2-channel displacement field $(f_x, f_y)$, we trilinearly sample from the 3-D scale-space volume using a 3-channel displacement + scale field $(g_x, g_y, g_z)$. The scale value gives a continuous, differentiable knob that can adaptively blur the source image when warping if the warp is not a good prediction of the target image.

Since the area of neural video compression is in early stages, it is not yet clear which network architectures are most effective for different application scenarios. We can roughly categorize the existing research methods into the following three categories:

1) 3D autoencoders are a natural extension of the work done for learned image compression, but [27] demonstrated that representing video using spatiotemporal transformations alone does not lead to better performance compared to standard methods. However, when combined with temporally conditioned entropy models [19], such methods can perform on par with standard methods in terms of MS-SSIM.

2) Frame interpolation methods use neural networks to temporally interpolate between frames in a video and then encode the residuals [38, 17].
This approach is commonly used in standard video coding (called "bidirectional prediction" or "B-frame coding") [37], but has the disadvantage that it is generally not suitable for low-latency streaming since such methods need information "from the future" to decode each B-frame. However, in standard codecs, the use of B-frames typically provides the best rate–distortion (RD) performance when low-latency decoding is not required.

3) Motion compensation via optical flow is based on estimating and compressing optical flow, which is applied with bilinear warping to a previously decoded frame to obtain a prediction of the frame currently being encoded [24, 30]. The residual error is then separately compressed to reduce total distortion and minimize temporal error accumulation. Recently published methods in this setting achieve compression that outperforms H.264 in terms of PSNR and that outperforms HEVC in terms of MS-SSIM [24, 30]. However, these methods rely on complex architectures and training schemes, such as pre-trained optical flow networks [24], sequential training of sub-networks [30, 24], adaptive rate control during encoding [30], and buffering intermediate reconstructions to disk during training [24].

Our research focuses on the third class of approaches, since it provides a good balance between rate–distortion performance and applicability to low-latency video streaming. However, we argue that using pre-trained optical flow networks [24] and bilinear warping [24, 30] may not be ideal for motion compensation:

1. General flow estimation needs to solve the aperture problem, which is not an issue for compression, so the model needlessly solves a harder problem than required. Moreover, optical flow networks aim to minimize motion vector error, while compression seeks to minimize a compromise between bitrate (the entropy of the latent representation of the flow and residual) and distortion (reconstruction error).

2. The need to rely on existing optical flow network architectures thus potentially adds unnecessary constraints and complexity to the design of compression networks.

3. The best optical flow models require a supervised training stage for state-of-the-art performance, which relies on annotated flow data, complicates the training procedure, and limits the domains of applicability.

4. Unlike standard video codecs that use motion compensation vectors, optical flow is dense, meaning that every pixel is warped. Since there is no concept of "not using" a flow prediction, unnecessarily large residuals are expected in the case of disocclusions.

To address these concerns, we propose generalizing optical flow and bilinear warping to scale-space flow and scale-space warping (see Figure 1), where a scale field is added as a third dimension to the typical 2-channel flow field. This per-location scale parameter allows the warping operation to better handle difficult cases and to more gracefully degrade when no flow-based prediction is possible. The scale dimension allows the model to learn to adaptively blur the source content before warping based on how well it predicts the next frame. Intuitively, this should lead to a smaller intermediate residual error and, in turn, to a more compressible residual since the model won't need to spend as many bits to "undo errors" introduced by the warping step.

Figure 2. Overview of our end-to-end optimized, low-latency compression system: 1) the scale-space flow is jointly estimated and encoded to a quantized latent, $[w_i]$; 2) the previous reconstruction, $\hat{x}_{i-1}$, is warped using the decoded scale-space flow field, $g_i$, yielding the prediction, $\bar{x}_i$; 3) the remaining residual, $r_i = x_i - \bar{x}_i$, is encoded to a quantized latent, $[v_i]$, and is decoded to $\hat{r}_i$, which is added to the warped prediction to get the final reconstruction, $\hat{x}_i = \bar{x}_i + \hat{r}_i$. All of the encoder and decoder networks are simple four-layer CNNs trained concurrently after random initialization.

Furthermore, we show that a scale-space warping operation integrated into a simple low-latency compression pipeline (depicted in Figure 2) can yield rate–distortion results outperforming recent state-of-the-art learning-based methods. Specifically, for equal PSNR, our method provides an average Bjøntegaard Delta (BD) rate reduction [12] of 13.4% compared to [24] and a savings of 42.9% over [38], while we see a 30.3% savings over [19] for equal MS-SSIM (see Section 5 for a detailed evaluation). Compared to prior approaches for flow-based motion compensation [24, 30], our system is significantly simpler since we do not need to separately estimate flow or use pre-trained networks. We also do not need to use advanced training or encoding strategies such as buffering reconstructions [24] or spatially adaptive rate control [30].

Our ablation studies show that compared to bilinear warping, the proposed scale-space warping significantly improves the rate–distortion performance with gains of more than 1 dB at some bitrates (see Section 5 for details).

In summary, our contributions are the following:

1. We propose scale-space flow and warping, an intuitive generalization of flow + bilinear warping that reduces the need for complex residuals in failure cases.

2. Using a simple architecture and training procedure, we are able to train our model end-to-end without utilizing a pre-trained optical flow network.

3. Our experiments show that scale-space flow outperforms recent state-of-the-art models such as [24, 19], while our ablation study shows that the same system trained for flow and bilinear warping performs significantly worse.

2. Related Work

Image Compression
Research on learning-based image compression [7, 10, 32, 4, 29, 8, 25, 5] has shown significant progress in terms of rate–distortion performance compared to standard codecs such as JPEG [34], JPEG2000 [21] and BPG [11]. Recent state-of-the-art models [40, 15, 26] use hyperprior-based architectures [8] with improvements including autoregressive context models [26] and multi-rate training [15]. We consider these models to be foundational building blocks for learned compression and use the hyperprior architecture as part of our video compression model.

Standard Video Compression
There is a long history of progress for hand-engineered video compression algorithms used to create video format standards. Compression rates have progressively improved, e.g., from H.263 [16], to H.264 [31] and more recently to HEVC [3].
These codecs provide a strong baseline for assessing the quality of learned video compression models, and HEVC in particular remains a strong competitor that often outperforms state-of-the-art learning-based methods.

Learned Video Compression
As mentioned above, recent work on learned video compression roughly falls into three categories, of which motion compensation via optical flow is most related to our work. The architecture we adopt can be viewed as a greatly simplified version of the method in [24], which uses a pre-trained flow network [28] combined with a flow compression module. In contrast, we directly learn the motion estimation module from scratch (see Scale Space Flow Encoder in Figure 2), which jointly estimates and encodes the motion from the current input frame and the previous reconstruction.

The training process of [24] happens in sequential steps: the I-frame model is trained first and then the P-frame model, which only sees one frame at a time, is optimized. To ensure the P-frame model can handle its own output as input, reconstructions from the P-frame model are buffered to disk during training and fed back to the model. This complicates the training process and means that the P-frame model is trained using "stale" reconstructions from an older version of the model. In contrast, we concurrently train the I-frame and P-frame models from scratch, unrolling the P-frame model over multiple frames during training, which greatly simplifies the training procedure.

Scale-space for flow estimation
The use of scale-space techniques has a long history in optical flow estimation, both with classical techniques (e.g. [6, 18, 13]) as well as the use of multi-scale pyramids in deep flow estimation networks [28, 14]. However, these works make use of the scale-space only for flow estimation, while the final result is still a standard 2-channel displacement field. In contrast, our estimated 3-channel scale-space flow directly integrates into our proposed scale-space warping operation (see Figure 1), irrespective of whether a scale-space or multi-scale pyramid is used to estimate it.

Uncertainty estimates for optical flow
The scale parameter of our proposed scale-space flow (see Figure 1) can be interpreted as an "uncertainty parameter" in the sense that it is natural to use a high scale value in regions where it is not feasible to obtain a good prediction via warping. While prior work on supervised optical flow studied how to integrate uncertainty into the predictions of flow estimation networks (see [20] for an overview), such methods operate in the supervised setting: i.e., they predict the uncertainty in the prediction of ground truth flow. In contrast, this work focuses on generalizing the flow + warping operations so that the warped result forms a good prediction irrespective of the relationship between the displacement field and ground truth flow.

3. Method

3.1. Scale-space flow

Our proposed scale-space flow (see Figure 1 for an overview) generalizes flow and bilinear warping to also incorporate Gaussian blurring. Given an image x with a spatial shape of H × W and a flow field $f = (f_x, f_y)$, the bilinear warping of x by f is denoted as

$$x' := \text{Bilinear-Warp}(x, f) \quad \text{s.t.} \quad x'[x, y] = x[x + f_x[x, y],\, y + f_y[x, y]] \tag{1}$$

where x[x, y] denotes sampling the image x at (continuous) coordinates (x, y) using bilinear interpolation.
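To make the sampling convention concrete, below is a minimal NumPy/SciPy sketch of Eq. (1). The function name bilinear_warp and the row/column axis convention are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def bilinear_warp(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp an H x W image by a flow field of shape (2, H, W).

    flow[0] holds row displacements and flow[1] column displacements;
    order=1 selects bilinear interpolation, as in Eq. (1).
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows + flow[0], cols + flow[1]])
    # mode="nearest" clamps out-of-bounds samples to the image border.
    return map_coordinates(image, coords, order=1, mode="nearest")
```

An all-zero flow field returns the image unchanged, and since the sampling is differentiable almost everywhere in the flow values, the analogous operation in a deep learning framework can be trained end-to-end.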
We refer to the flow channels $f_x, f_y \in \mathbb{R}^{H \times W}$ as the x- and y-displacement fields of the flow f.

For scale-space warping, we construct a fixed-resolution scale-space volume $X = [x,\, x * G(\sigma_0),\, x * G(2\sigma_0),\, \dots,\, x * G(2^{M-1}\sigma_0)]$, where $x * G(\sigma)$ denotes the convolution of x with a Gaussian kernel with scale σ. X represents a stack of progressively blurred versions of x with dimensions H × W × (M + 1), which we can sample at continuous coordinates (x, y, z) via trilinear interpolation.

We can now define a scale-space flow field as a 3-channel field $g := (g_x, g_y, g_z)$, and the corresponding scale-space warp of the image x as

$$x' := \text{Scale-Space-Warp}(x, g) \quad \text{s.t.} \quad x'[x, y] = X[x + g_x[x, y],\, y + g_y[x, y],\, g_z[x, y]] \tag{2}$$

We refer to the newly introduced third flow channel $g_z \in \mathbb{R}_+^{H \times W}$ as the scale field of the scale-space flow g.

Figure 3. Visualization of the internals of our model, showing the previous reconstruction $\hat{x}_{i-1}$, the displacement field $(g_x, g_y)$, the scale field $g_z$, the scale-space warped prediction $\bar{x}_i$, the decoded residual $\hat{r}_i$, and the final reconstruction $\hat{x}_i$. The network learns to predict spatial flow even for a crowded scene. Note how the scale parameter increases around the boundaries of the people, where warping is least likely to provide an accurate reconstruction. Similarly, in the bottom left corner of the image, the motion of the hands is not modeled well by warping, so the network predicts a larger scale value that results in a blurrier intermediate reconstruction that ultimately helps minimize the global RD loss.

We note that Scale-Space-Warp is strictly more general than bilinear warping: setting $g_z = 0$ everywhere samples the unblurred slice of X, recovering Eq. (1).
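As an illustration, the sketch below builds the scale-space volume and implements Eq. (2) via trilinear sampling. The names scale_space_volume, scale_space_warp, sigma0, and num_scales are ours, and the base scale sigma0 = 1.5 is an arbitrary assumption; the text above does not fix these values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def scale_space_volume(image: np.ndarray, sigma0: float = 1.5,
                       num_scales: int = 5) -> np.ndarray:
    """Stack of progressively blurred copies of an H x W image.

    Slice 0 is the sharp image; slice m >= 1 is image * G(2**(m-1) * sigma0),
    giving shape (num_scales + 1, H, W) (scale axis first for convenience).
    """
    slices = [image] + [gaussian_filter(image, 2**m * sigma0)
                        for m in range(num_scales)]
    return np.stack(slices)

def scale_space_warp(image: np.ndarray, g: np.ndarray, sigma0: float = 1.5,
                     num_scales: int = 5) -> np.ndarray:
    """Warp by a field g of shape (3, H, W): row and column displacements
    plus a continuous scale coordinate g[2] in [0, num_scales]."""
    vol = scale_space_volume(image, sigma0, num_scales)
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Trilinear sampling: linear along the scale axis and both spatial axes.
    coords = np.stack([g[2], rows + g[0], cols + g[1]])
    return map_coordinates(vol, coords, order=1, mode="nearest")
```

Setting g[2] to zero everywhere reads from the unblurred slice and reproduces bilinear_warp above, which is the sense in which scale-space warping strictly generalizes bilinear warping.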
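To show where this warp sits in the system of Figure 2, here is a hypothetical sketch of one P-frame coding step. flow_codec and residual_codec are stand-in callables that encode their input to a quantized latent and return the decoded result; the paper realizes them as jointly trained autoencoder networks, which are not reproduced here.

```python
import numpy as np

def code_p_frame(x_i: np.ndarray, x_hat_prev: np.ndarray,
                 flow_codec, residual_codec) -> np.ndarray:
    """One step of the low-latency loop of Figure 2 (reconstruction only)."""
    # 1) Jointly estimate and encode the scale-space flow from the current
    #    frame and the previous reconstruction; g_i is the decoded field.
    g_i = flow_codec(x_i, x_hat_prev)            # shape (3, H, W)
    # 2) Warp the previous reconstruction to predict the current frame.
    x_bar_i = scale_space_warp(x_hat_prev, g_i)
    # 3) Encode/decode the residual; x_hat_i = x_bar_i + r_hat_i.
    r_hat_i = residual_codec(x_i - x_bar_i)
    return x_bar_i + r_hat_i
```

Unrolling this step over several frames during training, with each reconstruction fed back as x_hat_prev, is what lets the model learn to handle its own coding errors without buffering reconstructions to disk.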