Self-Learning Video Rain Streak Removal: When Cyclic Consistency Meets Temporal Correspondence
¹City University of Hong Kong   ²National University of Singapore   ³Peking University   ⁴Yale-NUS College

Abstract

In this paper, we address the problem of rain streak removal in videos by developing a self-learned rain streak removal method that requires no clean ground-truth images in the training process. The method is inspired by two observations: adjacent frames are highly correlated and can be regarded as different versions of the same scene, and rain streaks are distributed randomly along the temporal dimension. Based on these observations, we construct a two-stage Self-Learned Deraining Network (SLDNet) that removes rain streaks based on temporal correlation and consistency. In the first stage, SLDNet exploits temporal correlation and learns to predict a clean version of the current frame from its adjacent rain video frames. In the second stage, SLDNet enforces temporal consistency among different frames; it takes both the current rain frame and the adjacent rain video frames to recover structural details. The first stage is responsible for reconstructing the main structures, and the second stage for extracting the structural details. Our network architecture further incorporates two sub-tasks, namely motion estimation and rain region detection, which are jointly optimized. Extensive experiments demonstrate the effectiveness of our method, with better results both quantitatively and qualitatively.

1. Introduction

Rain is a common adverse weather condition that causes a series of visibility degradations in captured videos and images. The presence of rain not only leads to poor visual quality but also impairs existing computer vision systems that take clean video frames as input.

*Corresponding author. E-mail: liujiaying@pku.edu.cn. This work was supported in part by the National Natural Science Foundation of China under contract No. 61772043, in part by the Beijing Natural Science Foundation under contract No. L182002, in part by the National Key R&D Program of China under contract No. 2018AAA0102700, and in part by the Hong Kong ITF UICP under grant No. 9440203. Robby T. Tan's work in this research is supported by MOE2019-T2-1-130.

[Figure 1. Visual comparison of different deraining methods on real rain video frames with large motion: (a) rain frame, (b) FastDeRain [22], (c) MS-CSC [28]. Compared with MS-CSC [28] and FastDeRain [22], our self-learned method produces no artifacts and is more effective in removing rain streaks. Note that SpacCNN is a fully supervised method, while ours is self-learned and needs no clean ground-truth videos in the training phase.]

Rain streaks are the most common type of rain degradation. They can partially occlude the background scene, change the image appearance, blur the scene, and so on. Besides rain streaks, rain also produces fog-like veiling effects and raindrops adhered to the lens or windshield. In this paper, we focus on rain streak removal. Our method learns from the rain videos themselves and does not need any clean ground-truth videos in the training process.

Some existing methods [24, 20, 38, 34] focus on separating the rain-free background image (clean image) and rain streaks based on spatial redundancy and the appearance of detailed textures. Several mathematical models extract discriminative features to separate the two layers, e.g., frequency-domain representations [24], sparse representations [34], Gaussian mixture models [29], and deep networks [47, 14]. Beyond exploiting spatial redundancy, video-based methods [1, 2, 3, 9, 12, 15, 17, 18, 53] also utilize temporal correlation and context to address the problem. The earliest methods [17, 15, 18] exploit physical and photometric properties, i.e., the directional and chromatic properties of rain streaks.
Our temporalconsistency constraint forces the generated result (with theadded texture details) to be close to other adjacent alignedrain video frames. Along the way, some motion prior andrain-related prior are injected into our method.In summary, our contributions are as follows.• We propose a self-learned video rain streak removalmethod that can learn solely from input videos. To ourknowledge, this is the first attempt in video-based rainstreak removal literature. Integrated with both tempo-ral correlation and consistency, the proposed deep net-work, first, infers the main structures of a clean frame,and then recovers the details.• Besides the temporal correlation and consistency con-straint, we further inject priors of rain videos, i.e. back-ground motion and rain location information, to bene-fit rain streaks removal without requiring any pairedrain-free ground-truths in the network training. Theseconstraints/priors can possibly open up further the ex-ploration of self-learned video rain streak removal.• We propose a framework that jointly optimizes thebackground motion and rain localization while remov-ing rain streaks. Extensive experiments demonstratethe effectiveness of our joint optimization and thus theeffectiveness of our whole method.2. Related WorkAs the rain causes poor visibility, occludes the back-ground scene and blurs the background, rain removal meth-ods are proposed to restore the clean image from a rainone. One branch is the single-image rain streak removal,which aims to infer the clean image solely based on a sin-gle rain image. Many models are developed to capture theintrinsic differences between the rain signal and normal tex-tures based on the spatial redundancy, e.g. generalized lowrank model [9], sparse coding [24], discriminative sparsecoding [34], nonlocal mean filter [25], Gaussian mixturemodel [29], transformed low rank model [4], rain directionprior [51]. In 2017, single-image deraining steps into theera of deep-learning and many deep-learning based meth-ods emerge, including deep detail network [14, 13], jointrain detection and removal [47, 48], density-aware multi-stream densely connected CNN [51], perceptual generativeadversarial network [41]. Later works focus on developingadvanced deep networks [27, 42, 35, 49] or utilizing moreeffective priors [19, 5, 54, 50, 52, 46].Compared with single-image rain removal, video rainstreak removal is capable of utilizing temporal correlationand dynamics to detect and remove rains. Garg and Na-yar propose the seminal work of video rain modeling [17]and rain streak removal methods [15, 18, 16]. Later ap-proaches dig deep to see the intrinsic priors rain streakand normal background signals, i.e. temporal and chromat-ic properties of rain [53, 33], the size, shape and orienta-tion of rain streaks [3, 2], phase congruency features [37],Fourier domain feature [1], spatio-temporal correlation ofpatch groups [9], rain directional prior of rain streaks [23],Gaussian mixture model [6], Bayes rain detector [39, 40],two-stage detection and refinement based on SVM [26],patch-based mixtures of Gaussian[44], matrix decomposi-tion [36]. Recently, deep-learning based methods bring sig-nificant changes to video deraining with augmented capac-ities and flexibilities. In [28], Li et al. apply a multiscaleconvolutional sparse coding to remove the rain streaks withdifferent scales. Chen et al. 
Chen et al. [8] propose to first segment superpixels from a rain frame and then estimate rain-free superpixels under a consistency constraint among the aligned superpixels; after that, to compensate for lost details, a CNN is further used to add normal textures to the final results. In [31], Liu et al. build a recurrent neural network that seamlessly integrates rain degradation classification, rain removal, and background detail reconstruction. In [32], a hybrid rain model is proposed to capture both rain streaks and occlusions; it is then injected into a dynamic routing residue recurrent network together with motion segmentation context information. In [45], a two-stage recurrent network with dual-level flow regularizations is built to perform the inverse recovery process of the rain synthesis model for video deraining.

Previous works are either model-based, designed with hand-crafted features, or data-driven, relying on synthetic paired data. In our work, we explore possible architectures and priors for self-learning and construct a learnable video deraining network that does not rely on synthesized paired data.

3. Rain Modeling and Self-Learning Constraint

3.1. Rain Video Modeling

We formulate a rain model as:

$$I = B + R, \tag{1}$$

where B is the layer without rain streaks, R is the rain streak layer, and I is the captured image with rain streaks. A video rain synthesis model is obtained by adding a temporal indicator t:

$$I_t = B_t + R_t, \quad t = 1, 2, \ldots, N, \tag{2}$$

where t and N denote the current time-step and the total number of video frames, respectively. The rain streaks R_t are assumed to be independent and identically distributed random samples. There are also more complicated rain synthesis models, e.g., [45], that take into account rain accumulation, flow, etc. In this paper, we only consider the problem of rain streak removal by exploring the information from rain videos.
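To make the role of the i.i.d. assumption on R_t concrete, the following toy sketch (not from the paper; a minimal NumPy illustration with a static background and invented streak density and magnitudes) shows why statistics gathered across aligned frames suppress rain: since rain locations are random over time, even a plain per-pixel temporal median is already close to B.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, N = 64, 64, 7

# Rain-free layer B of Eq. (1); kept static here, so the frames are
# already "aligned" and no motion compensation is needed.
B = rng.uniform(0.2, 0.8, size=(H, W))

frames = []
for t in range(N):
    # R_t: sparse bright streaks at i.i.d. random locations per frame
    R_t = np.zeros((H, W))
    mask = rng.random((H, W)) < 0.03
    R_t[mask] = rng.uniform(0.3, 0.6, size=mask.sum())
    frames.append(np.clip(B + R_t, 0.0, 1.0))   # I_t = B_t + R_t, Eq. (2)

I = np.stack(frames)                             # (N, H, W)

# Rain locations rarely repeat across frames, so the per-pixel temporal
# median suppresses the streaks and approximates the background.
B_est = np.median(I, axis=0)
print("mean |B_est - B|:", np.abs(B_est - B).mean())   # close to zero
```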
3.2. Temporal Cyclic Consistency for Self-Learned Rain Removal

We explore intrinsic constraints and priors that facilitate video rain streak removal even without paired training data. Specifically, our constraints/priors consist of three aspects: temporal correlation, temporal consistency, and rain-related priors.

Temporal Correlation. Adjacent clean video frames are highly correlated. That is, the background signal of a rain frame can be predicted from its adjacent rain video frames, since rain streaks are likely randomly distributed. Therefore, if we predict the current rain frame from the adjacent rain video frames (without the current one), the rain signal will not be predicted and the result will tend to be rain-free. However, when the frames include large motions, it is also challenging to interpolate a frame from its adjacent frames, which can lead to blurred details and artifacts.

Temporal Consistency. Because non-rain background layers are continuous along the temporal dimension, the video frames after motion compensation should be well aligned and show small differences. Comparatively, even when good motion estimation and compensation are achieved, the aligned rain layers remain very different, due to the presence of rain streaks. Hence, it is beneficial for rain streak removal to enforce the model to generate consistent results after motion compensation. However, motion might not be well estimated if large motions are present, and there may be content changes among different frames; in such cases the temporal consistency regularization might also fail. Therefore, in our work, we also include motion estimation as part of our optimization target.

Rain-Related Side Information. Besides the above two constraints connecting rain video frames and their corresponding rain-free versions, we also intend to embed useful side information to guide the deraining process. Rain-dependent features, i.e., the rain mask, can be injected as part of the loss functions, which guide the model to process rain layers adaptively, namely applying rain removal only in the rain regions. Another kind of feature, optical flow, is usually estimated from clean frames, and its estimation is easily contaminated by the appearance of rain. Optical flow estimation has a complicated and intertwined effect on rain streak removal; however, optical flow and rain removal can benefit each other whenever the performance of either one improves. Hence, optical flow estimation is regarded as one part of our whole optimization function.

4. Self-Learned Deraining Network

4.1. Network Architecture

Based on the discussion in the last section, we build a Self-Learned Deraining Network (SLDNet) as shown in Fig. 2, which consists of three parts:

• Warping operation (Fig. 2 (a)). This part extracts the optical flow [11] as the motion information and applies alignment among frames. This module (particularly the optical flow) is jointly optimized with the whole deraining task.

• Prediction Network (PredNet) (Fig. 2 (b)). In the training phase, this network aims to predict the rain-free background layer of the current frame from its adjacent rain video frames.

• EnHancement Network (EHNet) (Fig. 2 (c)). Guided by the rain-free estimation produced by PredNet, we then improve the details via an enhancement network. The network takes the current rain frame and the adjacent rain video frames as input, and generates the residual details under the inter-frame consistency constraint. A rain mask is incorporated into the loss function to reduce the impact of rain streaks in the ground truth on the deraining process, making the network focus only on the useful information in non-rain regions.

[Figure 2. The framework of our proposed Self-Learned Deraining Network (SLDNet). 1) The warping module aligns the neighboring frames to the central one; successive modules make full use of temporal correlation and consistency to create the mapping from rain video frames to clean ones. 2) The Prediction Network (PredNet) predicts the clean version of the current frame from the neighboring rain video frames, taking the rain version of the current frame as the ground truth. 3) The EnHancement Network (EHNet) compensates for the details of the predicted clean layer with both the neighboring and current rain video frames, taking the aligned versions of the neighboring rain video frames as the ground truth. The red arrows denote the direction of the information flow and the blue arrows denote the main related constraints and losses (L_Flow, L_Fid-B, L_Fid-TCor, L_Fid-TCon, and the rain region mask).]

In the subsequent sections, we discuss each part of the network in detail.
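This excerpt specifies the data flow but no layer-level details, so the following PyTorch skeleton is only an assumption-laden sketch of the two-stage pipeline: PredNet consumes only the warped neighbors (3D convolutions over the temporal axis, as labeled in Fig. 2), and EHNet refines the stage-1 estimate residually while also seeing the current rain frame. All layer widths, depths, and the neighbor count are invented for illustration.

```python
import torch
import torch.nn as nn

class PredNet(nn.Module):
    """Stage 1: predict the rain-free central frame from warped neighbors
    only (the current rain frame is deliberately excluded)."""
    def __init__(self, n_neighbors=4, ch=32):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv3d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(ch * n_neighbors, 3, 3, padding=1)

    def forward(self, warped):                    # warped: (B, 3, T, H, W)
        f = self.feat(warped)                     # (B, ch, T, H, W)
        b, c, t, h, w = f.shape
        return self.fuse(f.reshape(b, c * t, h, w))   # B_hat_t: (B, 3, H, W)

class EHNet(nn.Module):
    """Stage 2: residual detail enhancement, now also taking the current
    rain frame; trained against aligned neighbors outside rain regions."""
    def __init__(self, n_neighbors=4, ch=32):
        super().__init__()
        in_ch = 3 * (n_neighbors + 2)             # neighbors + current + B_hat
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, b_hat, current, warped2d):
        x = torch.cat([b_hat, current, warped2d], dim=1)
        return b_hat + self.body(x)               # add texture details back

# Shape check with dummy data (T = 4 warped neighbors):
warped = torch.rand(1, 3, 4, 64, 64)
b_hat = PredNet()(warped)                                     # stage 1
b_hat2 = EHNet()(b_hat, torch.rand(1, 3, 64, 64),
                 warped.permute(0, 2, 1, 3, 4).reshape(1, 12, 64, 64))
print(b_hat.shape, b_hat2.shape)                  # both (1, 3, 64, 64)
```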
4.2. Proposed Networks

1) Optical Flow Estimation and Warping (Fig. 2 (a)). We first estimate the optical flow and warp the input rain video frames. G(·) denotes the process of extracting optical flow from a given image pair:

$$C^{I}_{i \to j} = G(I_i, I_j), \tag{3}$$

$$C^{B}_{i \to j} = G(\hat{B}_i, \hat{B}_j), \tag{4}$$

where the subscript i → j denotes the flow from the i-th frame to the j-th one, and the superscripts I and B denote whether the flow is estimated from the rain images or from the estimated background images. Then, we can warp an image to the j-th time-step based on the estimated flow:

$$\tilde{I}_{i \to j} = \mathcal{W}(I_i, C^{I}_{i \to j}), \tag{5}$$

$$\tilde{B}_{i \to j} = \mathcal{W}(\hat{B}_i, C^{B}_{i \to j}). \tag{6}$$

For simplicity, in Fig. 2 we use $\tilde{I}_i$ to denote $\tilde{I}_{i \to j}$, as j is set to t throughout the whole process. To improve flow estimation accuracy, in the training phase we fine-tune the pretrained optical flow network on the rain video frames and the estimated background layers. After warping, these frames should be well aligned to the current ones, expressed as:

$$\mathcal{L}_{\text{Flow}} = \sum_{i=t-s}^{t-1} \left\| \tilde{I}_{i \to t} - I_t \right\|_2^2 + \sum_{i=t+1}^{t+s} \left\| \tilde{I}_{i \to t} - I_t \right\|_2^2 + \sum_{i=t-s}^{t-1} \left\| \tilde{B}_{i \to t} - \hat{B}_t \right\|_2^2 + \sum_{i=t+1}^{t+s} \left\| \tilde{B}_{i \to t} - \hat{B}_t \right\|_2^2. \tag{7}$$

When the background layers are recovered, they provide more accurate information for optical flow estimation.
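As a concrete reading of Eqs. (5)-(7), here is a sketch of the backward-warping operator W(·) and the alignment loss, assuming flow fields in pixel units and PyTorch's grid_sample as the sampler; the flow extractor G itself (a pretrained network fine-tuned during training, per the text above) is treated as a black box, and all tensor conventions are our own assumptions rather than the paper's.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp `img` toward the target view: W(I_i, C_{i->t}) in
    Eqs. (5)-(6). img: (B, C, H, W); flow: (B, 2, H, W) with
    flow[:, 0] = dx and flow[:, 1] = dy, both in pixels."""
    B, C, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=img.dtype),
        torch.arange(W, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]
    y = ys.unsqueeze(0) + flow[:, 1]
    # grid_sample expects sampling coordinates normalized to [-1, 1]
    grid = torch.stack((2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def flow_loss(neighbors, flows_I, I_t, bg_neighbors, flows_B, B_hat_t):
    """L_Flow of Eq. (7): warped neighbors should match the central frame,
    both in the rain domain (I) and in the estimated background domain (B)."""
    loss = I_t.new_zeros(())
    for I_i, C_i in zip(neighbors, flows_I):          # i in [t-s, t+s], i != t
        loss = loss + ((warp(I_i, C_i) - I_t) ** 2).sum()
    for B_i, C_i in zip(bg_neighbors, flows_B):
        loss = loss + ((warp(B_i, C_i) - B_hat_t) ** 2).sum()
    return loss
```

Under this reading, the joint optimization described earlier amounts to back-propagating L_Flow, together with the fidelity and consistency losses named in Fig. 2, into both the deraining networks and the flow estimator.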