Deep Learning of Human Visual Sensitivity in Image Quality Assessment Framework

Jongyoo Kim    Sanghoon Lee*
Department of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
{jongky, slee}@yonsei.ac.kr

*Corresponding author (e-mail: slee@yonsei.ac.kr). This work was supported by the ICT R&D program of MSIP/IITP [2017-0-00289, Development of a method for regulating human-factor parameters for reducing VR-induced sickness].

Abstract

Since human observers are the ultimate receivers of digital images, image quality metrics should be designed from a human-oriented perspective. Conventionally, a number of full-reference image quality assessment (FR-IQA) methods have adopted various computational models of the human visual system (HVS) from psychological vision science research. In this paper, we propose a novel convolutional neural network (CNN) based FR-IQA model, named Deep Image Quality Assessment (DeepQA), where the behavior of the HVS is learned from the underlying data distribution of IQA databases. Different from previous studies, our model seeks the optimal visual weight based on understanding of the database information itself, without any prior knowledge of the HVS. Through experiments, we show that the predicted visual sensitivity maps agree with human subjective opinions. In addition, DeepQA achieves state-of-the-art prediction accuracy among FR-IQA models.

1. Introduction

Predicting perceptual quality is the main goal of image quality assessment (IQA), which is applied in a wide range of image processing applications such as process evaluation, image and video encoding, and monitoring. Since human observers are the ultimate receivers of digital images and videos, quality metrics should be designed from a human-oriented perspective. Therefore, a great deal of effort has been made to develop IQA methods based on an analysis of the properties and mechanisms of the human visual system (HVS).

When a distorted image is perceived by the HVS, some error signals are emphasized and others are masked. Figs. 1(a) and (b) show an image distorted by JPEG compression and its objective error map. The distortions around the houses and in the sky regions are easily observable. However, those in textural regions (e.g., rocks) are less noticeable, even though there are many pixel-wise distortions, as shown in Fig. 1(b). Therefore, simple pixel-wise metrics such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE) do not correlate well with perceived quality. To conduct reliable IQA, it is necessary to understand human visual sensitivity, which explains the perceptual impact of artifacts according to the spatial characteristics of pixels.

[Figure 1. Examples of predicted sensitivity maps: (a) is a distorted image; (b) is an objective error map; (c) is a predicted perceptual error map. Darker regions in (b) indicate more pixel-wise distorted pixels, and those in (c) indicate perceptually more distorted ones.]

Based on these observations, many full-reference image quality assessment (FR-IQA) methods have adopted various computational models of the HVS from psychological vision science [4] and made assumptions about the HVS's behavior to predict perceptual quality [32, 34, 35]. However, since the majority of HVS models are complex and were designed under limited and refined conditions, it is difficult to assure the best performance when generalizing the HVS models to the practical IQA problem.
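For reference, the pixel-wise baselines discussed above (MSE and PSNR) can be written in a few lines. The minimal NumPy sketch below assumes 8-bit images; note that both metrics weight every pixel error equally, which is exactly the limitation that motivates learning a visual sensitivity map.

```python
# Minimal pixel-wise baselines: MSE and PSNR for 8-bit images as NumPy arrays.
import numpy as np

def mse(ref, dist):
    """Mean squared error between a reference and a distorted image."""
    return float(np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2))

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means less pixel-wise error."""
    err = mse(ref, dist)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)
```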
Recently, convolutional neural networks (CNNs) have become widely used in computer vision [11]. Beyond the classification framework, CNNs have been successfully used to generate image maps such as semantic segmentation maps [16] and depth maps [6]. Inspired by these works, we use a CNN to generate a visual sensitivity map, which refers to a weighting map describing the degree of visual importance of each pixel to the HVS.

The CNN model in our approach is dedicated to learning the HVS properties. Based on the objective error map, the model seeks the visual weight of each pixel. The predicted visual sensitivity map allocates local weights to pixels according to the local spatial characteristics of the distorted images. This approach is similar to the weighted pooling strategies adopted in FR-IQA methods [32, 35]. However, different from the previous works, our model finds the visual weights without any prior knowledge of the HVS, relying only on the dataset: triplets of a distorted image, its objective error map, and its ground-truth subjective score. Fig. 1(c) shows an example result of the proposed model. The dark regions indicate perceptually distorted pixels. Compared to the objective error map in (b), it is obvious that (c) emphasizes the visible distortions, such as the coding artifacts above the sea and around the houses.

We name the proposed method Deep Image Quality Assessment (DeepQA). Our contributions can be summarized as follows.

1. DeepQA learns the visual sensitivity of the HVS without any prior knowledge. Using a deep CNN, we find the visual weight of each pixel from triplets of a distorted image, its objective error map, and its ground-truth subjective score.

2. DeepQA yields perceptual error maps as intermediate results, which provide an intuitive analysis of local artifacts in a given distorted image.

3. We propose a novel deep CNN-based FR-IQA framework. Our model can be trained via end-to-end optimization and achieves state-of-the-art correlation with human subjective scores.

2. Related Work

2.1. Human Visual Sensitivity

Background luminance determines the sensitivity of the HVS, in a manner similar to the Weber-Fechner law. The contrast sensitivity function (CSF) describes how the sensitivity of the HVS varies with the spatial frequency of an image. The presence of texture reduces the visibility of image distortions, which is known as contrast masking; conversely, coding artifacts in homogeneous regions are more easily observed. To simulate the image representation in the visual cortex, subband decomposition models such as Gabor filters and steerable pyramids have been adopted.

Based on these observations, many FR-IQA metrics have been developed. There are two general strategies for FR-IQA: bottom-up and top-down frameworks. The former directly simulates the various processing stages of the HVS using computational models, as described above. The latter designs the overall functionality of IQA based on assumptions about the HVS. For example, the structural similarity index (SSIM) assumes that contrast and structural distortions are crucial to the HVS, and the feature similarity index (FSIM) assumes that phase congruency is the primary feature perceived by the HVS.

2.2. Machine Learning Based IQA Methods

Machine learning has mainly been adopted in no-reference image quality assessment (NR-IQA). Since no reference image is available in NR-IQA, researchers have tried to design elaborate features that can discriminate distorted images from pristine ones. One popular family of features is natural scene statistics (NSS), which assumes that natural scenes contain statistical regularities. Besides NSS, various other types of features have been developed for NR-IQA. On the other hand, machine learning has been partially adopted in FR-IQA. In [22], singular value decomposition features were adopted and regressed onto quality scores using support vector regression (SVR). Multi-method fusion (MMF) proposed combining multiple existing FR-IQA methods using machine learning to achieve state-of-the-art accuracy. In [23], multiple features were extracted from difference-of-Gaussian frequency bands and regressed onto quality scores.

More recently, there have been attempts to apply deep learning to the NR-IQA problem. Hou et al. used a deep belief network, where wavelet-domain NSS features were extracted and fed into the deep model. Kang et al. first applied a CNN to the NR-IQA problem without any hand-crafted features. Kim and Lee described a two-stage CNN-based NR-IQA model, where local quality scores generated by an FR-IQA method were used as proxy patch labels. Liang et al. proposed a dual-path CNN-based FR-IQA model, which also handles non-aligned images similar to the reference image.

[Figure 2. The architecture of DeepQA. The model takes the distorted image (through Conv1-1 and Conv2-1, stride = 2) and the error map (through Conv1-2 and Conv2-2, stride = 2) as input; the two branches are concatenated and pass through Conv3, Conv4 (stride = 2), Conv5, and Conv6 to generate the sensitivity map. After multiplication with the error map, the pooled result is regressed onto the subjective score through FC1 and FC2.]

3. Sensitivity Map Prediction

3.1. Architecture

The most intuitive way to determine human visual sensitivity is to compare the energies of the error signal and the background signal, where the error signal corresponds to the objective error map and the background signal to the reference image. In the real world, however, the HVS observes only a distorted image and cannot know the error signal. We therefore test two scenarios in this paper. First, DeepQA takes both the distorted image and the error map as input. Second, DeepQA-s is a simpler version that takes only the distorted image as input.

The architecture of DeepQA is depicted in Fig. 2. DeepQA-s contains only the branch starting from the input distorted image (Conv1-1 and Conv2-1), without the concatenation layer. Inspired by recent work [30], deep convolutional networks with 3×3 filters are used for both models. To generate a sensitivity map without losing the positional information of pixels, the models contain only convolutional layers. In DeepQA, the distorted image and the error map pass through separate convolutional layers and are concatenated after the second convolutional layer. To maintain the size of the feature maps through the convolution operations, zero-padding is applied around the borders before each convolution. Fig. 2 shows two strided convolutions used for downsampling; accordingly, the final output is 1/4 the size of the original input image, and the ground-truth error map is downscaled by 1/4 correspondingly. In DeepQA-s and DeepQA, every convolutional layer except Conv6 adopts the leaky rectified linear unit (LReLU) [18]. The Conv6 layer adopts the rectified linear unit (ReLU) [21], since the weights are positive real values. In addition, the bias of Conv6 is initialized to 1. At the end of the models, two fully connected layers regress the features onto the subjective score; here, the hidden and output layers use LReLU and ReLU, respectively. (A code sketch of this architecture is given at the end of Section 3.2.)

3.2. Image Normalization

Before being fed into the CNN, the distorted images are simply normalized. Let I_r be a reference image and I_d a distorted image. We first convert them to grayscale and rescale them to the range [0, 1]. Their low-pass filtered versions are then subtracted from them, and the normalized images are denoted by Î_r and Î_d. This is because the HVS is insensitive to changes in the low-frequency band: the contrast sensitivity function (CSF) has a band-pass filter shape peaking at around four cycles per degree, and sensitivity drops rapidly at low frequencies [4].
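A minimal sketch of this normalization is given below. The section does not name the low-pass filter or its width, so the Gaussian filter, its `sigma`, and the BT.601 grayscale weights are illustrative assumptions.

```python
# Sketch of the Section 3.2 normalization: grayscale -> [0, 1] -> subtract a
# low-pass filtered version, keeping the higher-frequency content to which
# the HVS is more sensitive.
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_image(img_rgb, sigma=3.0):
    """Return the normalized image I_hat = I - lowpass(I).

    The Gaussian low-pass and sigma are assumptions; the paper only states
    that a low-pass filtered version is subtracted.
    """
    # Luminance conversion (ITU-R BT.601 weights, an illustrative choice).
    gray = img_rgb.astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    gray /= 255.0  # rescale to [0, 1]
    return gray - gaussian_filter(gray, sigma=sigma)
```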
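Returning to Section 3.1, the PyTorch sketch below mirrors the layer arrangement of Fig. 2: 3×3 filters with zero-padding, stride-2 downsampling at the second and fourth convolution stages, LReLU activations with a ReLU Conv6 whose bias is initialized to 1, and a two-layer regressor. The channel widths (32/64), the regressor's hidden size, and the use of average pooling to downscale the error map by 1/4 are illustrative assumptions read off the garbled figure labels, not confirmed values; the Hadamard product and border cropping anticipate Section 3.3. DeepQA-s would simply drop the error-map branch and the concatenation.

```python
# A sketch of the DeepQA architecture of Fig. 2 (assumed widths, see lead-in).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepQA(nn.Module):
    def __init__(self):
        super().__init__()
        # Branch for the distorted image.
        self.conv1_1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2_1 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        # Branch for the objective error map.
        self.conv1_2 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2_2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        # Fused trunk; Conv4 applies the second stride-2 downsampling (1/4 total).
        self.conv3 = nn.Conv2d(64, 64, 3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, 3, stride=2, padding=1)
        self.conv5 = nn.Conv2d(64, 64, 3, padding=1)
        self.conv6 = nn.Conv2d(64, 1, 3, padding=1)
        nn.init.constant_(self.conv6.bias, 1.0)  # bias of Conv6 initialized to 1
        # Nonlinear regression f(mu_p) onto the subjective score (hidden width assumed).
        self.fc1 = nn.Linear(1, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, img, err):
        a = F.leaky_relu(self.conv2_1(F.leaky_relu(self.conv1_1(img))))
        b = F.leaky_relu(self.conv2_2(F.leaky_relu(self.conv1_2(err))))
        x = torch.cat([a, b], dim=1)          # concatenation after the 2nd conv
        x = F.leaky_relu(self.conv3(x))
        x = F.leaky_relu(self.conv4(x))
        x = F.leaky_relu(self.conv5(x))
        s = F.relu(self.conv6(x))             # sensitivity map at 1/4 resolution
        # Perceptual error map: Hadamard product with the 1/4-scale error map
        # (average pooling as an assumed downscaling method).
        p = s * F.avg_pool2d(err, kernel_size=4)
        # Ignore 4 pixels at each border (cf. Section 3.3), then average.
        mu_p = p[:, :, 4:-4, 4:-4].mean(dim=(1, 2, 3))
        score = F.relu(self.fc2(F.leaky_relu(self.fc1(mu_p.unsqueeze(1)))))
        return score, s, p
```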
3.3. Sensitivity Map Prediction

To make the model learn to generate sensitivity maps, we utilize sets of distorted images, their objective error maps, and the corresponding ground-truth scores. Rather than simply averaging the objective error map, the sensitivity map weights each pixel of the error map according to the HVS. First, we define the objective error map using a normalized log-difference function:

    e = log(1 / ((Î_r − Î_d)² + ε/255²)) / log(255²/ε),    (1)

where ε = 1 is used in the experiments. The visual sensitivity maps are obtained from the CNNs as

    s₁ = CNN₁(Î_d; θ₁),    (2)
    s₂ = CNN₂(Î_d, e; θ₂),    (3)

where CNN₁(·) and CNN₂(·) denote the CNN models of DeepQA-s and DeepQA with parameters θ₁ and θ₂, respectively. The perceptual error map is then defined by p = s ⊙ e, where ⊙ denotes the Hadamard product and s is s₁ or s₂. Since zeros are padded before each convolution, the feature maps near the borders tend to be zero. To alleviate this problem, we ignore the pixels near the borders of the perceptual error map. In the experiments, four rows and columns at each border are excluded, which partially compensates for the information loss. The pooled score is therefore obtained by averaging over the cropped perceptual error map:

    µ_p = (1 / ((H − 8)(W − 8))) Σ_{(i,j)∈ω} p_{i,j},    (4)

where H and W are the height and width of p, respectively, (i, j) denotes a pixel index, and ω denotes the cropped region. Since a linear relationship between the pooled score and the subjective score cannot be guaranteed, an additional nonlinear regression is performed using the fully connected layers. The final objective function is then defined as

    L_s(Î_d; θ) = ‖f(µ_p) − S‖²_F,    (5)

where f(·) is the nonlinear regression function, S is the ground-truth subjective score of the input distorted image, and θ is θ₁ for DeepQA-s or θ₂ for DeepQA.

3.4. Total Variation Regularization

When the model is optimized to minimize (5) without any constraint, it generates sensitivity maps resembling high-frequency noise. To avoid this, a smoothness constraint is applied to the sensitivity map. We adopt the total variation (TV) L2-norm, since it can penalize the high-frequency components of the sensitivity map during the optimization of the CNN. Similar to [19], we define the TV regularization as

    TV(s) = (1 / (H·W)) Σ_{(i,j)} (sobel_h(s)_{i,j}² + sobel_v(s)_{i,j}²)^{β/2},    (6)

where H and W are the height and width of the predicted sensitivity map, respectively, sobel_h and sobel_v denote the Sobel operations along the horizontal and vertical directions, and β = 3 in the experiments.

3.5. Training Scheme

For better convergence of the optimization, the adaptive moment estimation (ADAM) optimizer [10] with Nesterov momentum [5] is used in place of the conventional stochastic gradient descent method. The learning rate is initially set to 5 × 10⁻⁴. To balance the regression loss (5) and the TV regularization (6), we multiply them by 10³ and 10⁻², respectively. The effect of the TV regularization is further discussed in Section 4.2. In addition, L2 regularization (with an L2 penalty multiplied by 5 × 10⁻³) is applied to all layers.

3.5.1 Patch-based Approach

To train DeepQA on a GPU, the dimensions of the input images need to be fixed. Therefore, to train the model on the LIVE IQA database [27], which contains images of various sizes, we split the input images into patches of fixed size. Here, overlapping regions must be avoided when the perceptual error map is reconstructed. Accordingly, the stride of the sliding window is determined by stride_patch = dim_patch − (N_ign × 2 × R), where N_ign is the number of ignored border pixels and R is the size ratio between the input and the perceptual error map. In the experiments on the LIVE IQA database, the number of ignored pixels is 4, the patch size is 112 × 112, and the sliding stride is 80 × 80. Moreover, in the training stage, all patches constituting one image should be included in the same mini-batch, so that µ_p in (4) can be derived from the reconstructed perceptual error map.
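To ground the formulas above, here is a NumPy sketch of Eqs. (1), (4), and (6). The function names are illustrative; SciPy's Sobel operator stands in for sobel_h and sobel_v, and the trailing comment works out the patch stride of Section 3.5.1.

```python
# Sketch of the objective error map (Eq. 1), pooled perceptual error (Eq. 4),
# and Sobel-based TV regularizer (Eq. 6).
import numpy as np
from scipy.ndimage import sobel

def objective_error_map(ref_hat, dist_hat, eps=1.0):
    """Eq. (1): normalized log-difference between the normalized images."""
    diff2 = (ref_hat - dist_hat) ** 2
    return np.log(1.0 / (diff2 + eps / 255.0 ** 2)) / np.log(255.0 ** 2 / eps)

def pooled_score(p, n_ign=4):
    """Eq. (4): mean of the perceptual error map p = s * e, ignoring
    n_ign rows/columns at each border to discount zero-padding effects."""
    return float(p[n_ign:-n_ign, n_ign:-n_ign].mean())

def tv_regularizer(s, beta=3.0):
    """Eq. (6): mean Sobel gradient magnitude raised to beta, penalizing
    high-frequency content in the sensitivity map s."""
    gh = sobel(s, axis=1)  # horizontal derivative
    gv = sobel(s, axis=0)  # vertical derivative
    return float(np.mean((gh ** 2 + gv ** 2) ** (beta / 2.0)))

# Patch stride from Section 3.5.1: with a 112x112 patch, n_ign = 4 ignored
# pixels, and scale ratio R = 4, the stride is 112 - 4 * 2 * 4 = 80, so the
# reconstructed perceptual error maps of neighboring patches do not overlap.
```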
4. Experimental Results

4.1. Datasets

Four different IQA databases were used to evaluate the proposed algorithm: LIVE IQA [27], CSIQ [12], TID2008 [25], and TID2013 [24]. The LIVE IQA database contains 29 reference images and 982 distorted images with five distortion types: JPEG and JPEG2000 (JP2K) compression, additive white Gaussian noise (WN), Gaussian blur (BLUR), and Rayleigh fast-fading channel distortion (FF). The CSIQ database includes 30 reference images and 866 distorted images with six distortion types: JPEG, JP2K, WN, Gaussian blur (GB), pink Gaussian noise (PGN), and global contrast decrements (CTD). TID2008 contains 25 reference images and 1700 distorted images over 17 different distortion types at four degradation levels, while TID2013 extends this to 24 distortion types at five degradation levels. In the experiments, the ground-truth subjective scores were rescaled to the range [0, 1]. For the differential mean opinion score (DMOS) values (in LIVE IQA and CSIQ), the scale was reversed so that larger values indicate perceptually better images. To evaluate the performance of the IQA algorithms, we used two standard measures, the Spearman rank-order correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC), computed following [1].

[Figure 3. Examples of the predicted sensitivity maps with various TV regularization weights: (a) and (f) show distorted images; (b)-(e) and (g)-(j) are their predicted sensitivity maps with w_TV = 10⁻⁴, 10⁻³, 10⁻², and 10⁻¹, respectively.]

[Figure 4. Comparison of SRCC and PLCC curves over 80 training epochs according to the degree of the TV regularization weight.]

4.2. Effects of TV Regularization

To analyze the effects of TV regularization, we tested four different weights (w_TV = 10⁻⁴, 10⁻³, 10⁻², 10⁻¹) for the TV regularization during training. Fig. 3 shows the predicted sensitivity maps with different degrees of the TV regularization weight. When the weight was very small, the sensitivity map was too detailed, which does not agree well with the HVS. As the weight w_TV increased (from (b) to (e), and from (g) to (j)), the sensitivity map tended to become smoother. In addition, the perceptually less and more distorted regions became clearly distinguishable, as shown in (e). However, since the TV regularization promotes piecewise smoothing, black spots also increased, as shown in (e).

To check whether the TV regularization affects the prediction accuracy, SRCC and PLCC over 80 epochs with the four settings are drawn in Figs. 4(a) and (b). SRCC and PLCC were obtained from the testing set every 2 epochs. When w_TV = 10⁻⁴, the SRCC and PLCC were slightly lower than the others, but there were no significant differences among the different degrees of the TV regularization weight.

4.3. Sensitivity Map Prediction

To validate whether DeepQA agrees with the HVS, the predicted sensitivity maps and the perceptual error maps are shown in Fig. 5. Here, DeepQA was trained with w_TV = 10⁻². The distorted images with four different artifact types (JPEG2000, JPEG, WN, and GB) are shown in (a), (e), (i), and (m). Figs. (b), (f), (j), and (n) are the objective error maps obtained from (1); Figs. (c), (g), (k), and (o) are the predicted sensitivity maps; and Figs. (d), (h), (l), and (p) are the perceptual error maps. The darker regions indicate more distorted pixels. In (a), the distortion around the houses is more noticeable than that on the rocks, as shown in (d). For JPEG distortion, the banding artifact in the sky regions is emphasized in (h). In the case of additive white noise, the objective error is uniformly distributed over the image, as shown in (j); in the perceptual error map, the distortion in the homogeneous regions is more noticeable than that in the textural regions, as shown in (l), which agrees with contrast masking and the CSF. When the image is distorted by Gaussian blur, strong edges are especially distorted, as shown in (n), and the perceptual error map shows a similar tendency, as shown in (p).

[Figure 5. Examples of the predicted sensitivity maps: (a), (e), (i), and (m) are distorted images with JPEG2000, JPEG, white noise, and Gaussian blur; (b), (f), (j), and (n) are the objective error maps; (c), (g), (k), and (o) are the predicted sensitivity maps; (d), (h), (l), and (p) are the perceptual error maps.]

Fig. 6 shows the perceptual error maps for white noise and Gaussian blur at different distortion levels, where the first row corresponds to white noise and the second row to Gaussian blur. When strong white noise is present, the perceptual error map loses structural details, as shown in (e). Conversely, as the Gaussian blur increases, the distortion on strong edges becomes more prominent. In general, the mean value of the perceptual error map decreases as the degree of distortion increases, which indicates that DeepQA makes reasonable quality predictions according to the degree of distortion.

[Figure 6. Examples of the perceptual error maps for white noise and Gaussian blur at different distortion levels: (a)-(e) are white noise distortions with mean perceptual error values (µ_p) of 0.8037, 0.7125, 0.4260, 0.3657, and 0.0125; (f)-(j) are Gaussian blur distortions with values of 0.7361, 0.5696, 0.5299, 0.4816, and 0.3630.]

4.4. Performance Comparison

To evaluate the performance of DeepQA, we randomly divided the reference images into two subsets (80% for training and 20% for testing), and the distorted images were divided in the same way so that there was no overlap of content between the two sets. To increase the number of training samples, horizontally flipped images were additionally included. DeepQA was trained in a non-distortion-specific way, i.e., using all distortion types simultaneously. Training was iterated for 80 epochs, and the model with the lowest validation error was then chosen; as shown in Fig. 4, the prediction accuracy mostly saturates after 50 epochs. The correlation coefficients of DeepQA were obtained over 20 repetitions of randomly splitting the training and testing sets, to eliminate performance bias.

DeepQA was compared with eight FR-IQA metrics: PSNR, SSIM [32], MS-SSIM [33], VIF [28], GMSD [34], FSIMc [35], DOG-SSIMc [17], and FR-DCNN [14]. In addition, four deep-learning-based NR-IQA methods were compared: SESANIA [7], CNN [8], the "Patchwise" method in [2], and BIECON [9].
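For concreteness, the sketch below illustrates the evaluation protocol just described: a content-wise train/test split over reference images, the two correlation measures from Section 4.1 via SciPy, and the size-weighted cross-database average used in the last column of Table 1. The helper names are illustrative, not the authors' code.

```python
# Sketch of the evaluation protocol: content-wise splitting, SRCC/PLCC, and
# the weighted cross-database average.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def srcc_plcc(pred, mos):
    """Spearman rank-order and Pearson linear correlation coefficients."""
    return spearmanr(pred, mos).correlation, pearsonr(pred, mos)[0]

def split_by_reference(ref_ids, train_ratio=0.8, rng=None):
    """80/20 split over *reference* images, so distorted versions of the same
    content never appear in both the training and testing sets."""
    rng = rng or np.random.default_rng()
    refs = rng.permutation(np.unique(ref_ids))
    train_refs = refs[: int(round(train_ratio * len(refs)))]
    train_mask = np.isin(ref_ids, train_refs)
    return train_mask, ~train_mask

def weighted_average(scores, n_images):
    """Cross-database average weighted by the number of distorted images
    (982 for LIVE IQA, 866 for CSIQ, 1700 for TID2008, per Section 4.1)."""
    return float(np.average(np.asarray(scores), weights=np.asarray(n_images)))
```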
In Table 1, the SRCC and PLCC of the IQA algorithms on the four databases are compared. In the last column, the weighted averages of SRCC and PLCC over the four databases are reported, where each weight is proportional to the number of distorted images in the corresponding database. The top three models for each evaluation criterion are shown in boldface in the original table, and the reported SRCC and PLCC scores of the deep-learning-based models are taken from their original papers. When all distortion types are considered, DeepQA achieves the highest SRCC and PLCC, followed by DOG-SSIMc. DeepQA outperforms the other metrics on all the databases, while DeepQA-s achieves competitive performance only on the LIVE IQA and CSIQ databases. This observation suggests that taking the error map as input helps the CNN model extract more useful features, which leads to higher accuracy.

Table 1. SRCC and PLCC comparison on the four IQA databases. FR (NR) denotes full-reference (no-reference) models; italics in the original mark deep-learning-based methods.

Type  Model        LIVE IQA       CSIQ           TID2008        TID2013        Weighted avg.
                   SRCC   PLCC    SRCC   PLCC    SRCC   PLCC    SRCC   PLCC    SRCC   PLCC
FR    PSNR         0.876  0.872   0.806  0.800   0.553  0.573   0.636  0.706   0.666  0.704
      SSIM         0.948  0.945   0.876  0.861   0.775  0.773   0.637  0.691   0.745  0.767
      MS-SSIM      0.951  0.949   0.913  0.899   0.854  0.657   0.786  0.833   0.842  0.809
      VIF          0.963  0.960   0.920  0.928   0.749  0.808   0.677  0.772   0.765  0.826
      GMSD         0.960  0.960   0.957  0.954   0.891  0.879   0.804  0.859   0.867  0.890
      FSIMc        0.960  0.961   0.931  0.919   0.884  0.876   0.851  0.877   0.884  0.893
      DOG-SSIMc    0.963  0.966   0.954  0.943   0.935  0.937   0.926  0.934   0.937  0.940
      FR-DCNN      0.975  0.977   -      -       -      -       -      -       -      -
      DeepQA-s     0.977  0.975   0.957  0.956   0.878  0.892   0.766  0.818   0.848  0.876
      DeepQA       0.981  0.982   0.961  0.965   0.947  0.951   0.939  0.947   0.949  0.955
NR    SESANIA      0.934  0.948   -      -       -      -       -      -       -      -
      CNN          0.956  0.953   -      -       -      -       -      -       -      -
      Patchwise    0.960  0.972   -      -       -      -       0.835  0.855   -      -
      BIECON       0.958  0.960   -      -       -      -       -      -       -      -

Table 2 shows the SRCC comparison for the individual distortion types in the LIVE IQA and TID2008 databases. Even when each distortion type is tested separately, DeepQA generally achieves competitive accuracy for most of the distortion types. Since DeepQA uses grayscale images, its performance is lower for local block-wise distortions (Block), where color change is an important cue for the distortion. In addition, since the normalization process discards low-frequency components, DeepQA shows a lower correlation for mean shift (MS), where the global luminance changes uniformly. Overall, DeepQA achieves competitive and consistent accuracy over all the databases.

Table 2. SRCC comparison of individual distortion types on the LIVE IQA and TID2008 databases. Italics in the original mark deep-learning-based methods.

          Dist. type  PSNR   SSIM   MS-SSIM  VIF    GMSD   FSIMc  DeepQA-s  DeepQA
LIVE IQA  JP2K        0.895  0.961  ...

References

[1] Final report from the video quality experts group on the validation of objective models of video quality assessment, phase II. VQEG, 2003.
[2] S. Bosse, D. Maniry, T. Wiegand, and W. Samek. A deep neural network for image quality assessment. In IEEE International Conference on Image Processing (ICIP), pages 3773-3777, 2016.
[3] C.-H. Chou and Y.-C. Li. A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Trans. Circuits Syst. Video Technol., 5(6):467-476, 1995.