arXiv:2102.08168v2 [cs.CV] 7 Jan 2022
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
Just Noticeable Difference for Deep Machine Vision
Jian Jin, Member, IEEE, Xingxing Zhang, Xin Fu, Huan Zhang,
Weisi Lin, Fellow, IEEE, Jian Lou, Yao Zhao, Senior Member, IEEE
Abstract—As an important perceptual characteristic of the
Human Visual System (HVS), the Just Noticeable Difference
(JND) has been studied for decades in image and video
processing (e.g., perceptual visual signal compression). However,
the existence of a JND for Deep
Machine Vision (DMV) has been little explored, although the DMV has made great
strides in many machine vision tasks. In this paper, we make
an initial attempt and demonstrate that the DMV has a JND,
termed the DMV-JND. We then propose a JND model for
the image classification task in the DMV. We find
that the DMV can tolerate distorted images with an average PSNR
of only 9.56 dB (the lower the better) when the JND is generated via
unsupervised learning with the proposed DMV-JND-NET. In
particular, a semantic-guided redundancy assessment strategy
is designed to restrain the magnitude and spatial distribution
of the DMV-JND. Experimental results on image classification
demonstrate that we successfully find the JND for deep machine
vision. Our DMV-JND facilitates a possible direction for DMV-
oriented image and video compression, watermarking, quality
assessment, deep neural network security, and so on.
Index Terms—Just noticeable difference (JND), human visual
system (HVS), deep machine vision (DMV), image classification,
class activation mapping (CAM)
I. INTRODUCTION
THE unique psychological and physiological mechanisms
of the Human Visual System (HVS) make humans unable
to perceive certain changes in images and videos. This is due to
its underlying spatial-temporal sensitivities and masking properties
[1]. That is, images and videos have visual redundancy
for the HVS. The HVS-oriented Just Noticeable Difference
(JND), termed the HVS-JND, refers to the maximum
visual threshold of each pixel. Any change below the threshold
can be tolerated by the HVS. Commonly, this
property of the JND is regarded as the homogeneous property,
which exists across human perception, such as vision, hearing,
smell, touch, and taste. All changes below the JND form
a homogeneous range that leads to the same perception. The
This work was supported by Alibaba Group through Alibaba Innovative
Research (AIR) Program and Alibaba-NTU Singapore Joint Research Institute
(JRI), Nanyang Technological University, Singapore. (Corresponding author:
Weisi Lin.)
J. Jin, H. Zhang, and W. Lin are with the School of Computer Science
and Engineering, Nanyang Technological University, 639798, Singapore.
J. Jin and W. Lin are also with Alibaba-NTU Singapore Joint Research
Institute, Nanyang Technological University, 639798, Singapore. E-mail:
jian.jin@ntu.edu.sg, huan.zhang@siat.ac.cn, wslin@ntu.edu.sg.
X. Zhang is with the Department of Computer Science and
Technology, Tsinghua University, Beijing 100084, China. E-mail:
xxzhang2020@mail.tsinghua.edu.cn.
X. Fu and Y. Zhao are with the Institute of Information Science, Beijing
Jiao Tong University, Beijing 100044, China, and also with the Beijing
Key Laboratory of Advanced Information Science and Network Technology,
Beijing 100044, China. E-mail: {xinfu and yzhao}@bjtu.edu.cn.
J. Lou is with the Alibaba cloud business group, department of video cloud,
Alibaba, Hangzhou 310052, China. Email: jianedwardlou@gmail.com.
Fig. 1. Relative Classification Accuracy (RCA) comparison between a
DMV-JND distorted image and a White Gaussian Noise (WGN) distorted
image. After adding the DMV-JND (generated via our proposed DMV-JND
model) and WGN (with the same amount of noise) to the original image, we
get 100% and 15.55% RCA on the CIFAR-10 dataset, respectively.
homogeneous property reflects the sensitivity characteristics
of human perception, which has led to the HVS-JND being
widely used in image and video processing, such as perceptual
visual signal compression [1], quality-of-experience (QoE) in
video streaming services [2], watermarking [3], error resilience
[4], super resolution [5], graphic rendering [6], and so on.
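The comparison reported in Fig. 1 can be illustrated with a minimal sketch. It is not the authors' evaluation code, and it rests on two assumptions that the text so far does not spell out: "the same amount of noise" is read as equal mean squared error, and RCA is read as the share of images whose predicted label is unchanged by the distortion. The function names (`psnr`, `matched_wgn`, `relative_classification_accuracy`) and the random stand-in for a JND map are hypothetical.

```python
import numpy as np


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def matched_wgn(perturbation, rng):
    """White Gaussian noise rescaled to the mean squared value of `perturbation`.

    Assumption: "same amount of noise" means equal MSE.
    """
    noise = rng.standard_normal(perturbation.shape)
    target_rms = np.sqrt(np.mean(perturbation.astype(np.float64) ** 2))
    return noise * target_rms / np.sqrt(np.mean(noise ** 2))


def relative_classification_accuracy(pred_original, pred_distorted):
    """Assumed definition: fraction of predictions unchanged by the distortion."""
    pred_original = np.asarray(pred_original)
    pred_distorted = np.asarray(pred_distorted)
    return float(np.mean(pred_original == pred_distorted))


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
perturbation = rng.uniform(-20.0, 20.0, size=image.shape)  # stand-in for a JND map
noise = matched_wgn(perturbation, rng)

# No clipping is applied, so both distortions share the same MSE and PSNR.
psnr_jnd = psnr(image, image + perturbation)
psnr_wgn = psnr(image, image + noise)

# Hypothetical label predictions for four images before and after distortion.
rca = relative_classification_accuracy([3, 5, 1, 1], [3, 5, 1, 0])  # 0.75
```

Matching the noise energy in this way means any gap in RCA between the two distortions comes from where the perturbation is placed, not from how large it is.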
With massive data and high-performance GPU hardware,
Deep Machine Vision (DMV) has made breakthroughs in
many machine vision tasks, such as image classification [7],
object detection [8], person re-identification [9], and so on.
As a result, the ultimate receiver and appreciator of an
increasingly large number of images and videos is changing from
the HVS to the DMV. Many image and video processing
applications are now developed for the DMV, and we naturally
wonder: does the DMV have a JND? Unlike the HVS-JND,
which aims to find the visual redundancy for the HVS, the JND
for the DMV aims to find the redundancy of images and videos
for deep machine vision by considering the effects of such
redundancy on DMV tasks. If the DMV has a JND, that
JND will greatly benefit DMV-oriented visual
computing applications. For instance, it would help to design
novel codecs for DMV-oriented image and video compression
[10] via a DMV-JND inspired bit allocation strategy:
fewer bits are assigned to pixels with higher
redundancy for the DMV, while more bits are assigned
to pixels with lower redundancy, so as to achieve an overall bit
saving. Besides, it may provide a novel perspective for a
wider scope, e.g., DMV-oriented quality evaluation for natural