$$L = 116\, f\!\left(\frac{Y}{Y_w}\right) - 16,$$

$$a = 500\left[ f\!\left(\frac{X}{X_w}\right) - f\!\left(\frac{Y}{Y_w}\right) \right],$$

$$b = 200\left[ f\!\left(\frac{Y}{Y_w}\right) - f\!\left(\frac{Z}{Z_w}\right) \right],$$
where R, G, and B are the red, green, and blue components of a pixel, X, Y, and Z are the CIE XYZ tristimulus values, and L, a, and b are the color lightness and the two chromaticity coordinates, respectively. X_w, Y_w, and Z_w are the CIE XYZ tristimulus values of the reference white point, and f(t) is calculated by the following rule:
$$f(t) = \begin{cases} t^{1/3}, & t > 0.008856, \\[4pt] 7.787\,t + \dfrac{16}{116}, & \text{otherwise}, \end{cases}$$
and the L component is then taken for image representation (Figure (c)). The Integer Wavelet Transform (IntWT) provides an approximation of the original image that is more robust against signal processing attacks. Therefore, we finally apply a one-level IntWT to the L component and take the low-frequency subband (LL) as the semantic perceptual image (Figure (d)), from which multiple types of features are extracted for hash generation.
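As a concrete illustration, the sketch below (Python/NumPy) follows this preprocessing chain. The sRGB-to-XYZ matrix, the white point normalized to Y_w = 1, and the integer Haar lifting used as the one-level IntWT are assumptions made for illustration only; the paper's exact choices are not reproduced here.

```python
import numpy as np

def f(t):
    # Piecewise rule from the L*a*b* conversion above.
    return np.where(t > 0.008856, np.cbrt(t), 7.787 * t + 16.0 / 116.0)

def L_component(rgb):
    """L channel of an RGB image with values in [0, 1].

    Assumes the sRGB-to-XYZ matrix and a reference white normalized to Y_w = 1;
    the paper's exact RGB-to-XYZ equation is not reproduced here.
    """
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    Y = (rgb.reshape(-1, 3) @ M.T)[:, 1].reshape(rgb.shape[:2])
    return 116.0 * f(Y) - 16.0

def intwt_LL(x):
    """LL subband of a one-level integer Haar (S-transform) decomposition.

    Assumes even height and width; floor division keeps coefficients integral.
    """
    x = np.rint(x).astype(np.int64)
    rows = (x[:, 0::2] + x[:, 1::2]) // 2        # horizontal low-pass
    return (rows[0::2, :] + rows[1::2, :]) // 2  # vertical low-pass

# Usage: LL = intwt_LL(L_component(img)); features are then extracted from LL.
```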
2.2. Hashing Learning. Suppose there are n images in the given set, represented as {x_i}, i = 1, 2, ..., n, where x_i ∈ R^D is the feature vector. For each image, we extract V types of features. The task of multiview perceptual image hashing is to learn hash functions by simultaneously utilizing the feature matrices X^(1), X^(2), ..., X^(V), with X^(v) = [x_1^(v), x_2^(v), ..., x_n^(v)] being the feature matrix of the v-th view. Let X = [x_1 : x_2 : ⋅⋅⋅ : x_n] denote the combined multiview feature matrix, whose i-th column stacks the V feature vectors of image i, where X ∈ R^{D×n}, D = ∑_{v=1}^{V} D_v, and D_v is the dimension of the v-th type of feature. The goal of our algorithm is to learn hash functions that map X ∈ R^{D×n} to a compact K × n representation B in a low-dimensional Hamming space, where K is the hash length in digits.
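As a minimal sketch of this concatenation, assuming the combined matrix stacks the V view matrices vertically (equivalent to concatenating each image's V feature vectors into its column); the dimensions and data below are illustrative placeholders:

```python
import numpy as np

# X_views[v] is the D_v x n feature matrix of the (v+1)-th view (placeholder data).
n = 1000
X_views = [np.random.randn(64, n),    # view 1: D_1 = 64
           np.random.randn(128, n),   # view 2: D_2 = 128
           np.random.randn(32, n)]    # view 3: D_3 = 32

X = np.vstack(X_views)                # combined matrix, D x n with D = 64 + 128 + 32
assert X.shape == (224, n)
```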
Within this set, there are l labeled images, X_l, which are associated with at least one of the two categories M and C. Specifically, a pair (x_i, x_j) ∈ M is denoted as a perceptually similar pair when x_i and x_j are images related by content-preserving, non-malicious distortions and attacks. A pair (x_i, x_j) ∈ C is denoted as a perceptually dissimilar pair when the two samples are an original image and one that has suffered malicious manipulations or perceptually significant attacks, such as object insertion and removal. Let us denote the feature matrix formed by the corresponding columns of X as X_l ∈ R^{D×l}. Note that the feature matrices are normalized to be zero-centered.
We define a perceptual confidence measurement for each image example. The matrix S ∈ R^{l×l} incorporates the pairwise label information from X_l; its entry S_{ij} encodes the pairwise relationship of (x_i, x_j) and is defined as
$$S_{ij} = \begin{cases} 1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}, \\ -1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}, \\ 0, & \text{otherwise}. \end{cases}$$
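A small sketch of assembling S from the labeled pairs; the pair lists M_pairs and C_pairs are hypothetical inputs holding index pairs into X_l:

```python
import numpy as np

def build_S(l, M_pairs, C_pairs):
    """Pairwise label matrix: +1 for similar pairs, -1 for dissimilar pairs, 0 elsewhere."""
    S = np.zeros((l, l))
    for i, j in M_pairs:             # perceptually similar pairs (set M)
        S[i, j] = S[j, i] = 1.0
    for i, j in C_pairs:             # perceptually dissimilar pairs (set C)
        S[i, j] = S[j, i] = -1.0
    return S

# Example: S = build_S(4, M_pairs=[(0, 1)], C_pairs=[(2, 3)])
```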
Suppose we want to learn K hash functions leading to a K-digit representation B of X. For each digit k = 1, 2, ..., K, its hash function is defined as
$$h_k\!\left(\mathbf{x}_i\right) = \mathbf{w}_k^{T} \mathbf{x}_i,$$
where w_k ∈ R^D is the coefficient vector. Let W = [w_1, w_2, ..., w_K] ∈ R^{D×K}; the representation B of the feature matrix X for the image set is then
$$\mathbf{B} = \mathbf{W}^{T} \mathbf{X}.$$
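In code, this relaxed representation is a single matrix product; the additional sign step shown below is an assumption (a common way to obtain binary Hamming codes from the relaxed projections), not something stated at this point in the text:

```python
import numpy as np

def relaxed_codes(W, X):
    """Relaxed K x n representation B = W^T X."""
    return W.T @ X

def binary_codes(W, X):
    """Hypothetical binarization: element-wise sign of the projections (+1 / -1)."""
    return np.sign(W.T @ X)
```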
Our goal is to learn a W that simultaneously maximizes the empirical accuracy on the labeled images and the variance of the hash bits over all images. The empirical accuracy on the labeled images is defined as
$$J_1(\mathbf{W}) = \sum_{k} \left[ \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M}} S_{ij}\, h_k\!\left(\mathbf{x}_i\right) h_k\!\left(\mathbf{x}_j\right) + \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C}} S_{ij}\, h_k\!\left(\mathbf{x}_i\right) h_k\!\left(\mathbf{x}_j\right) \right].$$
The objective function for empirical accuracy can be represented as
$$J_1(\mathbf{W}) = \frac{1}{2} \operatorname{tr}\!\left[ \mathbf{W}^{T} \mathbf{X}_l\, \mathbf{S} \left( \mathbf{W}^{T} \mathbf{X}_l \right)^{T} \right].$$
Then, the empirical accuracy J_1(W) is presented as
$$J_1(\mathbf{W}) = \frac{1}{2} \operatorname{tr}\!\left[ \mathbf{W}^{T} \mathbf{X}_l\, \mathbf{S}\, \mathbf{X}_l^{T} \mathbf{W} \right].$$
Moreover, to maximize the information provided by each bit,
the variance of hash bits over all data X is also measured and
taken as a regularization term:
$$R(\mathbf{W}) = \sum_{k} \operatorname{var}\!\left[ h_k\!\left(\mathbf{X}\right) \right] = \sum_{k} \operatorname{var}\!\left[ \mathbf{w}_k^{T} \mathbf{X} \right].$$
Maximizing the above function with respect to W is still hard due to its nondifferentiability. As the maximum variance of a hash function is lower bounded by the scaled variance of the projected data, the information-theoretic regularization is represented as
$$J_2(\mathbf{W}) = \frac{1}{2} \operatorname{tr}\!\left[ \mathbf{W}^{T} \mathbf{X} \left( \mathbf{W}^{T} \mathbf{X} \right)^{T} \right].$$
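A matching sketch of this relaxed regularizer, assuming the columns of X have already been zero-centered as noted earlier:

```python
import numpy as np

def info_regularizer(W, X):
    """J_2(W) = 1/2 * tr(W^T X (W^T X)^T), the relaxed variance term."""
    P = W.T @ X                      # K x n projections of all data
    return 0.5 * np.trace(P @ P.T)
```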
Finally, the overall semi-supervised objective function combines the relaxed empirical fitness term from () and