To cut the computation from the order of 10 minutes to a second (three orders of magnitude), both the feature complexity and the modeling complexity must be reduced substantially while maintaining a level of accuracy reasonable for practical use in online tasks.
The integration of region segmentation and representative
color and texture extraction from the segments is a suitable
time-reduction strategy; however, sophisticated region segmentation methods themselves often do not run in real time.
Borrowing from the experiences gained in large-scale visual
similarity search, we use a fast image segmentation method
based on wavelets and K-Means clustering [32].
The low complexity of this segmentation method makes
it an attractive option for processing large numbers of
images. Unfortunately, this method is tuned toward recognizing scenes; we therefore expect it to be insufficient for recognizing individual objects, given the great variation with which a given type of object (for example, dogs) can appear in pictures.
Although object names are often assigned by the system, the
selection is mostly based on statistical correlation with
scenes. On the other hand, as pointed out by one reviewer,
different levels of performance may be possible under a
more controlled image set, such as various types of the
same object or images of the same domain. We will explore
this in the future.
After the region-based signatures are extracted from the
pictures, we encounter the essential obstacle: the segmenta-
tion-based signatures are unordered and of arbitrary lengths
across the picture collection, primarily because the number of
regions used to represent a picture often depends on how
complicated the composition of the picture is. No existing
statistical tools can handle the modeling in this scenario. The
key challenge to us, therefore, is to develop new statistical
methods for signature modeling and model matching when
the signatures are in the form of discrete distributions. The
details on these are provided in the following sections.
2.3 Image Signature
To form the signature of an image, two types of features are
extracted: color and texture. To extract the color part of the
signature, the RGB color components of each pixel are
converted to the LUV color components. The 3D color vectors
at all the pixels are clustered by K-Means. The number of
clusters in K-Means is determined dynamically by thresholding the average within-cluster distance. Arranging the
cluster labels of the pixels into an image according to the pixel
positions, we obtain a segmentation of the image. We refer to
the collection of pixels mapped to the same cluster as a region.
For each region, its average color vector and the percentage of
pixels it contains with respect to the whole image are
computed. The color signature is thus formulated as a discrete distribution $\{(v^{(1)}, p^{(1)}), (v^{(2)}, p^{(2)}), \ldots, (v^{(m)}, p^{(m)})\}$, where $v^{(j)}$ is the mean color vector, $p^{(j)}$ is the associated probability, and $m$ is the number of regions.
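To make the adaptive clustering concrete, the sketch below (Python; not the authors' implementation) illustrates one way such a color signature could be computed. The cluster cap, the distance threshold, and the use of scikit-learn's KMeans are our own illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def color_signature(pixels_luv, max_clusters=8, avg_dist_threshold=10.0):
    """Cluster LUV pixel vectors with K-Means, growing the number of
    clusters until the average within-cluster distance falls below a
    threshold, then return the signature {(v_j, p_j)}.

    pixels_luv: (n_pixels, 3) array of LUV color vectors.
    max_clusters and avg_dist_threshold are illustrative values,
    not taken from the paper.
    """
    n = len(pixels_luv)
    for k in range(1, max_clusters + 1):
        km = KMeans(n_clusters=k, n_init=5).fit(pixels_luv)
        # Average distance from each pixel to its assigned centroid.
        dists = np.linalg.norm(
            pixels_luv - km.cluster_centers_[km.labels_], axis=1)
        if dists.mean() < avg_dist_threshold:
            break
    # Each cluster is a region: mean color vector plus pixel fraction.
    counts = np.bincount(km.labels_, minlength=k)
    return [(km.cluster_centers_[j], counts[j] / n) for j in range(k)]
```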
We use wavelet coefficients in high-frequency bands to
form texture features. A Daubechies-4 wavelet transform [9]
is applied to the L component (intensity) of each image. The
transform decomposes an image into four frequency bands:
LL, LH, HL, HH. The LH, HL, and HH band wavelet
coefficients corresponding to the same spatial position in the
image are grouped into one 3D texture feature vector. If an
image contains $n_r \times n_c$ pixels, the total number of texture feature vectors is $\frac{n_r}{2} \cdot \frac{n_c}{2}$ due to the subsampling of the
wavelet transform. When forming the texture features, the
absolute values of the wavelet coefficients are used. K-Means
clustering is applied to the texture feature vectors to extract
the major modes of these vectors. Again, the number of
clusters is decided adaptively by thresholding the average within-cluster distance. As with color, the texture signature
is cast into a discrete distribution.
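A minimal sketch of the grouping of detail coefficients into texture feature vectors, assuming the PyWavelets package (pywt); the adaptive clustering step then proceeds exactly as for color:

```python
import numpy as np
import pywt  # PyWavelets; our assumed wavelet library

def texture_feature_vectors(intensity):
    """One-level Daubechies-4 decomposition of the L (intensity)
    channel. The detail subbands (LH, HL, HH) at each spatial
    position are stacked into one 3D vector; absolute coefficient
    values are used, as in the text. The result has roughly
    (n_r / 2) * (n_c / 2) rows (boundary handling may add a few).
    """
    _, (lh, hl, hh) = pywt.dwt2(intensity, 'db4')
    feats = np.stack([np.abs(lh), np.abs(hl), np.abs(hh)], axis=-1)
    return feats.reshape(-1, 3)
```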
Although we only involve color and texture in the current
image signature, other types of image features such as shape
and salient points can also be formulated into discrete
distributions, that is, bags of weighted vectors. For instance,
bags of SIFT features [17] are used to characterize and
subsequently detect advertisement logos in video frames [1].
As expected, our current image signature is not sensitive to
shape patterns. We choose to use color and texture features
because they are relatively robust for digital photos generated
by Internet users. Shape or salient point features may be more
appealing for recognizing objects. However, these features
are highly prone to corruption when the background is noisy,
object-viewing angle varies, or occlusion occurs, as is usually
the case. Moreover, semantics of an image sometimes cannot
be adequately expressed by a collection of object names.
Deriving image signatures that are robust and strong for
semantic recognition is itself a deep research problem that we
would like to explore in the future.
In general, let us denote images in the database by $\{\beta_1, \beta_2, \ldots, \beta_N\}$. Suppose every image is represented by an array of discrete distributions, $\beta_i = (\beta_{i,1}, \beta_{i,2}, \ldots, \beta_{i,d})$. Denote the space of $\beta_{i,l}$ by $\Omega_l$, $\beta_{i,l} \in \Omega_l$, $l = 1, 2, \ldots, d$. Then, the space of $\beta_i$ is the Cartesian product space
$$\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_d.$$
The dimension $d$ of $\Omega$, that is, the number of distributions contained in $\beta_i$, is referred to as the superdimension to distinguish it from the dimensions of the vector spaces on which these distributions are defined. For a fixed superdimension $j$, the distributions $\beta_{i,j}$, $i = 1, \ldots, N$, are defined on the same vector space, $\mathbb{R}^{d_j}$, where $d_j$ is the dimension of the $j$th sample space. Denote distribution $\beta_{i,j}$ by
$$\beta_{i,j} = \left\{ \bigl(v_{i,j}^{(1)}, p_{i,j}^{(1)}\bigr), \bigl(v_{i,j}^{(2)}, p_{i,j}^{(2)}\bigr), \ldots, \bigl(v_{i,j}^{(m_{i,j})}, p_{i,j}^{(m_{i,j})}\bigr) \right\}, \qquad (1)$$
where $v_{i,j}^{(k)} \in \mathbb{R}^{d_j}$, $k = 1, \ldots, m_{i,j}$, are vectors on which the distribution $\beta_{i,j}$ takes positive probability $p_{i,j}^{(k)}$. The cardinality of the support set for $\beta_{i,j}$ is $m_{i,j}$, which varies with both the image and the superdimension.
To further clarify the notation, consider the following
example. Suppose images are segmented into regions by
clustering 3D color features and 3D texture features respec-
tively. Suppose a region formed by segmentation with either
type of features is characterized by the corresponding mean
feature vector. For brevity, suppose the regions have equal
weights. Since two sets of regions are obtained for each image,
the superdimensionality is $d = 2$. Let the first superdimension correspond to color-based regions and the second to
texture-based regions. Suppose an image $i$ has four color regions and five texture regions. Then,
$$\beta_{i,1} = \left\{ \bigl(v_{i,1}^{(1)}, \tfrac{1}{4}\bigr), \bigl(v_{i,1}^{(2)}, \tfrac{1}{4}\bigr), \ldots, \bigl(v_{i,1}^{(4)}, \tfrac{1}{4}\bigr) \right\}, \quad v_{i,1}^{(k)} \in \mathbb{R}^3,$$
$$\beta_{i,2} = \left\{ \bigl(v_{i,2}^{(1)}, \tfrac{1}{5}\bigr), \bigl(v_{i,2}^{(2)}, \tfrac{1}{5}\bigr), \ldots, \bigl(v_{i,2}^{(5)}, \tfrac{1}{5}\bigr) \right\}, \quad v_{i,2}^{(k)} \in \mathbb{R}^3.$$
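In code, the signature of this example image could be held as nested lists of (vector, probability) pairs; the numeric values below are placeholders, not real features:

```python
import numpy as np

# beta_{i,1}: four color regions, equal weight 1/4, vectors in R^3.
beta_i1 = [(np.array([40.0,  5.0, -10.0]), 0.25),
           (np.array([75.0,  0.0,  12.0]), 0.25),
           (np.array([20.0, -8.0,   3.0]), 0.25),
           (np.array([90.0,  2.0,  -1.0]), 0.25)]
# beta_{i,2}: five texture regions, equal weight 1/5, vectors in R^3
# (identical placeholder vectors for brevity).
beta_i2 = [(np.array([0.5, 1.2, 0.3]), 0.20) for _ in range(5)]
# The signature beta_i is the array of the two distributions (d = 2);
# the support sizes m_{i,1} = 4 and m_{i,2} = 5 vary across images.
beta_i = (beta_i1, beta_i2)
```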