LU et al.: REAL-TIME, ADAPTIVE, AND LOCALITY-BASED GRAPH PARTITIONING METHOD FOR VIDEO SCENE CLUSTERING 1749
within a cluster and the distinction of shots between different
clusters. On the other hand, considering that the video data are
normally obtained and viewed sequentially, a sequential graph
partitioning approach is also devised for long video sequences,
with the aim of further reducing the computational complexity
without much degradation of the clustering accuracy.
Preliminary versions of our related work have been reported
in [12] and [34]. This paper extends our contribution in both
the technical and evaluation parts. First, the scene likeness
was determined by a fixed threshold in [34], or by minimizing
the average probability of clustering error in the maximum a
posteriori sense in [12]. In this paper, an adaptive peer-group
filtering (PGF) method is proposed to identify the shots that
are visually similar to each individual shot, which will be
described in detail in Section II-A. Further-
more, a sequential graph partitioning approach combined with
the local PGF method for processing long video sequences is
proposed. The proposed method is also compared with other
methods such as k-means clustering in [34] and the normalized
cuts method in [12] and [34]. The comparison is now based on
the Minkowski measure considering all obtained clusters. The
computational complexity analysis of the proposed method is
also compared with other existing methods.
The rest of this paper is organized as follows. In Section II,
we present the PGF scheme to identify the similar shots of
each individual shot, and then construct the scene likeness
matrix of the video under analysis. Section III formulates the
clustering of video scenes as a graph partitioning problem. We
then compare the computational complexity of the proposed
approach against those of conventional k-means clustering and
normalized cuts in Section IV, and describe the sequential
partitioning approach in Section V. Experimental simulations
and results are presented in Section VI, and this paper is
concluded in Section VII.
II. Scene Likeness of Shots
To group the shots of a given video into shot clusters with
disparate scenes, we first examine the scene likeness among
shots. To begin with, we segment the video into a sequence of
shots S = {S_1, S_2, ..., S_M}, say M of them, by using a popular
shot-boundary detector [36], [37]. Each shot is considered a
basic unit for processing, and its shot color histogram (SCH)—
the bin-wise median of all the frame color histograms within
the shot [38]—is computed as the shot feature for scene
clustering. A frame color histogram records the percentage of
each quantized color among all pixels in a frame. In this paper,
we compute the frame (and hence the shot) color histograms
in hue-saturation-value color space with its color coordinates
uniformly quantized into 12 (hue), 4 (saturation), and 4 (value)
bins, respectively, resulting in a total of 192 bins (or quantized
colors). The color space and quantization scheme are chosen
for their good performance, as also reported in other works on
content-based image and video analysis [39], [40].
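The SCH extraction described above can be sketched as follows, assuming HSV frames with H in [0, 360) and S, V in [0, 1); the function names and array conventions are illustrative, not from the paper:

```python
import numpy as np

H_BINS, S_BINS, V_BINS = 12, 4, 4  # 12 x 4 x 4 = 192 quantized colors

def frame_color_histogram(hsv_frame):
    """Normalized 192-bin color histogram of one HSV frame.

    hsv_frame: (height, width, 3) array; each bin records the
    percentage of pixels falling into that quantized color.
    """
    h, s, v = hsv_frame[..., 0], hsv_frame[..., 1], hsv_frame[..., 2]
    # Uniformly quantize each channel, then flatten to one bin index in [0, 192).
    hi = np.minimum((h / 360.0 * H_BINS).astype(int), H_BINS - 1)
    si = np.minimum((s * S_BINS).astype(int), S_BINS - 1)
    vi = np.minimum((v * V_BINS).astype(int), V_BINS - 1)
    bins = (hi * S_BINS + si) * V_BINS + vi
    hist = np.bincount(bins.ravel(), minlength=H_BINS * S_BINS * V_BINS)
    return hist / hist.sum()

def shot_color_histogram(frames):
    """SCH: bin-wise median of all frame color histograms within the shot [38]."""
    return np.median([frame_color_histogram(f) for f in frames], axis=0)
```

Note that because the SCH takes a bin-wise median, it need not sum exactly to 1, unlike the individual frame histograms.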
Let X = {X_1, X_2, ..., X_M} denote the SCHs of the M
video shots. The visual similarity between two shots S_i and
S_j is judged based on the intersection of their SCHs, defined
as I(S_i, S_j) = Σ_{k=1}^{n} min(X_i(k), X_j(k)) [41], where n, equal
to 192 in this paper, is the total number of histogram bins.
However, owing to its limited discrimination capability, such
histogram intersection cannot provide a definite indication
on whether or not the two shots share a similar scene. For
example, shots featuring different scenes may have a non-
negligible histogram intersection because of common color
contents, while shots covering similar scenes may not have
a large histogram intersection due to occlusion or difference
between two camera positions or camera view angles. To
overcome this limitation, especially for sports and news videos,
which contain a limited number of scenes with similar visual
contents, we need to convert the histogram intersection
into a scene likeness indicator, L(S_i, S_j), indicating whether the
two shots have a similar scene. One intuitive way for this is
by comparing the histogram intersection against a predefined
threshold τ, referred to as scene likeness threshold, as follows:
    L(S_i, S_j) = { 1, if I(S_i, S_j) ≥ τ
                    0, otherwise                    (1)

where two shots are considered to have a similar scene if
L(S_i, S_j) = 1.
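The histogram intersection and the thresholding rule (1) can be sketched as follows, assuming normalized SCH vectors; the function names are illustrative, and τ is left as a free parameter here (its adaptive selection is the subject of Section II-A):

```python
import numpy as np

def histogram_intersection(x_i, x_j):
    """I(S_i, S_j) = sum over the n = 192 bins of min(X_i(k), X_j(k)) [41]."""
    return np.minimum(x_i, x_j).sum()

def scene_likeness(x_i, x_j, tau):
    """L(S_i, S_j) as in (1): 1 if the intersection reaches threshold tau."""
    return 1 if histogram_intersection(x_i, x_j) >= tau else 0
```

Since the SCHs are normalized, identical histograms give an intersection of 1 and disjoint ones give 0, so τ lies in (0, 1].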
A threshold so chosen works under the premise that the
pertinent probabilities and density functions estimated from
the training set are good approximations to those of the videos
to be analyzed. In practice, this is often not the case because of
the large variation in video contents. As an example, Fig. 2
shows the range of the scene likeness threshold suitable for
identifying the similar shots (i.e., shots with similar scenes)
of each shot in a test Tennis video, where shots 1–46 and
shots 89–137 cover the tennis game, shots 47–88 cover
several commercials, and the ground truths of similar shots
are determined manually. We can observe that, even for a
single video, the suitable range of the scene likeness threshold
can vary from shot to shot. For instance, as the contents of
most commercial shots are very different from those of other
shots (hence their suitable thresholds span a wide range),
it is possible to obtain a fixed scene likeness threshold for
classifying commercial shots. This is, however, not the case
for shots covering the tennis game. As the tennis shots are
mainly taken by a fixed camera to expose the same players
and audience, they share similar color contents, and hence,
the suitable range of the scene likeness threshold for each
shot is much smaller than that of commercial shots, and may
not overlap with others. Furthermore, the optimal threshold
determined for the commercial shots is likely not suitable for
the tennis shots, and vice versa. Hence, much desired is an
approach that can determine the optimal threshold for each
shot based on its content. We propose such a scheme in the
next section.
A. Peer-Group Filtering
To obtain the optimal scene likeness threshold for each shot
S_i, we propose using a PGF scheme to partition the shots
under comparison into two clusters: shots that are similar
to shot S_i and shots that are not. The PGF scheme was
previously proposed by Deng et al. for color image enhancement,
quantization, and segmentation of texture regions, with
promising results [42], [43]. The main function of