Video Clustering
Aditya Vailaya, Anil K. Jain, and HongJiang Zhang
Abstract
We address the problem of clustering video images. We assume that video clips have been segmented into shots, which are further represented by a set of keyframes. Video clustering is thus reduced to a clustering of still keyframe images. Experiments with 8 human subjects reveal that humans tend to use semantic meaning when grouping a set of images. A complete-link dendrogram constructed from the similarities provided by the subjects revealed two significant categories of images: city scenes and landscapes. A hierarchical clustering based on moments of 17 DCT coefficients of the JPEG-compressed keyframe images reveals that ad hoc low-level features are not capable of identifying semantically meaningful categories in an image database. It is well known that a clustering scheme will always find clusters in a data set! In order to define categories that will aid in indexing and browsing of video data, features specific to a given semantic class should be used. As an example, we present initial results using multiple 2-class classifications. Our experiments have been conducted on two databases of 98 and 171 images, respectively. Classifiers for city/non-city shots, presence/absence of text in images, and presence of specific image textures (grass and sky) are being developed.
1. Introduction
1.1. Motivation
Digital video libraries are generating tremendous inter-
est in the pattern recognition, computer vision, and multime-
dia research communities. Powerful processors, high-speed
networking, high-capacity storage devices, improvements
in compression algorithms, and advances in processing of
audio, speech, image, and video signals are making digital
video libraries technically and economically feasible. The
large amount of video data necessitates efficient
schemes for navigating, browsing, searching, and viewing
video data. Traditional schemes allow textual descriptions
and annotations for the classification and indexing of video clips. This requires painstaking manual effort to preview
every clip and assign textual attributes that aid in indexing
the video. As the size of the database increases, the amount
of video that is retrieved by a textual query also increases.
It is generally agreed that after a certain stage, textual at-
tributes cannot further reduce the size of the retrieved data.
Under these circumstances, it is desirable to automatically
extract and organize content information from the video; this information can then be used for content-based retrieval.
1.2. Video Clustering
Video contains a huge amount of data that needs to be organized and compressed in an efficient manner (e.g., one hundred hours of video contains about 10 million frames, requiring about 7.5 terabytes of data [1]).
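For intuition, these figures are consistent with an assumed rate of about 30 frames/s and roughly 0.75 MB per uncompressed frame (assumptions on our part, not values taken from [1]):

\[
100\ \mathrm{h} \times 3600\ \tfrac{\mathrm{s}}{\mathrm{h}} \times 30\ \tfrac{\mathrm{frames}}{\mathrm{s}} \approx 10^{7}\ \mathrm{frames}, \qquad 10^{7}\ \mathrm{frames} \times 0.75\ \mathrm{MB} \approx 7.5\ \mathrm{TB}.
\]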
Recent work in digital video retrieval has emphasized a hierarchical repre-
sentation of video for ease of understanding, representing,
browsing, and indexing [1, 20]. During the parsing process,
video clips are segmented into scenes. Scenes are further
segmented into shots, each of which is represented by a few keyframes. A scene, which represents the highest level of the hierarchy, consists of a group of shots that repre-
sent an abstract meaning, such as a beach scene, a dialogue
in a restaurant, a wedding, etc. A shot is defined as a se-
quence of frames that represents a continuous action in time
and space. Thus, in the scenario of a restaurant dialogue be-
tween Mr. X and Ms. Y, a shot may consist of the sequence
of frames concentrating on Ms. Y as she speaks to Mr. X. A
shot generally consists of multiple frames, many of which
are very similar in content. It is thus desirable to represent
each shot with a minimal set of keyframes that capture the
semantic content of the shot. Automatic schemes for shot
detection and subsequent keyframe extraction have been re-
ported in the literature.
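To make the idea concrete, the sketch below shows one simple and common approach, assuming grayscale frames given as NumPy arrays: a shot boundary is declared wherever consecutive frame histograms differ by more than a threshold, and the middle frame of each shot serves as a crude keyframe. The difference measure, threshold value, and keyframe rule here are illustrative assumptions, not the specific schemes cited above.

```python
# Minimal sketch: histogram-difference shot detection plus
# middle-frame keyframe selection (illustrative, not the cited methods).
import numpy as np

def gray_histogram(frame, bins=64):
    # Normalized intensity histogram of a grayscale (uint8) frame.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def detect_shots(frames, threshold=0.4):
    # Split the frame sequence wherever the L1 distance between
    # consecutive frame histograms exceeds the (assumed) threshold.
    boundaries = [0]
    prev = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        curr = gray_histogram(frames[i])
        if np.abs(curr - prev).sum() > threshold:
            boundaries.append(i)
        prev = curr
    boundaries.append(len(frames))
    return [frames[s:e] for s, e in zip(boundaries, boundaries[1:])]

def keyframe(shot):
    # Crude single-keyframe rule: take the temporally middle frame.
    return shot[len(shot) // 2]
```

Published detectors typically use more robust difference measures and pick keyframes by content change rather than position, but the structure is the same: segment, then summarize each segment.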
Given the above hierarchical representation, a user can
now be presented with a few keyframes that capture the se-
mantic content of the shots. However, a video clip may
contain a number of shots. For example, Yeung et al. [16] report up to 300 shots in a 15-minute clip of Terminator 2 and a 30-minute clip of the sitcom “Frasier”. Assuming an average of 3 keyframes per shot, close to 1,000 keyframes would be required to represent these video clips. In a digital library with over 100 hours of digitized video, about 100,000 keyframes may be extracted. Indexing and clustering of these keyframes would then allow users to jump across video clips to locations of their interest. Our goal is to develop a scheme for automatic classification of keyframes