Joint Shot Boundary Detection and Key Frame Extraction
Xiao Liu¹, Mingli Song¹, Luming Zhang¹, Senlin Wang¹, Jiajun Bu¹, Chun Chen¹ and Dacheng Tao²
¹ Zhejiang Provincial Key Laboratory of Service Robot
College of Computer Science, Zhejiang University
{ender liux, brooksong, snail wang, bjj, chenc}@zju.edu.cn
² Center for Quantum Computation and Information Systems, UTS
dacheng.tao@gmail.com
Abstract
Representing a video by a set of key frames is useful for efficient video browsing and retrieval. However, key frame extraction remains a challenging problem in computer vision. In this paper, we propose a joint framework that integrates shot boundary detection and key frame extraction, in which three probabilistic components are taken into account, namely the prior of the key frames, the conditional probability of the shot boundaries and the conditional probability of each video frame. Key frame extraction is thus formulated as a Maximum A Posteriori (MAP) problem, which can be solved with an alternating optimization strategy. Experimental results show that the proposed method preserves the scene-level structure and extracts key frames that are both representative and discriminative.
1 Introduction
Millions of cameras around the world capture an enormous amount of video data every day, which raises a new challenge: storing this data and retrieving it frequently inevitably incur high costs in both time and space. Hence it is valuable to allow people to retrieve a video, or gain an overview of its content, without watching all of the video data.
To transfer as many cues as possible from a video into a limited number of key frames, Zhang et al. [5] proposed selecting a frame as a key frame if its histogram differs significantly from that of the previously selected key frame. This method fails to guarantee that the key frames are representative. Representing each frame as a color histogram, Zhuang et al. [6] clustered the frames of a video into several clusters and then selected one key frame to describe each cluster. This algorithm completely ignores temporal information, which is very important for key frame representation. Won et al. [4] detected video shot boundaries using the luminance variance; their method is based on the difference in modelling errors of an ideally modelled transition. Černeková et al. [1] first detected shots and then extracted key frames using mutual information and joint entropy. Kelm et al. [2] segmented videos into shots by detecting gradual transitions and abrupt cuts, and extracted key frames using visual attention features. Sun et al. [3] extracted key frames at the peaks of the curve of color-distribution distances between frames.
These methods rely on effective shot boundary detection. Unfortunately, shot boundary detection is data dependent, and it is difficult to obtain accurate detections across different videos. Furthermore, even when semantically meaningful shots are obtained, these algorithms cannot guarantee that each shot contains exactly one qualified key frame.
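To make the first criticism concrete, the following sketch implements a sequential selection rule in the spirit of the one attributed to Zhang et al. [5]: a frame becomes a key frame when its histogram differs from the last selected key frame by more than a threshold. The L1 histogram distance, the threshold value and the function names (`l1_distance`, `threshold_key_frames`) are assumptions for illustration, not details taken from [5].

```python
from typing import List, Sequence

Histogram = Sequence[float]


def l1_distance(h1: Histogram, h2: Histogram) -> float:
    """L1 distance between two equal-length histograms (an assumed metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def threshold_key_frames(histograms: List[Histogram], threshold: float) -> List[int]:
    """Sequential selection: keep frame t if it differs from the last kept key frame
    by more than `threshold`. The first frame is always kept."""
    if not histograms:
        return []
    keys = [0]
    for t in range(1, len(histograms)):
        if l1_distance(histograms[t], histograms[keys[-1]]) > threshold:
            keys.append(t)
    return keys


if __name__ == "__main__":
    # Toy 3-bin color histograms: the "scene" changes at frame 3.
    frames = [
        [0.8, 0.1, 0.1],
        [0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.1, 0.1, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print(threshold_key_frames(frames, threshold=0.5))  # prints [0, 3]
```

On this toy input, frame 0 is kept for the first scene even though frame 1 has the smallest summed distance to the other frames of that scene. The rule only reacts to change relative to the last key frame, so nothing pushes it toward a representative frame.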
In contrast to previous algorithms, which first detect shot boundaries and then extract key frames based on the resulting division, we propose a joint framework that integrates shot boundary detection and key frame extraction in a single probabilistic model, in order to solve, or at least alleviate, the aforementioned problems. The proposed algorithm divides a video into a fixed number of shots and selects one key frame for each shot such that the selected key frames best match the original video. This formulation allows shot boundary detection and key frame extraction to benefit from each other. Key frame extraction is then treated as a Maximum A Posteriori (MAP) problem, which can be solved with an alternating strategy.
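The idea of dividing a video into a fixed number of shots and picking one key frame per shot, optimized alternately, can be illustrated with a deliberately simplified sketch. It is not the paper's MAP model: the three probabilistic terms are replaced here by a plain L1 histogram distance between each frame and the key frame of its shot, and each shot's key frame is taken to be its medoid. All names (`joint_shots_and_keys`, `best_boundary`, `key_frames_for`, `max_iter`) are hypothetical. Each step minimizes the same total distance with the other variable held fixed, so the cost never increases from one alternation to the next.

```python
from typing import List, Sequence, Tuple

Histogram = Sequence[float]


def l1_distance(h1: Histogram, h2: Histogram) -> float:
    """L1 distance between two equal-length histograms (assumed stand-in for the paper's likelihood)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def medoid(frames: List[Histogram], lo: int, hi: int) -> int:
    """Frame index in [lo, hi) with the smallest summed distance to the frames in [lo, hi)."""
    return min(
        range(lo, hi),
        key=lambda c: sum(l1_distance(frames[c], frames[t]) for t in range(lo, hi)),
    )


def key_frames_for(frames: List[Histogram], starts: List[int]) -> List[int]:
    """Step 1: with the shots fixed, pick the medoid of each shot as its key frame."""
    ends = starts[1:] + [len(frames)]
    return [medoid(frames, s, e) for s, e in zip(starts, ends)]


def best_boundary(frames: List[Histogram], left_key: int, right_key: int) -> int:
    """Step 2 for one boundary: choose the first frame s of the right-hand shot in
    (left_key, right_key] so that frames left_key+1 .. s-1 go to left_key and
    frames s .. right_key-1 go to right_key with the lowest total distance."""
    best_s, best_cost = right_key, float("inf")
    for s in range(left_key + 1, right_key + 1):
        cost = sum(l1_distance(frames[t], frames[left_key]) for t in range(left_key + 1, s))
        cost += sum(l1_distance(frames[t], frames[right_key]) for t in range(s, right_key))
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


def joint_shots_and_keys(
    frames: List[Histogram], num_shots: int, max_iter: int = 20
) -> Tuple[List[int], List[int]]:
    """Alternate between key frame selection and boundary placement.

    Returns (starts, keys): starts[i] is the first frame of shot i, keys[i] its key frame.
    """
    n = len(frames)
    if not 1 <= num_shots <= n:
        raise ValueError("need 1 <= num_shots <= number of frames")
    # Uniform initial segmentation into num_shots non-empty, contiguous shots.
    starts = [i * n // num_shots for i in range(num_shots)]
    for _ in range(max_iter):
        keys = key_frames_for(frames, starts)
        # Frames between two consecutive key frames can only join one of those two
        # shots, so each boundary is optimized independently of the others.
        new_starts = [0] + [best_boundary(frames, keys[i], keys[i + 1]) for i in range(num_shots - 1)]
        if new_starts == starts:
            break
        starts = new_starts
    return starts, key_frames_for(frames, starts)


if __name__ == "__main__":
    # Six toy frames; the scene changes at frame 4, but the uniform split starts shot 2 at frame 3.
    frames = [[1.0, 0.0, 0.0]] * 4 + [[0.0, 0.0, 1.0]] * 2
    print(joint_shots_and_keys(frames, num_shots=2))  # prints ([0, 4], [0, 4])
```

In the toy run, the initial key frames (0 and 4) pull the boundary from frame 3 to frame 4, after which neither step changes anything and the loop stops. Restricting each boundary to lie between consecutive key frames keeps every shot contiguous and guarantees that it still contains its own key frame.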
2 A Probabilistic Model for Representing Video by Key Frames
For frame-based video summarization and retrieval,
shot boundary detection is usually taken as the first
21st International Conference on Pattern Recognition (ICPR 2012)
November 11-15, 2012. Tsukuba, Japan
978-4-9906441-0-9 ©2012 ICPR 2565