Creating Video Summarization From Emotion
Perspective
Yijie LAN, Shikui WEI*, Ruoyu LIU, Yao ZHAO
Institute of Information Science,
Beijing Jiaotong University, Beijing 100044, China
E-mail: 14120320@bjtu.edu.cn, shkwei@bjtu.edu.cn, 12112062@bjtu.edu.cn, yzhao@bjtu.edu.cn
* Corresponding Author
Abstract—This paper proposes a novel approach to summarizing non-professionally edited videos, which contain a large amount of redundant information, from the viewpoint of emotion. Ground-truth emotion scores for each frame are first obtained from our human-annotated dataset. Then, we extract emotional features of each frame from the training-set videos. After that, we train our predictive model on the feature vectors and emotion scores by using linear regression. Meanwhile, videos are partitioned into several segments. We select a subset of segments whose total length is below a specified value by optimizing the sum of their emotion scores. This subset of segments can be treated as the desired emotional video summarization. The experimental results show that the proposed scheme can achieve an effective emotional video summarization.
Keywords—Video summarization; emotion; regression model;
video segment
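As a rough illustration of the training step described in the abstract, the following minimal sketch (Python, assuming NumPy and scikit-learn) fits a linear regression model that maps per-frame emotion feature vectors to annotated emotion scores. The toy colour-statistics feature extractor and the function names are illustrative assumptions and only stand in for the combined low-, mid- and high-level emotion features used in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def extract_emotion_features(frames):
    """Toy stand-in for the paper's emotion feature vector.

    Each frame (an H x W x 3 uint8 array) is reduced to per-channel colour
    means and standard deviations; the paper instead combines low-, mid-
    and high-level visual features.
    """
    feats = []
    for f in frames:
        f = f.astype(np.float64) / 255.0
        feats.append(np.concatenate([f.mean(axis=(0, 1)), f.std(axis=(0, 1))]))
    return np.vstack(feats)                          # shape: (n_frames, 6)

def train_emotion_model(train_frames, train_scores):
    """Fit the per-frame emotion-score predictor with linear regression."""
    X = extract_emotion_features(train_frames)
    y = np.asarray(train_scores, dtype=float)        # human-voted ground truth
    return LinearRegression().fit(X, y)

def predict_frame_scores(model, frames):
    """Estimate an emotion score for every frame of an unseen video."""
    return model.predict(extract_emotion_features(frames))
```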
I. INTRODUCTION
With the rapid development of multimedia and Internet technologies, multimedia data has experienced explosive growth over the past decades. Digital videos, as a major carrier of multimedia information, have been applied widely in many aspects of our lives. However, the majority of these videos contain much redundant information, and searching a tedious video for interesting segments or significant parts consumes a lot of the viewer's time. Video summarization can solve this problem efficiently. Video summarization, which is similar to text summarization, aims to condense an original video by finding the useful or needed parts and composing them into a short, compact and informative summary. This makes it easy for users to find the video parts that they want to watch and share, as with movie trailers.
It cannot be denied that every video carries some emotion. However, almost all previous methods neglect the fact that emotion is a key piece of the information we want to obtain from a video. Generally speaking, users often focus on the video parts that carry strong emotion. For example, a professional movie trailer is usually composed of shots that evoke strong feelings in viewers. Another example is that videos of a birthday party often express a delightful emotion. When we make a summary of such a video, the summarized parts tend to be those with smiling faces or blowing out candles, rather than the parts about preparing for the party, even though they carry the equivalent semantic meaning of "birthday party". Therefore, emotional video summarization can provide great assistance in the intuitive understanding of a video.
The contributions of the proposed scheme can be summarized as follows:
1. A new dataset is constructed, annotated by humans to obtain "emotion scores". This dataset consists of several movie fragments edited by non-professional users, which contain redundant information. Emotion scores are obtained by voting and reflect the intensity of emotion.
2. An emotion feature vector is proposed, which combines low-, mid- and high-level visual features. It stores emotional information by establishing a "bridge" between emotion and visual features.
3. A new approach to video summarization from the emotion perspective is proposed, which estimates the emotion scores of segments and optimizes their sum (a sketch of this selection step is given after this list).
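To make the selection step in the third contribution concrete, the sketch below treats it as a 0/1 knapsack problem: choose a subset of segments whose total length stays within a budget while maximizing the sum of their estimated emotion scores. The variable names and the dynamic-programming formulation are illustrative assumptions; the paper only states that the sum of segment emotion scores is optimized under a length constraint.

```python
def select_segments(seg_lengths, seg_scores, max_length):
    """Pick a subset of segments maximizing total emotion score subject to
    a total-length budget (0/1 knapsack solved by dynamic programming).

    seg_lengths : list[int]   segment lengths (e.g. in frames or seconds)
    seg_scores  : list[float] estimated emotion score of each segment
    max_length  : int         maximum allowed summary length
    Returns the indices of the selected segments.
    """
    # best[l] = (best total score, chosen indices) achievable within budget l
    best = [(0.0, [])] * (max_length + 1)
    for i, (length, score) in enumerate(zip(seg_lengths, seg_scores)):
        new_best = best[:]
        for l in range(length, max_length + 1):
            cand_score = best[l - length][0] + score
            if cand_score > new_best[l][0]:
                new_best[l] = (cand_score, best[l - length][1] + [i])
        best = new_best
    return max(best, key=lambda t: t[0])[1]

# Example: three segments of lengths 40, 90 and 60 frames with predicted
# emotion scores 2.5, 4.0 and 3.5; under a 100-frame budget the first and
# third segments are selected.
print(select_segments([40, 90, 60], [2.5, 4.0, 3.5], 100))   # -> [0, 2]
```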
II. RELATED WORKS
Two related techniques are introduced in this section, i.e.,
video summarization and emotion recognition.
A. Video Summarization
Since the 1990s, video summarization techniques have drawn much research and industrial interest. In [1], Truong et al. present a detailed review of the video summarization works before 2007. They describe two main types of video abstracts: key-frames and video skims.
Key-frames, also called a static storyboard, consist of a collection of salient images extracted from the source video. Early works in this form extracted key-frames by using optical flow computation [2] or low-level features [3]. In recent years, key-frames have been selected using clustering based on visual features [4], objects [5] or change detection [6]. However, these key-frame based approaches are not sufficient because they discard the most important motion information. The second form is the video skim, also called a dynamic summary, which consists of a collection of