4 Yandong Li, Liqiang Wang, Tianbao Yang, Boqing Gong
26, 6]. Some approaches use graph models for event detection [26, 5]. In
general, the criteria these methods apply when deciding whether to include or
exclude a shot are devised empirically by the system developers. Other
approaches leverage Web images for video summarization, on the assumption
that static Web pictures tend to contain information of interest to people and
therefore reveal user-oriented importance for selecting video shots/frames [4, 27–29].
Supervised video summarization: Recently, supervised video summarization has been
explored for various goals [1, 10–13, 9, 8, 30, 17–19]. These methods
achieve superior performance over the traditional unsupervised clustering algorithms.
Among them, Gygli et al. introduce supervision by learning the weight of each
criterion in a mixture of objectives [12, 10]. A hierarchical model has been
proposed to learn from few labels; it is optimized to generate video summaries
containing interesting objects [30]. Egocentric videos [31] can be compacted according
to the importance of people and objects [8]; on the other hand, Zheng et al. explicitly
consider how one sub-event leads to another in order to provide a better sense of story
for such videos [9]. Meanwhile, Yao et al. propose a pairwise deep ranking model to
highlight segments of first-person videos [32]. In summary, supervised methods can
exploit users' intentions about what constitutes a good video summary, rather than
relying solely on the experts' own perspective in designing the systems.
Besides, as a powerful model for diverse subset selection, the determinantal point
process (DPP) has been widely used for video summarization. For instance, Gong et al.
propose SeqDPP [1], which is, as far as we know, the first supervised video
summarization method; it models local diversity to capture the temporal information
of videos rather than modeling global diversity. Combining long short-term memory
(LSTM) with DPPs has been studied in [19] to model the variable-range temporal
dependency and the diversity among video frames at the same time. Transferring
summary structures from annotated videos to unseen test videos is studied in [11].
Sharghi et al. explore query-focused video summarization in [17, 18]. The large-margin
separation principle has been leveraged to estimate the parameters of DPPs in [13].
We will provide more details of DPPs and SeqDPP in Sections 3.1 and 3.2.
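To make the diversity property concrete, the following minimal sketch evaluates the standard L-ensemble probability P(Y) = det(L_Y) / det(L + I) on three hypothetical frame features, two of which are nearly identical. The feature vectors, the function names, and the choice of a plain Gram matrix as the kernel L are assumptions made for illustration; they are not the parameterization used in [1].

```python
from itertools import combinations


def det(m):
    """Determinant by Laplace expansion; adequate for the tiny matrices here."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix (the empty subset)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def submatrix(L, idx):
    """Rows and columns of L indexed by the subset idx."""
    return [[L[i][j] for j in idx] for i in idx]


def dpp_prob(L, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(submatrix(L, subset)) / det(L_plus_I)


# Hypothetical frame features: frames 0 and 1 are near-duplicates,
# frame 2 is orthogonal to frame 0.
feats = [(1.0, 0.0), (0.99, 0.14), (0.0, 1.0)]

# Assumed kernel: Gram matrix of the features (inner products).
L = [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

p_similar = dpp_prob(L, [0, 1])  # redundant pair
p_diverse = dpp_prob(L, [0, 2])  # diverse pair

# Probabilities over all subsets (including the empty set) sum to one.
total = sum(dpp_prob(L, list(s))
            for k in range(len(L) + 1)
            for s in combinations(range(len(L)), k))

print(p_similar, p_diverse, total)
```

Because det(L_Y) measures the squared volume spanned by the selected feature vectors, the near-duplicate pair receives a much smaller probability than the orthogonal pair.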
Reinforcement learning (RL) provides a unified solution to both problems above.
The REINFORCE algorithm [38] is utilized to train recurrent neural networks [33].
Rennie et al. borrow ideas from [33] for the image captioning task and obtain very
promising results [39]. We note that the use of RL in those contexts is icing on the
cake in the sense that, while RL boosts the results to some degree, the MLE is still
applicable. For our DySeqDPP model, however, RL becomes a necessary choice because
handling the latent variables in DySeqDPP by MLE is highly involved.
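For readers unfamiliar with REINFORCE, the sketch below shows the score-function update on a toy three-action problem: an action is sampled from a softmax policy, a scalar reward is observed, and the logits move along (reward - baseline) times the gradient of the log-probability. The toy rewards, the moving-average baseline, the step size, and all function names are assumptions for illustration only; this is not the training procedure of DySeqDPP.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def reinforce(rewards, steps=5000, lr=0.1, seed=0):
    """Run REINFORCE with a moving-average baseline on a toy bandit.

    rewards[a] is the (hypothetical) scalar reward of action a, standing in
    for a non-differentiable evaluation metric.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        # The baseline tracks the average reward to reduce gradient variance.
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


rewards = [0.2, 0.5, 1.0]  # hypothetical rewards; action 2 is best
final_probs = reinforce(rewards)
print(final_probs)
```

The update needs only sampled actions and their rewards, not a differentiable likelihood of the target, which is why it remains usable when MLE is hard to apply.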
3 Background: DPP and SeqDPP
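We briefly review the determinantal point process (DPP) and the sequential DPP
(SeqDPP) in this section. It will soon become clear how the former promotes diversity
in the selected subset and how the latter achieves local diversity.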