LU et al.: REAL-TIME, ADAPTIVE, AND LOCALITY-BASED GRAPH PARTITIONING METHOD FOR VIDEO SCENE CLUSTERING 1749
within a cluster and the distinction of shots between different
clusters. On the other hand, considering that the video data are
normally obtained and viewed sequentially, a sequential graph
partitioning approach is also devised for long video sequences,
with the aim of further reducing the computational complexity
without much degradation of the clustering accuracy.
Preliminary versions of our related work have been reported
in [12] and [34]. This paper extends our contribution in both
the technical and evaluation parts. First, the scene likeness
was determined by a fixed threshold in [34], or by minimizing
the average probability of clustering error in the maximum a
posteriori sense in [12]. In this paper, an adaptive peer-group
filtering (PGF) method is proposed to identify the shots that
are visually similar to each individual shot, which will be
described in detail in Section II-A. Further-
more, a sequential graph partitioning approach combined with
the local PGF method for processing long video sequences is
proposed. The proposed method is also compared with other
methods such as k-means clustering in [34] and the normalized
cuts method in [12] and [34]. The comparison is now based on
the Minkowski measure considering all obtained clusters. The
computational complexity analysis of the proposed method is
also compared with other existing methods.
The rest of this paper is organized as follows. In Section II,
we present the PGF scheme to identify the similar shots of
each individual shot, and then construct the scene likeness
matrix of the video under analysis. Section III formulates the
clustering of video scenes as a graph partitioning problem. We
then compare the computational complexity of the proposed
approach against those of conventional k-means clustering and
normalized cuts in Section IV, and describe the sequential
partitioning approach in Section V. Experimental simulations
and results are presented in Section VI, and this paper is
concluded in Section VII.
II. Scene Likeness of Shots
To group the shots of a given video into shot clusters with
disparate scenes, we first examine the scene likeness among
shots. To begin with, we segment the video into a sequence of
shots S = {S_1, S_2, ..., S_M}, say M of them, by using a popular
shot-boundary detector [36], [37]. Each shot is considered a
basic unit for processing, and its shot color histogram (SCH)—
the bin-wise median of all the frame color histograms within
the shot [38]—is computed as the shot feature for scene
clustering. A frame color histogram records the percentage of
each quantized color among all pixels in a frame. In this paper,
we compute the frame (and hence the shot) color histograms
in hue-saturation-value color space with its color coordinates
uniformly quantized into 12 (hue), 4 (saturation), and 4 (value)
bins, respectively, resulting in a total of 192 bins (or quantized
colors). The color space and quantization scheme are chosen
for their good performance, as also reported in other works on
content-based image and video analysis [39], [40].
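The SCH extraction described above can be sketched as follows, assuming HSV frames with H in [0, 360) and S, V in [0, 1); the function names and array conventions are illustrative, not from the paper:

```python
import numpy as np

H_BINS, S_BINS, V_BINS = 12, 4, 4  # 12 x 4 x 4 = 192 quantized colors

def frame_color_histogram(hsv_frame):
    """Normalized 192-bin color histogram of one HSV frame.

    hsv_frame: (height, width, 3) array; each bin records the
    percentage of pixels falling into that quantized color.
    """
    h, s, v = hsv_frame[..., 0], hsv_frame[..., 1], hsv_frame[..., 2]
    # Uniformly quantize each channel, then flatten to one bin index in [0, 192).
    hi = np.minimum((h / 360.0 * H_BINS).astype(int), H_BINS - 1)
    si = np.minimum((s * S_BINS).astype(int), S_BINS - 1)
    vi = np.minimum((v * V_BINS).astype(int), V_BINS - 1)
    bins = (hi * S_BINS + si) * V_BINS + vi
    hist = np.bincount(bins.ravel(), minlength=H_BINS * S_BINS * V_BINS)
    return hist / hist.sum()

def shot_color_histogram(frames):
    """SCH: bin-wise median of all frame color histograms within the shot [38]."""
    return np.median([frame_color_histogram(f) for f in frames], axis=0)
```

Note that because the SCH takes a bin-wise median, it need not sum exactly to 1, unlike the individual frame histograms.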
Let X = {X_1, X_2, ..., X_M} denote the SCHs of the M
video shots. The visual similarity between two shots S_i and
S_j is judged based on the intersection of their SCHs, defined
as I(S_i, S_j) = Σ_{k=1}^{n} min(X_i(k), X_j(k)) [41], where n, equal
to 192 in this paper, is the total number of histogram bins.
However, owing to its limited discrimination capability, such
histogram intersection cannot provide a definite indication
on whether or not the two shots share a similar scene. For
example, shots featuring different scenes may have a non-
negligible histogram intersection because of common color
contents, while shots covering similar scenes may not have
a large histogram intersection due to occlusion or difference
between two camera positions or camera view angles. To
overcome this limitation, especially for sports and news videos,
which contain a limited number of scenes with similar visual
contents, we need to convert the histogram intersection
into a scene likeness indicator, L(S_i, S_j), indicating whether the
two shots have a similar scene. One intuitive way for this is
by comparing the histogram intersection against a predefined
threshold τ, referred to as scene likeness threshold, as follows:
    L(S_i, S_j) = { 1, if I(S_i, S_j) ≥ τ
                    0, otherwise                    (1)

where two shots are considered to have a similar scene if
L(S_i, S_j) = 1.
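The histogram intersection and the thresholding rule (1) can be sketched as follows, assuming normalized SCH vectors; the function names are illustrative, and τ is left as a free parameter here (its adaptive selection is the subject of Section II-A):

```python
import numpy as np

def histogram_intersection(x_i, x_j):
    """I(S_i, S_j) = sum over the n = 192 bins of min(X_i(k), X_j(k)) [41]."""
    return np.minimum(x_i, x_j).sum()

def scene_likeness(x_i, x_j, tau):
    """L(S_i, S_j) as in (1): 1 if the intersection reaches threshold tau."""
    return 1 if histogram_intersection(x_i, x_j) >= tau else 0
```

Since the SCHs are normalized, identical histograms give an intersection of 1 and disjoint ones give 0, so τ lies in (0, 1].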
A threshold so chosen works under the premise that the
pertinent probabilities and density functions estimated from
the training set are good approximations to those of the videos
to be analyzed. In practice, this is often not the case because of
the large variation in video contents. As an example, Fig. 2
shows the range of the scene likeness threshold suitable for
identifying the similar shots (i.e., shots with similar scenes)
of each shot in a test Tennis video, where shots 1–46 and
shots 89–137 cover the tennis game, shots 47–88 cover
several commercials, and the ground truths of similar shots
are determined manually. We can observe that, even for a
single video, the suitable range of the scene likeness threshold
can vary from shot to shot. For instance, as the contents of
most commercial shots are very different from those of other
shots (hence their suitable thresholds span a wide range),
it is possible to obtain a fixed scene likeness threshold for
classifying commercial shots. This is, however, not the case
for shots covering the tennis game. As the tennis shots are
mainly taken by a fixed camera to expose the same players
and audience, they share similar color contents, and hence,
the suitable range of the scene likeness threshold for each
shot is much smaller than that of commercial shots, and may
not overlap with others. Furthermore, the optimal threshold
determined for the commercial shots is likely not suitable for
the tennis shots, and vice versa. Hence, much desired is an
approach that can determine the optimal threshold for each
shot based on its content. We propose such a scheme in the
next section.
A. Peer-Group Filtering
To obtain the optimal scene likeness threshold for each shot
S_i, we propose using a PGF scheme to partition the shots
under comparison into two clusters: shots that are similar
to shot S_i and shots that are not. The PGF scheme was
previously proposed by Deng et al. for color image enhancement,
quantization, and segmentation of texture regions, with
promising results [42], [43]. The main function of