(3) Our parsing algorithm can afford to generate all possible parse graphs of single events and combine these parse graphs into an interpretation of the input video, achieving globally optimal maximum a posteriori (MAP) inference (a minimal sketch of such MAP selection follows this list).
(4) The agent’s goal and intent at each time point are inferred by a combined bottom-up and top-down process based on the top-ranked parse graphs, i.e., the most probable interpretations. Experiments with human subjects show that our parsing algorithm correctly infers agents’ goals and intents from the video content.
(5) We show that event context can be used to improve the detection results of atomic actions and to better segment and recognize objects in the scene. We place event learning and inference in the context of the scene, which provides a rich collection of agent–environment interactions. By inference on the joint probability of agent and environment events, we show how action recognition can aid object recognition and scene segmentation.
(6) We collect a video data set of daily-life events captured in both indoor and outdoor scenes to evaluate the proposed algorithm. The events in the videos include single-agent events, multi-agent events, and concurrent events. The algorithm’s output is evaluated by human subjects, and the experiments show satisfactory results.
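To make the inference in contribution (3) concrete, the following Python sketch scores a small set of candidate parse graphs and returns the maximum a posteriori interpretation. The ParseGraph structure, the additive log-prior/log-likelihood scoring, and the event names are illustrative assumptions, not the paper’s actual probability model.

```python
# Hypothetical sketch of maximum a posteriori (MAP) selection over
# candidate event parse graphs; the scoring model is a placeholder,
# not the probability model used in the paper.
import math
from dataclasses import dataclass

@dataclass
class ParseGraph:
    """A candidate interpretation: one event label covering a time span."""
    event: str
    start_frame: int
    end_frame: int
    log_prior: float       # log P(pg) from the event grammar
    log_likelihood: float  # log P(video observations | pg)

    def log_posterior(self) -> float:
        # Unnormalized log posterior; the normalizing constant is shared
        # by all candidates and can be ignored for the argmax.
        return self.log_prior + self.log_likelihood

def map_interpretation(candidates: list) -> ParseGraph:
    """Return the top-ranked parse graph (the MAP interpretation)."""
    return max(candidates, key=lambda pg: pg.log_posterior())

if __name__ == "__main__":
    candidates = [
        ParseGraph("make_coffee", 0, 120, math.log(0.4), -35.2),
        ParseGraph("use_laptop", 0, 120, math.log(0.5), -33.9),
        ParseGraph("read_paper", 0, 120, math.log(0.1), -34.5),
    ]
    print("MAP interpretation:", map_interpretation(candidates).event)
```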
This paper is an enhanced combination of our previous conference papers [22,23], which focus on event parsing and grammar learning, respectively. Here we integrate them into a coherent framework, add more experimental results to evaluate the proposed algorithm, and present new experiments on segmenting and recognizing objects in the scene.
2. Event representation by T-AOG
In this section, we introduce the T-AOG for event
representation.
The T-AOG is based on interactions between agents and objects in the scene. Our training and testing videos contain 13 classes of objects of interest, including mugs, laptops, and water dispensers. Ideally these objects would be detected automatically; however, multi-class object detection in complex scenes cannot be solved reliably by the state of the art. We therefore adopt a semi-automatic object detection system. The objects in each scene are detected by multi-class boosting with feature sharing [24] and segmented by a recent indoor scene parsing algorithm [25]. This step is not time consuming, as it is performed only once per scene, after which the objects of interest are tracked automatically through the video events (a sketch of this detect-once, track-thereafter scheme is given below). Fig. 1 shows the detected objects of interest in an office.
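The following sketch illustrates how the detect-once, track-thereafter organization could be arranged. The greedy IoU-based association is an assumption made for illustration; it stands in for the boosting detector of [24] and whatever tracker the system actually uses, neither of which is reproduced here.

```python
# Illustrative sketch of the semi-automatic pipeline: objects are
# detected/segmented once per scene and then carried through subsequent
# frames. The greedy IoU association below is an assumption for
# illustration, not the tracking method used in the paper.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    label: str = ""

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def track(initial_detections: list, frame_candidates: list,
          min_iou: float = 0.3) -> dict:
    """Carry each once-per-scene detection through per-frame candidate boxes."""
    tracks = {box.label: [box] for box in initial_detections}
    for candidates in frame_candidates:
        for label, history in tracks.items():
            prev = history[-1]
            best = max(candidates, key=lambda c: iou(prev, c), default=None)
            if best is not None and iou(prev, best) >= min_iou:
                history.append(Box(best.x1, best.y1, best.x2, best.y2, label))
    return tracks
```

Here, initial_detections would come from the once-per-scene detection and segmentation step, and frame_candidates from a cheap per-frame proposal mechanism.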
2.1. Grounded relations—the alphabet
The T-AOG is defined on a set of unary and binary relations
which can be directly detected from video. We call these relations
the grounded relations.
Fig. 1. Detection results for the objects of interest in the office scene.
Fig. 2. Some unary relations. The left part of the table shows the four unary relations as agent poses: ‘Stand’, ‘Stretch’, ‘Bend’, and ‘Sit’. The right part shows the two fluents (‘On’ and ‘Off’) of the phone and the laptop screen.
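As a concrete (hypothetical) encoding of this alphabet, the sketch below represents unary relations (agent poses and object fluents, with names taken from Fig. 2) and binary relations as typed frame-level atoms. The binary relation names and the idea of a per-frame detector emitting such atoms are illustrative assumptions.

```python
# Hypothetical encoding of grounded relations, the terminal alphabet of
# the T-AOG. The unary relation names follow Fig. 2; the binary relation
# names ("Near", "Touch") and the per-frame atom interface are
# illustrative assumptions, not the paper's exact definitions.
from dataclasses import dataclass

UNARY_POSES = {"Stand", "Stretch", "Bend", "Sit"}  # agent poses (Fig. 2)
UNARY_FLUENTS = {"On", "Off"}                      # e.g. phone, laptop screen

@dataclass(frozen=True)
class UnaryRelation:
    """A property of one entity at one frame, e.g. Sit(agent1)."""
    name: str    # drawn from UNARY_POSES or UNARY_FLUENTS
    entity: str  # e.g. "agent1", "laptop_screen"
    frame: int

@dataclass(frozen=True)
class BinaryRelation:
    """A relation between two entities at one frame, e.g. Near(agent1, mug)."""
    name: str    # e.g. "Near", "Touch" (illustrative names)
    subject: str
    obj: str
    frame: int

# Example: grounded atoms a frame-level detector might emit.
atoms = [
    UnaryRelation("Sit", "agent1", frame=42),
    UnaryRelation("On", "laptop_screen", frame=42),
    BinaryRelation("Near", "agent1", "laptop", frame=42),
]
```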