(3) Our parsing algorithm can afford to generate all possible parse graphs of single events and combine these parse graphs into an interpretation of the input video, achieving globally optimal maximum a posteriori (MAP) inference (a minimal sketch of such MAP selection follows this list).
(4) The agent’s goal and intent at each time point are inferred by a combined bottom-up and top-down process based on the top-ranked parse graphs, i.e., the most probable interpretations. Experiments with human subjects show that our parsing algorithm correctly infers agents’ goals and intents from the video content.
(5) We show that event context can be used to improve the detection results of atomic actions and to better segment and recognize objects in the scene. We place event learning and inference in the context of the scene, which provides a rich collection of agent–environment interactions. By inference on the joint probability of agent and environment events, we show how action recognition can aid object recognition and scene segmentation.
(6) We collect a video data set of daily-life events captured in both indoor and outdoor scenes to evaluate the proposed algorithm. The events in the videos include single-agent events, multi-agent events, and concurrent events. The algorithm’s output is evaluated by human subjects, and the experiments show satisfactory results.
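To make the inference in contribution (3) concrete, the following Python sketch scores a small set of candidate parse graphs and returns the maximum a posteriori interpretation. The ParseGraph structure, the additive log-prior/log-likelihood scoring, and the event names are illustrative assumptions, not the paper’s actual probability model.

```python
# Hypothetical sketch of maximum a posteriori (MAP) selection over
# candidate event parse graphs; the scoring model is a placeholder,
# not the probability model used in the paper.
import math
from dataclasses import dataclass

@dataclass
class ParseGraph:
    """A candidate interpretation: one event label covering a time span."""
    event: str
    start_frame: int
    end_frame: int
    log_prior: float       # log P(pg) from the event grammar
    log_likelihood: float  # log P(video observations | pg)

    def log_posterior(self) -> float:
        # Unnormalized log posterior; the normalizing constant is shared
        # by all candidates and can be ignored for the argmax.
        return self.log_prior + self.log_likelihood

def map_interpretation(candidates: list) -> ParseGraph:
    """Return the top-ranked parse graph (the MAP interpretation)."""
    return max(candidates, key=lambda pg: pg.log_posterior())

if __name__ == "__main__":
    candidates = [
        ParseGraph("make_coffee", 0, 120, math.log(0.4), -35.2),
        ParseGraph("use_laptop", 0, 120, math.log(0.5), -33.9),
        ParseGraph("read_paper", 0, 120, math.log(0.1), -34.5),
    ]
    print("MAP interpretation:", map_interpretation(candidates).event)
```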
This paper is an enhanced combination of our previous conference papers [22,23], which focus on event parsing and grammar learning, respectively. Here we integrate them into a coherent framework, add more experimental results to evaluate the proposed algorithm, and present new experiments on segmenting and recognizing objects in the scene.
2. Event representation by T-AOG
In this section, we introduce the T-AOG for event
representation.
The T-AOG is based on interactions between agents and objects in the scene. Our training and testing videos contain 13 classes of objects of interest, including mugs, laptops, and water dispensers. Ideally these objects would be detected automatically; however, multi-class object detection in complex scenes cannot be solved reliably by the state of the art. We therefore adopt a semi-automatic object detection system. The objects in each scene are detected by multi-class boosting with feature sharing [24] and segmented by a recent indoor scene parsing algorithm [25]. This step is not time consuming, as it is performed only once per scene, after which the objects of interest are tracked automatically through the video events (a sketch of this detect-once, track-thereafter scheme is given below). Fig. 1 shows the detected objects of interest in an office.
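The following sketch illustrates how the detect-once, track-thereafter organization could be arranged. The greedy IoU-based association is an assumption made for illustration; it stands in for the boosting detector of [24] and whatever tracker the system actually uses, neither of which is reproduced here.

```python
# Illustrative sketch of the semi-automatic pipeline: objects are
# detected/segmented once per scene and then carried through subsequent
# frames. The greedy IoU association below is an assumption for
# illustration, not the tracking method used in the paper.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    label: str = ""

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def track(initial_detections: list, frame_candidates: list,
          min_iou: float = 0.3) -> dict:
    """Carry each once-per-scene detection through per-frame candidate boxes."""
    tracks = {box.label: [box] for box in initial_detections}
    for candidates in frame_candidates:
        for label, history in tracks.items():
            prev = history[-1]
            best = max(candidates, key=lambda c: iou(prev, c), default=None)
            if best is not None and iou(prev, best) >= min_iou:
                history.append(Box(best.x1, best.y1, best.x2, best.y2, label))
    return tracks
```

Here, initial_detections would come from the once-per-scene detection and segmentation step, and frame_candidates from a cheap per-frame proposal mechanism.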
2.1. Grounded relations—the alphabet
The T-AOG is defined on a set of unary and binary relations
which can be directly detected from video. We call these relations
the grounded relations.
Fig. 1. Detection results for the objects of interest in the office scene.
Fig. 2. Some unary relations. The left part of the table shows the four unary relations as agent poses: ‘Stand’, ‘Stretch’, ‘Bend’, and ‘Sit’. The right part shows the two fluents (‘On’ and ‘Off’) of the phone and the laptop screen.
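As a concrete (hypothetical) encoding of this alphabet, the sketch below represents unary relations (agent poses and object fluents, with names taken from Fig. 2) and binary relations as typed frame-level atoms. The binary relation names and the idea of a per-frame detector emitting such atoms are illustrative assumptions.

```python
# Hypothetical encoding of grounded relations, the terminal alphabet of
# the T-AOG. The unary relation names follow Fig. 2; the binary relation
# names ("Near", "Touch") and the per-frame atom interface are
# illustrative assumptions, not the paper's exact definitions.
from dataclasses import dataclass

UNARY_POSES = {"Stand", "Stretch", "Bend", "Sit"}  # agent poses (Fig. 2)
UNARY_FLUENTS = {"On", "Off"}                      # e.g. phone, laptop screen

@dataclass(frozen=True)
class UnaryRelation:
    """A property of one entity at one frame, e.g. Sit(agent1)."""
    name: str    # drawn from UNARY_POSES or UNARY_FLUENTS
    entity: str  # e.g. "agent1", "laptop_screen"
    frame: int

@dataclass(frozen=True)
class BinaryRelation:
    """A relation between two entities at one frame, e.g. Near(agent1, mug)."""
    name: str    # e.g. "Near", "Touch" (illustrative names)
    subject: str
    obj: str
    frame: int

# Example: grounded atoms a frame-level detector might emit.
atoms = [
    UnaryRelation("Sit", "agent1", frame=42),
    UnaryRelation("On", "laptop_screen", frame=42),
    BinaryRelation("Near", "agent1", "laptop", frame=42),
]
```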