Detecting and Recognizing Human-Object Interactions
Georgia Gkioxari  Ross Girshick  Piotr Dollár  Kaiming He
Facebook AI Research (FAIR)
Abstract
To understand the visual world, a machine must not only
recognize individual object instances but also how they in-
teract. Humans are often at the center of such interac-
tions and detecting human-object interactions is an impor-
tant practical and scientific problem. In this paper, we ad-
dress the task of detecting ⟨human, verb, object⟩ triplets
in challenging everyday photos. We propose a novel model
that is driven by a human-centric approach. Our hypothesis
is that the appearance of a person – their pose, clothing,
action – is a powerful cue for localizing the objects they
are interacting with. To exploit this cue, our model learns
to predict an action-specific density over target object loca-
tions based on the appearance of a detected person. Our
model also jointly learns to detect people and objects, and
by fusing these predictions it efficiently infers interaction
triplets in a clean, jointly trained end-to-end system we call
InteractNet. We validate our approach on the recently intro-
duced Verbs in COCO (V-COCO) and HICO-DET datasets,
where we show quantitatively compelling results.
1. Introduction
Visual recognition of individual instances, e.g., detecting objects [10, 9, 27] and estimating human actions/poses [12, 32, 2], has witnessed significant improvements thanks to deep learning visual representations [18, 30, 31, 17].
However, recognizing individual objects is just a first step
for machines to comprehend the visual world. To under-
stand what is happening in images, it is necessary to also
recognize relationships between individual instances. In
this work, we focus on human-object interactions.
The task of recognizing human-object interactions [13, 33, 6, 14, 5] can be represented as detecting ⟨human, verb, object⟩ triplets and is of particular interest in applications and in research. From a practical perspective, photos containing people contribute a considerable portion of daily uploads to the internet and social networking sites, and
thus human-centric understanding has significant demand
in practice. From a research perspective, the person cate-
gory involves a rich set of actions/verbs, most of which are
Figure 1. Detecting and recognizing human-object interactions.
(a) There can be many possible objects (green boxes) interacting
with a detected person (blue box). (b) Our method estimates an
action-type specific density over target object locations from the
person’s appearance, which is represented by features extracted
from the detected person’s box. (c) A ⟨human, verb, object⟩
triplet detected by our method, showing the person box, action
(cut), and target object box and category (knife). (d) Another pre-
dicted action (stand), noting that a person can simultaneously take
multiple actions and an action may not involve any objects.
rarely taken by other subjects (e.g., to talk, throw, work).
The fine granularity of human actions and their interactions
with a wide array of object types presents a new challenge
compared to recognition of entry-level object categories.
In this paper, we present a human-centric model for rec-
ognizing human-object interaction. Our central observation
is that a person’s appearance, which reveals their action and
pose, is highly informative for inferring where the target
object of the interaction may be located (Figure 1(b)). The
search space for the target object can thus be narrowed by
conditioning on this estimation. Although there are often
many objects detected (Figure 1(a)), the inferred target location can help the model to quickly pick the correct object associated with a specific action (Figure 1(c)).
We implement this idea as a human-centric recognition