Detecting and Recognizing Human-Object Interactions
Georgia Gkioxari  Ross Girshick  Piotr Dollár  Kaiming He
Facebook AI Research (FAIR)
Abstract
To understand the visual world, a machine must not only
recognize individual object instances but also how they in-
teract. Humans are often at the center of such interac-
tions and detecting human-object interactions is an impor-
tant practical and scientific problem. In this paper, we ad-
dress the task of detecting ⟨human, verb, object⟩ triplets
in challenging everyday photos. We propose a novel model
that is driven by a human-centric approach. Our hypothesis
is that the appearance of a person – their pose, clothing,
action – is a powerful cue for localizing the objects they
are interacting with. To exploit this cue, our model learns
to predict an action-specific density over target object loca-
tions based on the appearance of a detected person. Our
model also jointly learns to detect people and objects, and
by fusing these predictions it efficiently infers interaction
triplets in a clean, jointly trained end-to-end system we call
InteractNet. We validate our approach on the recently intro-
duced Verbs in COCO (V-COCO) and HICO-DET datasets,
where we show quantitatively compelling results.
1. Introduction
Visual recognition of individual instances, e.g., detecting objects [10, 9, 27] and estimating human actions/poses [12, 32, 2], has witnessed significant improvements thanks to deep learning visual representations [18, 30, 31, 17].
However, recognizing individual objects is just a first step
for machines to comprehend the visual world. To under-
stand what is happening in images, it is necessary to also
recognize relationships between individual instances. In
this work, we focus on human-object interactions.
The task of recognizing human-object interactions [13, 33, 6, 14, 5] can be represented as detecting ⟨human, verb, object⟩ triplets and is of particular interest in applications and in research. From a practical perspective, photos containing people contribute a considerable portion of daily uploads to the internet and social networking sites, and
thus human-centric understanding has significant demand
in practice. From a research perspective, the person cate-
gory involves a rich set of actions/verbs, most of which are
Figure 1. Detecting and recognizing human-object interactions.
(a) There can be many possible objects (green boxes) interacting
with a detected person (blue box). (b) Our method estimates an
action-type specific density over target object locations from the
person’s appearance, which is represented by features extracted
from the detected person’s box. (c) A ⟨human, verb, object⟩
triplet detected by our method, showing the person box, action
(cut), and target object box and category (knife). (d) Another pre-
dicted action (stand), noting that a person can simultaneously take
multiple actions and an action may not involve any objects.
rarely taken by other subjects (e.g., to talk, throw, work).
The fine granularity of human actions and their interactions
with a wide array of object types presents a new challenge
compared to recognition of entry-level object categories.
In this paper, we present a human-centric model for rec-
ognizing human-object interaction. Our central observation
is that a person’s appearance, which reveals their action and
pose, is highly informative for inferring where the target
object of the interaction may be located (Figure 1(b)). The
search space for the target object can thus be narrowed by
conditioning on this estimation. Although there are often
many objects detected (Figure 1(a)), the inferred target location can help the model to quickly pick the correct object associated with a specific action (Figure 1(c)).
We implement this idea as a human-centric recognition