Leveraging YOLO-World and GPT-4V LMMs for
Zero-Shot Person Detection and Action Recognition in Drone Imagery
Christian Limberg¹, Artur Gonçalves¹, Bastien Rigault¹ and Helmut Prendinger¹
¹National Institute of Informatics (NII), Tokyo, Japan, cnlimberg@gmail.com
*This work was supported by a fellowship within the IFI program of the German Academic Exchange Service (DAAD).
Abstract— In this article, we explore the potential of zero-
shot Large Multimodal Models (LMMs) in the domain of drone
perception. We focus on person detection and action recognition
tasks and evaluate two prominent LMMs, namely YOLO-World
and GPT-4V(ision) using a publicly available dataset captured
from aerial views. Traditional deep learning approaches rely
heavily on large and high-quality training datasets. However,
in certain robotic settings, acquiring such datasets can be
resource-intensive or impractical within a reasonable time-
frame. The flexibility of prompt-based LMMs and their exceptional generalization capabilities
have the potential to revolutionize robotics applications in these
scenarios. Our findings suggest that YOLO-World demonstrates
good detection performance. GPT-4V struggles with accurately
classifying action classes but delivers promising results in
filtering out unwanted region proposals and in providing a
general description of the scenery. This research represents
an initial step in leveraging LMMs for drone perception and
establishes a foundation for future investigations in this area.
I. INTRODUCTION
Recent advances in Large Language Models (LLMs) have
transformed many aspects of Machine Learning and AI [1],
[2]. Previously, the most common approach involved gather-
ing datasets that captured small contexts within specific task
domains. However, with the advent of foundation models
such as LLMs, trained on much larger datasets, this paradigm
has shifted. These models can now be utilized by providing
them with prompts that specify the domain and task. Thanks
to their strong generalization abilities, these foundation mod-
els can often be applied in a zero-shot manner [3]. While
LLMs were originally designed for processing text in Natural
Language Processing (NLP) tasks, Large Multimodal Models
(LMMs) have expanded their capabilities by incorporating
additional modalities [4], [5], [6] such as images, sounds,
and videos.
In this article, we delve into the application of two
recent image-based LMMs within a drone setting. Firstly,
we examine the YOLO-World model [7], which facilitates
prompt-based object detection. Secondly, we utilize the more
general vision model GPT-4V [4] for classifying the detected
region proposals.
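To make this two-step pipeline concrete, the following sketch shows how a prompt-based detector of this kind can be queried for the person class. It uses the open-source Ultralytics implementation of YOLO-World; the checkpoint name, image path, and confidence threshold are illustrative assumptions, not the exact configuration evaluated in this study.

# Minimal sketch of prompt-based person detection with YOLO-World,
# using the Ultralytics implementation. Checkpoint name, image path,
# and confidence threshold are illustrative assumptions.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-worldv2.pt")   # pre-trained open-vocabulary detector
model.set_classes(["person"])             # the text prompt defines the target class

# Zero-shot detection on a single aerial frame.
results = model.predict("drone_frame.jpg", conf=0.25)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"person at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), "
          f"score={float(box.conf):.2f}")

Because the target classes are supplied as plain text, retargeting the detector to a different object category only requires changing the list passed to set_classes.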
An important challenge in aerial robotics is ensuring that
drones operate reliably across a wide spectrum of potential
failures. This necessitates the acquisition of a high-quality,
problem-specific dataset, which can be resource-intensive or
even impractical to obtain. Moreover, conventionally trained
models tend to excel only within the confines of their training
data. Minor variations in the environment, such as changes
in weather, seasonal fluctuations, or geographical differences,
can lead to a significant decline in the robot’s reliability.
LMMs, trained on a broader contextual scope, may not
achieve competitive performance compared to their tradi-
tional counterparts within their narrowly defined training
contexts [1]. However, their ability to generalize across
domains, facilitated by their training on significantly broader
ranges of data, enables them to better handle challenging
conditions.
This preliminary study investigates the feasibility of ap-
plying YOLO-World and GPT-4V in a practical aerial robotic
scenario involving person detection and action recognition.
A real-world application could entail locating individuals in
need following a disaster [8], [9]. Given the unpredictable
nature of potential disasters, it is crucial to utilize a model
with extensive generalization capabilities, capable of oper-
ating effectively across diverse settings. Both YOLO-World
and GPT-4V are zero-shot approaches prompted with text.
This means they could potentially be deployed in unforeseen
scenarios, as the text prompts can be adjusted quickly,
enabling the equipped robot to adapt to entirely different
objectives instantly.
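As an illustration of this flexibility, the hedged sketch below sends a single cropped region proposal to GPT-4V through the OpenAI API and asks for an action label. The model name, prompt wording, candidate action list, and file path are assumptions for illustration and do not necessarily match the prompts used in our experiments; adapting the system to a new objective would only require editing the prompt text.

# Hedged sketch: classify a cropped region proposal with GPT-4V via the
# OpenAI API. Model name, prompt wording, action list, and file path are
# illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("person_crop.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "The image shows a person seen from a drone. "
    "Answer with exactly one action from this list: "
    "walking, running, lying, sitting, standing, carrying."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=10,
)
print(response.choices[0].message.content)  # e.g. "walking"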
This manuscript is structured as follows: Section II pro-
vides an overview of the most important related works.
Section III-A discusses the publicly available Okutama-
Action dataset that we are utilizing for our evaluation. In
Section III-B, we focus on detecting persons using YOLO-
World. In Section III-C, we apply GPT-4V on the detected
region proposals to recognize the persons’ actions. Finally,
Section IV summarizes our findings and concludes the paper.
II. RELATED WORK
From the very beginnings of computer vision research,
object detection has been a prominent task of interest. Initially, hand-crafted features [10] were utilized for detecting and
recognizing objects. With the rise of deep learning, convolu-
tional neural networks, which derive features automatically
from the training data, quickly overtook established hand-
crafted methods in terms of accuracy and robustness [11].
“Two-stage” methods like R-CNN [12] and R-FCN [13] first detect candidate region proposals and then classify them.
Later, “one-stage” methods such as SSD [14] and
YOLO [15] established themselves, achieving higher pro-
cessing speeds by detecting and classifying objects with
one forward pass through the network. YOLO in particular has become an extremely popular object detection method,