Leveraging YOLO-World and GPT-4V LMMs for
Zero-Shot Person Detection and Action Recognition in Drone Imagery
Christian Limberg¹, Artur Gonçalves¹, Bastien Rigault¹ and Helmut Prendinger¹
¹National Institute of Informatics (NII), Tokyo, Japan, cnlimberg@gmail.com
*This work was supported by a fellowship within the IFI program of the German Academic Exchange Service (DAAD).
Abstract— In this article, we explore the potential of zero-
shot Large Multimodal Models (LMMs) in the domain of drone
perception. We focus on person detection and action recognition
tasks and evaluate two prominent LMMs, namely YOLO-World
and GPT-4V(ision) using a publicly available dataset captured
from aerial views. Traditional deep learning approaches rely
heavily on large and high-quality training datasets. However,
in certain robotic settings, acquiring such datasets can be
resource-intensive or impractical within a reasonable time-
frame. The flexibility of prompt-based LMMs and their exceptional generalization capabilities
have the potential to revolutionize robotics applications in these
scenarios. Our findings suggest that YOLO-World demonstrates
good detection performance. GPT-4V struggles with accurately
classifying action classes but delivers promising results in
filtering out unwanted region proposals and in providing a
general description of the scenery. This research represents
an initial step in leveraging LMMs for drone perception and
establishes a foundation for future investigations in this area.
I. INTRODUCTION
Recent advances in Large Language Models (LLMs) have
transformed many aspects of Machine Learning and AI [1],
[2]. Previously, the most common approach involved gather-
ing datasets that captured small contexts within specific task
domains. However, with the advent of foundation models
such as LLMs, trained on much larger datasets, this paradigm
has shifted. These models can now be utilized by providing
them with prompts that specify the domain and task. Thanks
to their strong generalization abilities, these foundation mod-
els can often be applied in a zero-shot manner [3]. While
LLMs were originally designed for processing text in Natural
Language Processing (NLP) tasks, Large Multimodal Models
(LMMs) have expanded their capabilities by incorporating
additional modalities [4], [5], [6] such as images, sounds,
and videos.
In this article, we delve into the application of two
recent image-based LMMs within a drone setting. Firstly,
we examine the YOLO-World model [7], which facilitates
prompt-based object detection. Secondly, we utilize the more
general vision model GPT-4V [4] for classifying the detected
region proposals.
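To make this two-step pipeline concrete, the following sketch shows how a prompt-based detector of this kind can be queried for the person class. It uses the open-source Ultralytics implementation of YOLO-World; the checkpoint name, image path, and confidence threshold are illustrative assumptions, not the exact configuration evaluated in this study.

# Minimal sketch of prompt-based person detection with YOLO-World,
# using the Ultralytics implementation. Checkpoint name, image path,
# and confidence threshold are illustrative assumptions.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-worldv2.pt")   # pre-trained open-vocabulary detector
model.set_classes(["person"])             # the text prompt defines the target class

# Zero-shot detection on a single aerial frame.
results = model.predict("drone_frame.jpg", conf=0.25)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"person at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), "
          f"score={float(box.conf):.2f}")

Because the target classes are supplied as plain text, retargeting the detector to a different object category only requires changing the list passed to set_classes.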
An important challenge in aerial robotics is ensuring that
drones operate reliably across a wide spectrum of potential
failures. This necessitates the acquisition of a high-quality,
problem-specific dataset, which can be resource-intensive or
even impractical to obtain. Moreover, conventionally trained
models tend to excel only within the confines of their training
data. Minor variations in the environment, such as changes
in weather, seasonal fluctuations, or geographical differences,
can lead to a significant decline in the robot’s reliability.
LMMs, trained on a broader contextual scope, may not
achieve competitive performance compared to their tradi-
tional counterparts within their narrowly defined training
contexts [1]. However, their ability to generalize across
domains, facilitated by their training on significantly broader
ranges of data, enables them to better handle challenging
conditions.
This preliminary study investigates the feasibility of ap-
plying YOLO-World and GPT-4V in a practical aerial robotic
scenario involving person detection and action recognition.
A real-world application could entail locating individuals in
need following a disaster [8], [9]. Given the unpredictable
nature of potential disasters, it is crucial to utilize a model
with extensive generalization capabilities, capable of oper-
ating effectively across diverse settings. Both YOLO-World
and GPT-4V are zero-shot approaches prompted with text.
This means they could potentially be deployed in unforeseen
scenarios, as the text prompts can be adjusted quickly,
enabling the equipped robot to adapt to entirely different
objectives instantly.
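As an illustration of this flexibility, the hedged sketch below sends a single cropped region proposal to GPT-4V through the OpenAI API and asks for an action label. The model name, prompt wording, candidate action list, and file path are assumptions for illustration and do not necessarily match the prompts used in our experiments; adapting the system to a new objective would only require editing the prompt text.

# Hedged sketch: classify a cropped region proposal with GPT-4V via the
# OpenAI API. Model name, prompt wording, action list, and file path are
# illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("person_crop.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "The image shows a person seen from a drone. "
    "Answer with exactly one action from this list: "
    "walking, running, lying, sitting, standing, carrying."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=10,
)
print(response.choices[0].message.content)  # e.g. "walking"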
This manuscript is structured as follows: Section II pro-
vides an overview of the most important related works.
Section III-A discusses the publicly available Okutama-
Action dataset that we are utilizing for our evaluation. In
Section III-B, we focus on detecting persons using YOLO-
World. In Section III-C, we apply GPT-4V on the detected
region proposals to recognize the persons’ actions. Finally,
Section IV summarizes our findings and concludes the paper.
II. RELATED WORK
From the very beginnings of computer vision research,
object detection has been a prominent task of interest. Initially, hand-crafted features [10] were utilized for detecting and
recognizing objects. With the rise of deep learning, convolu-
tional neural networks, which derive features automatically
from the training data, quickly overtook established hand-
crafted methods in terms of accuracy and robustness [11].
“Two-stage” methods like R-CNN [12] and R-FCN [13] first detect candidate region proposals and then classify them.
Later, “one-stage” methods such as SSD [14] and
YOLO [15] established themselves, achieving higher pro-
cessing speeds by detecting and classifying objects with
one forward pass through the network. YOLO in particular has become an extremely popular object detection method,