YOLO Original Paper: A Unified Real-Time Object Detection Framework


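YOLO (You Only Look Once) is a landmark work in object detection, proposed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in 2016 in the paper "You Only Look Once: Unified, Real-Time Object Detection". The paper marks a major shift in approach: instead of repurposing classifiers to perform detection, it reframes detection as a regression problem. In two-stage detectors such as Faster R-CNN, a region proposal network (RPN) first preselects regions likely to contain objects through classification and bounding-box adjustment, and those candidate regions are then passed to an RoI pooling layer and a classifier for finer-grained class decisions. YOLO merges the two stages into a single neural network that predicts object locations (as bounding boxes) and class probabilities directly from the full image.

YOLO's core innovation is that it reduces detection to a single forward pass. Rather than relying on complex region selection and refinement, one deep convolutional neural network (CNN) predicts multiple bounding boxes and their class probabilities simultaneously. This yields a large gain in speed: the base YOLO model processes 45 images per second, and Fast YOLO reaches 155 frames per second while still achieving double the mean average precision (mAP) of other real-time detectors. Compared with state-of-the-art detection systems, YOLO makes more localization errors, but it is less likely to predict false positives on background. That trade-off suits applications with strict speed requirements, such as autonomous driving and video surveillance.

The success of the YOLO papers demonstrated the advantages of end-to-end optimization, and many later object detection systems borrow its simple, efficient design.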

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon∗, Santosh Divvala∗†, Ross Girshick, Ali Farhadi∗†

University of Washington∗, Allen Institute for AI†, Facebook AI Research

http://pjreddie.com/yolo/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].
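To give a rough sense of why evaluating a classifier "at various locations and scales" is expensive, the following minimal sketch counts how many classifier calls a naive sliding window would need. The image size, window sizes, and stride are illustrative assumptions, not values from the paper, and real systems such as DPM typically rescale the image (an image pyramid) rather than the window.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x, y, w, h) boxes for one square window size at evenly spaced locations."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, window, window)


def count_classifier_calls(image_w, image_h, windows, stride):
    """Count how many times a per-window classifier would run across all window sizes."""
    return sum(
        sum(1 for _ in sliding_windows(image_w, image_h, w, stride))
        for w in windows
    )


# Illustrative numbers only: a 448 x 448 image, three window sizes, stride 16.
calls = count_classifier_calls(448, 448, windows=(64, 128, 256), stride=16)
print(calls)
```

Even with these small, hypothetical settings the classifier runs more than a thousand times per image, and every added scale or object class multiplies the cost.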

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

[Figure 1 panels: 1. Resize image. 2. Run convolutional network. 3. Non-max suppression. Example detections: Dog: 0.30, Person: 0.64, Horse: 0.28.]

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.
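The last step shown in Figure 1 (confidence thresholding followed by non-max suppression) can be sketched as below. This is a generic greedy, per-class NMS, not the authors' code: the function names, both thresholds, and the box coordinates are assumptions made for illustration. Only the three confidence scores are taken from the figure.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, score_threshold=0.25, iou_threshold=0.5):
    """Threshold by confidence, then apply greedy per-class non-max suppression.

    detections: list of (label, score, (x1, y1, x2, y2)).
    Both thresholds are assumed defaults, not values from the paper.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for label, score, box in candidates:
        # Drop a box if a higher-scoring box of the same class overlaps it heavily.
        duplicate = any(
            k_label == label and iou(box, k_box) > iou_threshold
            for k_label, _, k_box in kept
        )
        if not duplicate:
            kept.append((label, score, box))
    return kept


# Hypothetical boxes; the scores 0.64, 0.30, and 0.28 are those shown in Figure 1.
detections = [
    ("person", 0.64, (100, 40, 220, 300)),
    ("person", 0.40, (105, 45, 225, 305)),  # near-duplicate of the box above
    ("dog", 0.30, (30, 200, 120, 310)),
    ("horse", 0.28, (200, 80, 420, 320)),
    ("dog", 0.10, (300, 300, 340, 340)),    # below the score threshold
]
result = filter_detections(detections)
```

In this example the low-confidence dog box is removed by the threshold and the second person box is suppressed as a duplicate, leaving one detection per object.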

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
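As a rough illustration of what "simultaneously predicts multiple bounding boxes and class probabilities" looks like as data, the sketch below uses the grid formulation described later in the paper (outside this excerpt): the image is divided into an S × S grid, each cell predicts B boxes of five values plus C class probabilities, and S = 7, B = 2, C = 20 on PASCAL VOC. The ordering of values inside each cell's vector and all function names are assumptions made for this example.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC setting)


def output_shape(s=S, b=B, c=C):
    """Shape of the prediction tensor: b * 5 + c values per grid cell."""
    return (s, s, b * 5 + c)


def split_cell(vector, b=B, c=C):
    """Split one cell's vector into b boxes and c class probabilities.

    Assumed layout (illustration only): [x, y, w, h, conf] repeated b times,
    followed by c class probabilities.
    """
    if len(vector) != b * 5 + c:
        raise ValueError("unexpected vector length")
    boxes = [tuple(vector[i * 5:(i + 1) * 5]) for i in range(b)]
    class_probs = list(vector[b * 5:])
    return boxes, class_probs


def class_scores(boxes, class_probs):
    """Class-specific score per box: box confidence times class probability."""
    return [[box[4] * p for p in class_probs] for box in boxes]


# One hypothetical cell: two boxes, and all class probability on the last class.
cell = [0.5, 0.5, 0.2, 0.3, 0.9] + [0.1, 0.1, 0.4, 0.4, 0.2] + [0.0] * 19 + [1.0]
boxes, probs = split_cell(cell)
scores = class_scores(boxes, probs)
```

With these values `output_shape()` gives a 7 × 7 × 30 tensor, so one forward pass yields every box and class score for the image at once.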

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
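The latency figure is consistent with simple arithmetic on the frame rates: with no batching, one frame takes roughly the reciprocal of the frame rate. The short check below assumes per-frame time is the only contribution to latency, which is a simplification; the helper name is made up for this example.

```python
def frame_time_ms(fps):
    """Approximate time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


for name, fps in (("YOLO", 45), ("Fast YOLO", 155)):
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
```

At 45 frames per second each frame takes about 22 ms, which sits under the 25 ms bound quoted above.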

Second, YOLO reasons globally about the image when
