DETR: Applying Transformers to End-to-End Object Detection


"这篇论文《End-to-End Object Detection with Transformers》是Facebook AI团队提出的一种新的对象检测方法，它将目标检测视为直接的集合预测问题，摒弃了传统的手工设计组件，如非极大值抑制和锚框生成。这种方法的核心是DEtection TRansformer (DETR)，它采用基于集合的全局损失和Transformer编码器-解码器架构，通过学习到的对象查询来推理物体之间的关系和全局图像上下文，从而并行地直接输出最终预测结果。DETR模型在概念上简洁，不需要专门的库，与许多现代检测器相比，其准确性和运行时性能相当，并且与高度优化的Faster R-CNN基准进行了比较。" 在深度学习领域，目标检测是一个关键任务，用于识别和定位图像中的特定对象。传统的目标检测算法，如Faster R-CNN，通常包括多个步骤：特征提取、区域提议、分类和回归等，这些步骤往往包含许多手工设计的组件，如锚框（Anchor Boxes）用于生成可能的物体框，以及非极大值抑制（Non-Maximum Suppression, NMS）用于去除重叠的检测框。论文《End-to-End Object Detection with Transformers》引入了一种创新的方法，即DETR，它通过Transformer架构实现端到端的目标检测。Transformer最早在自然语言处理中被提出，因其强大的序列建模能力而受到广泛关注。DETR借鉴了Transformer的思想，但将其应用于视觉任务，特别是目标检测。 DETR的核心在于它的Transformer编码器-解码器结构。编码器负责从输入图像中提取特征，这通常由预训练的卷积神经网络（如ResNet）完成。解码器则接收这些特征，并与一组固定数量的学习对象查询（Object Queries）交互。这些查询可以看作是待检测物体的潜在表示，解码器通过多头自注意力机制和交叉注意力机制来理解图像中的物体关系和全局上下文。论文中的“基于集合的全局损失”是另一个关键点，它通过 bipartite matching（二分匹配）强制唯一预测。这意味着DETR能够直接预测出不重复的物体框和类别，而无需NMS这样的后处理步骤。这种简化不仅减少了计算复杂性，也使得模型更加透明和易于理解。 DETR的另一个优点是其模块化设计。它不需要专门的库或者特定的优化技巧，这使得它更容易被其他研究者复现和扩展。尽管DETR在性能上与Faster R-CNN相当，但其端到端的特性可能为未来的目标检测算法提供新的研究方向，尤其是在简化模型结构和提高效率方面。《End-to-End Object Detection with Transformers》为深度学习目标检测提供了一个全新的视角，将Transformer的强大学习能力应用到视觉任务中，挑战了传统检测框架的设计，有望推动目标检测领域的进一步发展。

End-to-End Object Detection with Transformers 5

boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2.

3.1 Object detection set prediction loss

DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific (bounding box) losses.

Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of N predictions. Assuming N is larger than the number of objects in the image, we consider $y$ also as a set of size N padded with $\varnothing$ (no object). To find a bipartite matching between these two sets we search for a permutation of N elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}), \qquad (1)$$

where $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost between ground truth $y_i$ and a prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm, following prior work (e.g. [43]).
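
To make the search in equation (1) concrete, the following minimal sketch enumerates every permutation of a small cost matrix and keeps the cheapest one. It is a brute-force stand-in, not the Hungarian algorithm the paper uses: it returns the same optimum but takes factorial time, so it is only suitable for tiny N. The function name `best_assignment` and the example costs are illustrative assumptions. In practice a solver such as SciPy's `scipy.optimize.linear_sum_assignment` would typically be used instead.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (sigma, total) minimizing sum_i cost[i][sigma[i]].

    cost[i][j] is the pair-wise matching cost between ground-truth slot i
    and prediction j. The matrix is assumed to be square (N x N), i.e. the
    ground truth has already been padded with "no object" slots.
    """
    n = len(cost)
    best_sigma, best_total = None, float("inf")
    # Brute force over all N! permutations (illustration only).
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total


# Hypothetical costs: two real objects plus one padded "no object" slot,
# whose cost is the same constant for every prediction.
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.0, 0.0, 0.0],
]
sigma, total = best_assignment(cost)
print(sigma, round(total, 3))
```

Here object 0 is matched to prediction 1 and object 1 to prediction 0, while the padded slot absorbs the remaining prediction.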

The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element $i$ of the ground truth set can be seen as a $y_i = (c_i, b_i)$ where $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector that defines ground truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ as $-\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$.
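
As a rough illustration of the pair-wise matching cost defined above, the sketch below evaluates it for a single (ground truth, prediction) pair. Several things are assumptions made for the example: the "no object" label is represented by `None`, predictions are `(class_probabilities, box)` tuples, and a plain L1 distance stands in for the box loss, which the text defines separately.

```python
NO_OBJECT = None  # assumed encoding of the "no object" label


def l1_box_cost(b, b_hat):
    """Plain L1 distance between two boxes (stand-in for the box loss)."""
    return sum(abs(u - v) for u, v in zip(b, b_hat))


def match_cost(target, pred, box_cost=l1_box_cost):
    """Pair-wise matching cost for one ground-truth / prediction pair.

    target: (class index or NO_OBJECT, box as (cx, cy, w, h))
    pred:   (list of class probabilities, predicted box)
    """
    c, b = target
    if c is NO_OBJECT:
        # Both indicator terms vanish, so the cost is a constant (zero).
        return 0.0
    probs, b_hat = pred
    # Note the raw probability (not its logarithm) in the class term.
    return -probs[c] + box_cost(b, b_hat)


target = (1, (0.5, 0.5, 0.2, 0.2))
pred = ([0.1, 0.7, 0.2], (0.5, 0.5, 0.25, 0.2))
print(match_cost(target, pred))                      # about -0.65
print(match_cost((NO_OBJECT, None), pred))           # 0.0
```

Evaluating `match_cost` for every pair yields the N x N cost matrix over which the optimal permutation is searched.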

This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposal [37] or anchors [22] to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates.

The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right], \qquad (2)$$

where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when $c_i = \varnothing$ by a factor 10 to account for class imbalance. This is analogous to how Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Notice that the matching cost between an object and $\varnothing$ doesn't depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities $\hat{p}_{\hat{\sigma}(i)}(c_i)$ instead of log-probabilities. This makes the class prediction term commensurable to $\mathcal{L}_{\text{box}}(\cdot, \cdot)$ (described below), and we observed better empirical performances.

6 Carion et al.
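
The Hungarian loss of equation (2), including the down-weighting of the "no object" term, might be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the assignment `sigma` is taken as given, the "no object" probability is assumed to sit in the last entry of each probability vector, the box loss is passed in as a callable, and the normalization by the number of objects in the batch is left out.

```python
import math

NO_OBJECT = None  # assumed encoding of the "no object" label


def hungarian_loss(targets, preds, sigma, box_loss, no_object_weight=0.1):
    """Sum of per-slot losses for a given one-to-one assignment.

    targets[i] = (class index or NO_OBJECT, box or None)
    preds[j]   = (class probabilities, predicted box); by assumption the
                 last probability is the "no object" class.
    sigma[i]   = index of the prediction matched to target slot i.
    """
    total = 0.0
    for i, (c, b) in enumerate(targets):
        probs, b_hat = preds[sigma[i]]
        if c is NO_OBJECT:
            # Log-probability term down-weighted by a factor 10; no box term.
            total += -no_object_weight * math.log(probs[-1])
        else:
            total += -math.log(probs[c]) + box_loss(b, b_hat)
    return total


def l1(b, b_hat):
    """Simple L1 box distance used only for this example."""
    return sum(abs(u - v) for u, v in zip(b, b_hat))


targets = [(0, (0.5, 0.5, 0.2, 0.2)), (NO_OBJECT, None)]
preds = [
    ([0.2, 0.8], (0.1, 0.1, 0.1, 0.1)),  # mostly "no object"
    ([0.9, 0.1], (0.5, 0.5, 0.2, 0.2)),  # confident, well-localized object
]
loss = hungarian_loss(targets, preds, sigma=(1, 0), box_loss=l1)
print(loss)
```

Unlike the matching cost, the class term here uses the log-probability, which is what gets back-propagated once the assignment is fixed.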

Bounding box loss. The second part of the matching cost and the Hungarian loss is $\mathcal{L}_{\text{box}}(\cdot)$ that scores the bounding boxes. Unlike many detectors that do box predictions as a $\Delta$ w.r.t. some initial guesses, we make box predictions directly. While such an approach simplifies the implementation, it poses an issue with relative scaling of the loss. The most commonly-used $\ell_1$ loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the $\ell_1$ loss and the generalized IoU loss [38] $\mathcal{L}_{\text{iou}}(\cdot, \cdot)$ that is scale-invariant. Overall, our box loss is $\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$ defined as $\lambda_{\text{iou}}\,\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\,\|b_i - \hat{b}_{\sigma(i)}\|_1$ where $\lambda_{\text{iou}}, \lambda_{\text{L1}} \in \mathbb{R}$ are hyperparameters.
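
A small sketch of this box loss is given below, assuming boxes in (center x, center y, width, height) format with strictly positive width and height, and taking the generalized IoU loss to be 1 minus the generalized IoU. The default weights of 1.0 are placeholders, not values taken from the text.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def generalized_iou(a, b):
    """Generalized IoU of two boxes; assumes positive width and height."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclosing = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (enclosing - union) / enclosing


def box_loss(b, b_hat, lambda_iou=1.0, lambda_l1=1.0):
    """Weighted sum of the generalized IoU loss and the L1 distance."""
    l_iou = 1.0 - generalized_iou(b, b_hat)
    l_1 = sum(abs(u - v) for u, v in zip(b, b_hat))
    return lambda_iou * l_iou + lambda_l1 * l_1


left = (0.125, 0.5, 0.25, 1.0)   # covers x in [0, 0.25]
right = (0.875, 0.5, 0.25, 1.0)  # covers x in [0.75, 1.0]
print(generalized_iou(left, right), box_loss(left, right))
```

Shrinking both boxes by the same factor leaves the generalized IoU term unchanged while the L1 term shrinks with it, which is the scale issue that motivates combining the two.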

These two losses are normalized by the number of objects inside the batch.

3.2 DETR architecture

The overall DETR architecture is surprisingly simple and depicted in Figure 2. It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction.

Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines. Inference code for DETR can be implemented in less than 50 lines in PyTorch [32]. We hope that the simplicity of our method will attract new researchers to the detection community.

Backbone. Starting from the initial image $x_{\text{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values we use are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.
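
As a quick arithmetic check of these shapes, the helper below computes the activation-map size for a given input size. It assumes a total backbone stride of 32 (typical for a ResNet-style network) and rounds up for sizes that are not multiples of the stride; both the function name and the rounding rule are assumptions made for illustration.

```python
import math


def backbone_output_shape(h0, w0, channels=2048, stride=32):
    """Shape (C, H, W) of the activation map for an H0 x W0 input.

    Assumes the backbone downsamples by `stride` overall and rounds up
    when the input size is not an exact multiple of the stride.
    """
    return channels, math.ceil(h0 / stride), math.ceil(w0 / stride)


print(backbone_output_shape(800, 1216))  # (2048, 25, 38)
```

For an 800 x 1216 image this gives a 25 x 38 map, i.e. 950 spatial positions for the transformer to attend over.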

Transformer encoder. First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, hence we collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31, 3] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [47].
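
The two reshaping steps at the start of the encoder can be illustrated with plain nested lists. A 1x1 convolution is just the same linear map applied to the channel vector at every spatial position, and collapsing the spatial dimensions turns the d x H x W map into d x HW. The sketch below omits the bias term, the positional encodings, and the attention layers, and its tiny dimensions and weights are made up for the example.

```python
def conv1x1(f, weight):
    """Apply a 1x1 convolution (no bias) to a C x H x W feature map.

    f:      nested lists of shape C x H x W
    weight: nested lists of shape d x C
    Returns nested lists of shape d x H x W.
    """
    C, H, W = len(f), len(f[0]), len(f[0][0])
    return [
        [
            [sum(weight[k][c] * f[c][h][w] for c in range(C)) for w in range(W)]
            for h in range(H)
        ]
        for k in range(len(weight))
    ]


def flatten_spatial(z):
    """Collapse d x H x W into d x (H*W), reading positions row by row."""
    return [[v for row in channel for v in row] for channel in z]


# Toy example: C = 3 input channels, H = W = 2, reduced to d = 2 channels.
f = [
    [[1, 2], [3, 4]],
    [[0, 1], [0, 1]],
    [[2, 2], [2, 2]],
]
weight = [
    [1, 0, 0],  # output channel 0 copies input channel 0
    [0, 1, 1],  # output channel 1 sums input channels 1 and 2
]
z0 = conv1x1(f, weight)
sequence = flatten_spatial(z0)
print(sequence)  # [[1, 2, 3, 4], [2, 3, 2, 3]]
```

Each column of `sequence` is one spatial position, which is the token the encoder's self-attention operates on.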

The input images are batched together, applying 0-padding adequately to ensure they all have the same dimensions $(H_0, W_0)$ as the largest image of the batch.
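
A minimal sketch of this batching step, using nested lists, might look as follows. Padding on the bottom and right and returning a boolean padding mask are assumptions made for the example; the text only states that images are zero-padded to the largest size in the batch.

```python
def pad_batch(images):
    """Zero-pad C x H x W images to the largest H and W in the batch.

    Returns (padded, masks), where masks[n][y][x] is True at padded
    positions. Padding is added at the bottom and right (an assumption).
    """
    max_h = max(len(img[0]) for img in images)
    max_w = max(len(img[0][0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img[0]), len(img[0][0])
        padded.append([
            [row + [0] * (max_w - w) for row in channel]
            + [[0] * max_w for _ in range(max_h - h)]
            for channel in img
        ])
        masks.append([[y >= h or x >= w for x in range(max_w)]
                      for y in range(max_h)])
    return padded, masks


# Two single-channel toy images of sizes 2 x 3 and 3 x 2.
a = [[[1, 2, 3], [4, 5, 6]]]
b = [[[7, 8], [9, 10], [11, 12]]]
padded, masks = pad_batch([a, b])
print(padded[0])  # [[[1, 2, 3], [4, 5, 6], [0, 0, 0]]]
print(padded[1])  # [[[7, 8, 0], [9, 10, 0], [11, 12, 0]]]
```

A mask of this kind lets later attention layers ignore the padded positions, so the padding does not influence the predictions.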
