DEFORMABLE DETR: A Deformable Transformer for Tackling Object Detection Challenges
DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION is a research paper published at the International Conference on Learning Representations (ICLR) in 2021. The paper proposes an innovative deep-learning method that improves end-to-end object detection, in particular the ability to detect small objects, while reducing the reliance on hand-designed components.

DETR (DEtection TRansformer) is a Transformer-based model that has shown strong performance in object detection, but it faces two key challenges. First, when the Transformer attention modules process image feature maps, the global attention mechanism makes convergence slow, which limits the model's practical efficiency. Second, the quadratic cost of attention restricts the spatial resolution of the feature maps the model can afford to process, which hurts detection accuracy, especially for small objects.

To overcome these problems, the authors propose Deformable DETR, whose core innovation is an attention module that attends to only a small, fixed number of keys sampled around a reference point. This "deformable" attention mechanism lets the model focus precisely on likely object regions, cutting the required training epochs by roughly 10× while also enabling higher spatial resolution, reducing the computational burden, and improving overall performance, particularly on small objects.

The experiments, conducted on the COCO benchmark, show that Deformable DETR significantly outperforms the original DETR, especially on small-object detection. The authors have also released their code on GitHub so that researchers can explore and build on the technique. Beyond advancing object detection, the paper offers an important new perspective on optimizing Transformer architectures for image-processing tasks.
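To make the mechanism concrete, below is a minimal single-scale, single-head sketch of deformable attention in PyTorch. All names are ours, and using `grid_sample` for bilinear sampling is a simplification: the official release samples multi-scale features with multiple heads through a custom CUDA kernel, and predicts offsets in absolute pixel units rather than normalized coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: each query attends to only
    n_points sampled locations around its reference point, instead of
    to every pixel of the H x W feature map."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)  # (dx, dy) per point
        self.weight_proj = nn.Linear(dim, n_points)      # one weight per point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; feat: (B, C, H, W)
        B, Nq, _ = query.shape
        value = self.value_proj(feat)                               # (B, C, H, W)
        # Offsets are added in normalized coordinates here for simplicity;
        # the paper predicts them in absolute pixel units.
        offsets = self.offset_proj(query).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(query).softmax(-1)               # (B, Nq, K)
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, loc, align_corners=False)    # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1)              # (B, C, Nq)
        return self.out_proj(out.transpose(1, 2))                   # (B, Nq, C)
```

Because each query reads only `n_points` locations instead of all $H \times W$ pixels, the attention cost is linear in the number of queries rather than quadratic in the feature-map size.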
Published as a conference paper at ICLR 2021
In Eq. 1 of the paper, multi-head attention is computed as

$$\text{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big], \tag{1}$$

where $m$ indexes the attention head, $W'_m \in \mathbb{R}^{C_v \times C}$ and $W_m \in \mathbb{R}^{C \times C_v}$ are learnable weights ($C_v = C/M$ by default). The attention weights $A_{mqk} \propto \exp\!\big\{ \tfrac{z_q^\top U_m^\top V_m x_k}{\sqrt{C_v}} \big\}$ are normalized so that $\sum_{k \in \Omega_k} A_{mqk} = 1$, in which $U_m, V_m \in \mathbb{R}^{C_v \times C}$ are also learnable weights. To disambiguate different spatial positions, the representation features $z_q$ and $x_k$ are usually the concatenation or summation of element contents and positional embeddings.
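For concreteness, here is a minimal sketch (our naming) of how the per-head weights $A_{mqk}$ above can be computed with plain tensor operations:

```python
import math
import torch

def attention_weights(z_q, x, U_m, V_m):
    """A_mqk for one head m: softmax over keys k of z_q^T U_m^T V_m x_k / sqrt(C_v).

    z_q: (Nq, C) queries;  x: (Nk, C) keys;  U_m, V_m: (Cv, C) projections.
    Returns an (Nq, Nk) matrix whose rows each sum to 1 over Omega_k.
    """
    C_v = U_m.shape[0]
    logits = (z_q @ U_m.T) @ (x @ V_m.T).T / math.sqrt(C_v)
    return logits.softmax(dim=-1)
```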
There are two known issues with Transformers. One is that Transformers need long training schedules before convergence. Suppose the numbers of query and key elements are $N_q$ and $N_k$, respectively. Typically, with proper parameter initialization, $U_m z_q$ and $V_m x_k$ follow distributions with mean 0 and variance 1, which makes the attention weights $A_{mqk} \approx \frac{1}{N_k}$ when $N_k$ is large. This near-uniform attention leads to ambiguous gradients for the input features, so long training schedules are required before the attention weights can focus on specific keys. In the image domain, where the key elements are usually image pixels, $N_k$ can be very large and convergence is slow.
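A quick numerical check of this claim (the sizes and the Xavier-style initialization are illustrative):

```python
import torch

torch.manual_seed(0)
C, C_v, N_k = 256, 32, 10_000        # N_k ~ number of pixels in a feature map
z_q, x = torch.randn(C), torch.randn(N_k, C)
U = torch.randn(C_v, C) / C ** 0.5   # Xavier-style scale keeps variance ~1
V = torch.randn(C_v, C) / C ** 0.5
logits = (U @ z_q) @ (V @ x.T) / C_v ** 0.5
A = logits.softmax(-1)
# Every weight stays within a small factor of the uniform value 1/N_k,
# so no key stands out and the gradient signal is spread thin.
print(f"mean={A.mean().item():.1e} (=1/N_k), max={A.max().item():.1e}")
```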
On the other hand, the computational and memory complexity of multi-head attention can be very high with numerous query and key elements. The computational complexity of Eq. 1 is $O(N_q C^2 + N_k C^2 + N_q N_k C)$. In the image domain, where the query and key elements are both pixels, $N_q = N_k \gg C$, so the complexity is dominated by the third term, $O(N_q N_k C)$. Thus, the multi-head attention module suffers from quadratic complexity growth with the feature-map size.
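A back-of-the-envelope calculation makes the dominance of the third term concrete (numbers are illustrative):

```python
C = 256
for N in (1_000, 10_000, 100_000):   # N_q = N_k = N pixels
    proj = 2 * N * C**2              # the two projection terms
    attn = N * N * C                 # the query-key interaction term
    print(f"N={N:>7,}: proj={proj:.1e}  attn={attn:.1e}  ratio={attn / proj:.0f}x")
```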
DETR. DETR (Carion et al., 2020) is built upon the Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that forces unique predictions for each ground-truth bounding box via bipartite matching (a toy sketch of the matching follows). We briefly review the network architecture below.
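The unique assignment enforced by the Hungarian loss can be computed with an off-the-shelf linear-sum-assignment solver; the costs below are made-up stand-ins for DETR's combined classification and box-regression costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows are predictions, columns are ground-truth boxes (toy costs).
cost = np.array([
    [0.2, 0.9],   # prediction 0 is cheap to match with gt 0
    [0.8, 0.1],   # prediction 1 is cheap to match with gt 1
    [0.5, 0.6],   # prediction 2 stays unmatched -> "no object"
])
pred_idx, gt_idx = linear_sum_assignment(cost)
print([(int(p), int(g)) for p, g in zip(pred_idx, gt_idx)])  # [(0, 0), (1, 1)]
```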
Given input feature maps $x \in \mathbb{R}^{C \times H \times W}$ extracted by a CNN backbone (e.g., ResNet (He et al., 2016)), DETR exploits a standard Transformer encoder-decoder architecture to transform the input feature maps into the features of a set of object queries. A 3-layer feed-forward network (FFN) and a linear projection are added on top of the object-query features (produced by the decoder) as the detection head. The FFN acts as the regression branch to predict the bounding-box coordinates $b \in [0, 1]^4$, where $b = \{b_x, b_y, b_w, b_h\}$ encodes the normalized box center coordinates, box height, and box width (relative to the image size). The linear projection acts as the classification branch to produce the classification results.
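A sketch of such a head, following the description above (the hidden size and class count are illustrative, not taken from the paper):

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the head described above: a 3-layer FFN regresses the
    normalized box, a single linear layer produces class logits."""

    def __init__(self, hidden_dim: int = 256, num_classes: int = 91):
        super().__init__()
        self.bbox_ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),          # (b_x, b_y, b_w, b_h)
        )
        self.class_proj = nn.Linear(hidden_dim, num_classes + 1)  # + "no object"

    def forward(self, query_feats):                   # (B, N, hidden_dim)
        boxes = self.bbox_ffn(query_feats).sigmoid()  # squashed into [0, 1]^4
        logits = self.class_proj(query_feats)
        return boxes, logits
```

At inference, the $N$ per-query outputs form the final set of detections directly, without non-maximum suppression.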
For the Transformer encoder in DETR, both the query and key elements are pixels in the feature maps. The inputs are ResNet feature maps (with encoded positional embeddings). Let $H$ and $W$ denote the feature-map height and width, respectively. The computational complexity of self-attention is $O(H^2 W^2 C)$, which grows quadratically with the spatial size.
For the Transformer decoder in DETR, the input includes both the feature maps from the encoder and $N$ object queries represented by learnable positional embeddings (e.g., $N = 100$). There are two types of attention modules in the decoder, namely cross-attention and self-attention modules. In the cross-attention modules, object queries extract features from the feature maps: the query elements are the object queries, the key elements are the output feature maps of the encoder, so $N_q = N$ and $N_k = H \times W$, and the complexity of cross-attention is $O(HWC^2 + NHWC)$. This complexity grows linearly with the spatial size of the feature maps. In the self-attention modules, object queries interact with each other so as to capture their relations: the query and key elements are both the object queries, so $N_q = N_k = N$, and the complexity of the self-attention module is $O(2NC^2 + N^2C)$, which is acceptable for a moderate number of object queries.
DETR is an attractive design for object detection that removes the need for many hand-designed components. However, it has its own issues, which can mainly be attributed to the deficits of Transformer attention in handling image feature maps as key elements: (1) DETR has relatively low performance in detecting small objects. Modern object detectors use high-resolution feature maps to better detect small objects, but high-resolution feature maps would lead to unacceptable complexity for the self-attention module in the Transformer encoder of DETR, which is quadratic in the spatial size of the input feature maps. (2) Compared with modern object detectors, DETR requires many more training epochs to converge. This is mainly because the attention modules processing the image features are difficult to train: at initialization, the cross-attention modules apply almost uniform attention over the whole feature maps, whereas by the end of training the attention maps have become very sparse, focusing only on the object extremities.