DEFORMABLE DETR: A Deformable Transformer for Tackling Object Detection Challenges
DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION is a research paper published at the International Conference on Learning Representations (ICLR) in 2021. The paper proposes an innovative deep-learning method that improves end-to-end object detection, in particular the ability to detect small objects, while reducing the reliance on hand-designed components.

DETR (DEtection TRansformer) is a Transformer-based model that has shown strong performance in object detection, but it faces two key challenges. First, when the Transformer attention modules process image feature maps, the global attention mechanism makes convergence slow, which limits the model's practical efficiency. Second, the quadratic cost of attention restricts the spatial resolution of the feature maps the model can afford to process, which hurts detection accuracy, especially for small objects.

To overcome these problems, the authors propose Deformable DETR, whose core innovation is an attention module that attends to only a small, fixed number of keys sampled around a reference point. This "deformable" attention mechanism lets the model focus precisely on likely object regions, cutting the required training epochs by roughly 10× while also enabling higher spatial resolution, reducing the computational burden, and improving overall performance, particularly on small objects.

The experiments, conducted on the COCO benchmark, show that Deformable DETR significantly outperforms the original DETR, especially on small-object detection. The authors have also released their code on GitHub so that researchers can explore and build on the technique. Beyond advancing object detection, the paper offers an important new perspective on optimizing Transformer architectures for image-processing tasks.
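To make the mechanism concrete, below is a minimal single-scale, single-head sketch of deformable attention in PyTorch. All names are ours, and using `grid_sample` for bilinear sampling is a simplification: the official release samples multi-scale features with multiple heads through a custom CUDA kernel, and predicts offsets in absolute pixel units rather than normalized coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: each query attends to only
    n_points sampled locations around its reference point, instead of
    to every pixel of the H x W feature map."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)  # (dx, dy) per point
        self.weight_proj = nn.Linear(dim, n_points)      # one weight per point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; feat: (B, C, H, W)
        B, Nq, _ = query.shape
        value = self.value_proj(feat)                               # (B, C, H, W)
        # Offsets are added in normalized coordinates here for simplicity;
        # the paper predicts them in absolute pixel units.
        offsets = self.offset_proj(query).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(query).softmax(-1)               # (B, Nq, K)
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, loc, align_corners=False)    # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1)              # (B, C, Nq)
        return self.out_proj(out.transpose(1, 2))                   # (B, Nq, C)
```

Because each query reads only `n_points` locations instead of all $H \times W$ pixels, the attention cost is linear in the number of queries rather than quadratic in the feature-map size.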
Published as a conference paper at ICLR 2021
In Eq. 1 of the paper, multi-head attention is computed as

$$\text{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big], \tag{1}$$

where $m$ indexes the attention head, $W'_m \in \mathbb{R}^{C_v \times C}$ and $W_m \in \mathbb{R}^{C \times C_v}$ are learnable weights ($C_v = C/M$ by default). The attention weights $A_{mqk} \propto \exp\!\big\{ \tfrac{z_q^\top U_m^\top V_m x_k}{\sqrt{C_v}} \big\}$ are normalized so that $\sum_{k \in \Omega_k} A_{mqk} = 1$, in which $U_m, V_m \in \mathbb{R}^{C_v \times C}$ are also learnable weights. To disambiguate different spatial positions, the representation features $z_q$ and $x_k$ are usually the concatenation or summation of element contents and positional embeddings.
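For concreteness, here is a minimal sketch (our naming) of how the per-head weights $A_{mqk}$ above can be computed with plain tensor operations:

```python
import math
import torch

def attention_weights(z_q, x, U_m, V_m):
    """A_mqk for one head m: softmax over keys k of z_q^T U_m^T V_m x_k / sqrt(C_v).

    z_q: (Nq, C) queries;  x: (Nk, C) keys;  U_m, V_m: (Cv, C) projections.
    Returns an (Nq, Nk) matrix whose rows each sum to 1 over Omega_k.
    """
    C_v = U_m.shape[0]
    logits = (z_q @ U_m.T) @ (x @ V_m.T).T / math.sqrt(C_v)
    return logits.softmax(dim=-1)
```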
There are two known issues with Transformers. One is that Transformers need long training schedules before convergence. Suppose the numbers of query and key elements are $N_q$ and $N_k$, respectively. Typically, with proper parameter initialization, $U_m z_q$ and $V_m x_k$ follow distributions with mean 0 and variance 1, which makes the attention weights $A_{mqk} \approx \frac{1}{N_k}$ when $N_k$ is large. This near-uniform attention leads to ambiguous gradients for the input features, so long training schedules are required before the attention weights can focus on specific keys. In the image domain, where the key elements are usually image pixels, $N_k$ can be very large and convergence is slow.
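A quick numerical check of this claim (the sizes and the Xavier-style initialization are illustrative):

```python
import torch

torch.manual_seed(0)
C, C_v, N_k = 256, 32, 10_000        # N_k ~ number of pixels in a feature map
z_q, x = torch.randn(C), torch.randn(N_k, C)
U = torch.randn(C_v, C) / C ** 0.5   # Xavier-style scale keeps variance ~1
V = torch.randn(C_v, C) / C ** 0.5
logits = (U @ z_q) @ (V @ x.T) / C_v ** 0.5
A = logits.softmax(-1)
# Every weight stays within a small factor of the uniform value 1/N_k,
# so no key stands out and the gradient signal is spread thin.
print(f"mean={A.mean().item():.1e} (=1/N_k), max={A.max().item():.1e}")
```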
On the other hand, the computational and memory complexity of multi-head attention can be very high with numerous query and key elements. The computational complexity of Eq. 1 is $O(N_q C^2 + N_k C^2 + N_q N_k C)$. In the image domain, where the query and key elements are both pixels, $N_q = N_k \gg C$, so the complexity is dominated by the third term, $O(N_q N_k C)$. Thus, the multi-head attention module suffers from quadratic complexity growth with the feature-map size.
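A back-of-the-envelope calculation makes the dominance of the third term concrete (numbers are illustrative):

```python
C = 256
for N in (1_000, 10_000, 100_000):   # N_q = N_k = N pixels
    proj = 2 * N * C**2              # the two projection terms
    attn = N * N * C                 # the query-key interaction term
    print(f"N={N:>7,}: proj={proj:.1e}  attn={attn:.1e}  ratio={attn / proj:.0f}x")
```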
DETR. DETR (Carion et al., 2020) is built upon the Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that forces unique predictions for each ground-truth bounding box via bipartite matching (a toy sketch of the matching follows). We briefly review the network architecture below.
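The unique assignment enforced by the Hungarian loss can be computed with an off-the-shelf linear-sum-assignment solver; the costs below are made-up stand-ins for DETR's combined classification and box-regression costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows are predictions, columns are ground-truth boxes (toy costs).
cost = np.array([
    [0.2, 0.9],   # prediction 0 is cheap to match with gt 0
    [0.8, 0.1],   # prediction 1 is cheap to match with gt 1
    [0.5, 0.6],   # prediction 2 stays unmatched -> "no object"
])
pred_idx, gt_idx = linear_sum_assignment(cost)
print([(int(p), int(g)) for p, g in zip(pred_idx, gt_idx)])  # [(0, 0), (1, 1)]
```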
Given input feature maps $x \in \mathbb{R}^{C \times H \times W}$ extracted by a CNN backbone (e.g., ResNet (He et al., 2016)), DETR exploits a standard Transformer encoder-decoder architecture to transform the input feature maps into the features of a set of object queries. A 3-layer feed-forward network (FFN) and a linear projection are added on top of the object-query features (produced by the decoder) as the detection head. The FFN acts as the regression branch to predict the bounding-box coordinates $b \in [0, 1]^4$, where $b = \{b_x, b_y, b_w, b_h\}$ encodes the normalized box center coordinates, box height, and box width (relative to the image size). The linear projection acts as the classification branch to produce the classification results.
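A sketch of such a head, following the description above (the hidden size and class count are illustrative, not taken from the paper):

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the head described above: a 3-layer FFN regresses the
    normalized box, a single linear layer produces class logits."""

    def __init__(self, hidden_dim: int = 256, num_classes: int = 91):
        super().__init__()
        self.bbox_ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),          # (b_x, b_y, b_w, b_h)
        )
        self.class_proj = nn.Linear(hidden_dim, num_classes + 1)  # + "no object"

    def forward(self, query_feats):                   # (B, N, hidden_dim)
        boxes = self.bbox_ffn(query_feats).sigmoid()  # squashed into [0, 1]^4
        logits = self.class_proj(query_feats)
        return boxes, logits
```

At inference, the $N$ per-query outputs form the final set of detections directly, without non-maximum suppression.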
For the Transformer encoder in DETR, both the query and key elements are pixels in the feature maps. The inputs are ResNet feature maps (with encoded positional embeddings). Let $H$ and $W$ denote the feature-map height and width, respectively. The computational complexity of self-attention is $O(H^2 W^2 C)$, which grows quadratically with the spatial size.
For the Transformer decoder in DETR, the input includes both the feature maps from the encoder and $N$ object queries represented by learnable positional embeddings (e.g., $N = 100$). There are two types of attention modules in the decoder, namely cross-attention and self-attention modules. In the cross-attention modules, object queries extract features from the feature maps: the query elements are the object queries, the key elements are the output feature maps of the encoder, so $N_q = N$ and $N_k = H \times W$, and the complexity of cross-attention is $O(HWC^2 + NHWC)$. This complexity grows linearly with the spatial size of the feature maps. In the self-attention modules, object queries interact with each other so as to capture their relations: the query and key elements are both the object queries, so $N_q = N_k = N$, and the complexity of the self-attention module is $O(2NC^2 + N^2C)$, which is acceptable for a moderate number of object queries.
DETR is an attractive design for object detection that removes the need for many hand-designed components. However, it has its own issues, which can mainly be attributed to the deficits of Transformer attention in handling image feature maps as key elements: (1) DETR has relatively low performance in detecting small objects. Modern object detectors use high-resolution feature maps to better detect small objects, but high-resolution feature maps would lead to unacceptable complexity for the self-attention module in the Transformer encoder of DETR, which is quadratic in the spatial size of the input feature maps. (2) Compared with modern object detectors, DETR requires many more training epochs to converge. This is mainly because the attention modules processing the image features are difficult to train: at initialization, the cross-attention modules apply almost uniform attention over the whole feature maps, whereas by the end of training the attention maps have become very sparse, focusing only on the object extremities.