Figure 2: Representation flows for several typical detection frameworks: (a) Faster R-CNN (Anchor → Proposal → Detection), (b) RetinaNet (Anchor → Detection), (c) FCOS (Object Center → Detection), and (d) CornerNet (Corner Points → Grouping → Detection).
bounding box can be described by a 4-d vector, either as center-size $(x_c, y_c, w, h)$ or as opposing corners $(x_{tl}, y_{tl}, x_{br}, y_{br})$. Besides the final output, this representation is also commonly used as the initial and intermediate object representation, such as anchors [24, 20, 22, 23, 18] and proposals [9, 4, 17, 11]. For bounding box representations, features are usually extracted by pooling operators within the bounding box area on an image feature map. Common pooling operators include RoIPool [8], RoIAlign [11], and Deformable RoIPool [5, 40]. There are also simplified feature extraction methods, e.g., the box center features are usually employed in the anchor box representation [24, 18].
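The two 4-d parameterizations carry the same information and can be converted back and forth. A minimal sketch of the conversion (illustrative helper names, not code from any cited framework):

```python
import numpy as np

def center_size_to_corners(boxes):
    """Convert (x_c, y_c, w, h) boxes to (x_tl, y_tl, x_br, y_br)."""
    xc, yc, w, h = np.split(np.asarray(boxes, dtype=float), 4, axis=-1)
    return np.concatenate([xc - w / 2, yc - h / 2,
                           xc + w / 2, yc + h / 2], axis=-1)

def corners_to_center_size(boxes):
    """Convert (x_tl, y_tl, x_br, y_br) boxes to (x_c, y_c, w, h)."""
    x0, y0, x1, y1 = np.split(np.asarray(boxes, dtype=float), 4, axis=-1)
    return np.concatenate([(x0 + x1) / 2, (y0 + y1) / 2,
                           x1 - x0, y1 - y0], axis=-1)
```

Both functions operate on arbitrary batches of boxes via the trailing axis, so the same code serves anchors, proposals, and final detections.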
Object center representation
The 4-d vector space of a bounding box representation is at a scale of $O(H^2 \times W^2)$ for an image with resolution $H \times W$, which is too large to process fully. To reduce the representation space, some recent frameworks [29, 35, 38, 14, 32] use the center point as a simplified representation. Geometrically, a center point is described by a 2-d vector $(x_c, y_c)$, whose hypothesis space is of the scale $O(H \times W)$, which is much more tractable. For a center point representation, the image feature on the center point is usually employed as the object feature.
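Sampling the object feature at the center point amounts to indexing the feature map at the center's (downsampled) location. A minimal sketch, assuming a $(C, H', W')$ feature map produced at a known stride (function and argument names are illustrative):

```python
import numpy as np

def center_point_features(feature_map, centers, stride):
    """Sample the feature vector at each object's center point.

    feature_map: array of shape (C, H', W') at the given stride.
    centers: array-like of shape (N, 2) holding (x_c, y_c) in image coords.
    Returns an (N, C) array of per-object features.
    """
    cs = (np.asarray(centers, dtype=float) / stride).astype(int)
    # Clamp to the feature map bounds before indexing.
    xs = np.clip(cs[:, 0], 0, feature_map.shape[2] - 1)
    ys = np.clip(cs[:, 1], 0, feature_map.shape[1] - 1)
    return feature_map[:, ys, xs].T  # fancy indexing gathers (C, N)
```

Real systems often replace the integer rounding with bilinear interpolation, but the nearest-cell lookup above captures the core idea.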
Corner representation
A bounding box can be determined by two points, e.g., a top-left corner and a bottom-right corner. Some approaches [30, 15, 16, 7, 21, 39, 26] first detect these individual points and then compose bounding boxes from them. We refer to these representation methods as corner representation. The image feature at the corner location can be employed as the part feature.
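The composition step pairs top-left and bottom-right detections into boxes. A simplified sketch of embedding-based grouping in the spirit of CornerNet (a pair is kept when the bottom-right corner lies below and to the right of the top-left corner and their scalar embeddings are close; names and the threshold are illustrative assumptions):

```python
def compose_boxes(tl_points, br_points, tl_emb, br_emb, max_dist=0.5):
    """Pair detected top-left/bottom-right corners into boxes.

    tl_points, br_points: lists of (x, y) corner locations.
    tl_emb, br_emb: per-corner scalar embeddings learned to match
    within the same object (associative-embedding style).
    """
    boxes = []
    for (x0, y0), e0 in zip(tl_points, tl_emb):
        for (x1, y1), e1 in zip(br_points, br_emb):
            # Geometric validity: bottom-right must be below-right.
            if x1 > x0 and y1 > y0 and abs(e0 - e1) < max_dist:
                boxes.append((x0, y0, x1, y1))
    return boxes
```

This quadratic all-pairs loop is the post-processing step that object-based representations avoid, as discussed next.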
Summary and comparison
Different representation approaches usually have strengths in different aspects. For example, object-based representations (bounding box and center) are typically better at category classification but worse at object localization than part-based representations (corners). Object-based representations are also friendlier to end-to-end learning, because they do not require a post-processing step to compose objects from corners as part-based representation methods do. Among object-based representations, the bounding box representation enables more sophisticated feature extraction and multi-stage processing, while the center representation is attractive for its simplified system design.
2.2 Object Detection Frameworks in a Representation View
Object detection methods can be seen as evolving intermediate object/part representations until the final bounding box outputs. These representation flows largely shape different object detectors. Several major categorizations of object detectors are based on the representation flow: top-down (object-based representation) vs. bottom-up (part-based representation), anchor-based (bounding box based) vs. anchor-free (center point based), and single-stage (one-time representation flow) vs. multi-stage (multiple-time representation flow). Figure 2 shows the representation flows of several typical object detection frameworks, as detailed below.
Faster R-CNN
[24] employs bounding boxes as its intermediate object representation in all stages. At the beginning, multiple anchor boxes at each feature map position are hypothesized to coarsely cover the 4-d bounding box space of an image, i.e., 3 anchor boxes with different aspect ratios. The image feature vector at the center point is extracted to represent each anchor box, and is then used for foreground/background classification and localization refinement. After anchor box selection and localization refinement, the object representation evolves into a set of proposal boxes, whose object features are usually extracted by an RoIAlign operator within each box area. The final bounding box outputs are obtained by localization refinement, through a small network on the proposal features.
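The anchor-hypothesis step can be sketched as follows: at every feature map position, one anchor of (roughly) fixed area is placed per aspect ratio. This is a minimal illustration of the scheme, not Faster R-CNN's actual implementation; the base size and ratios are assumed defaults:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, base_size=128,
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Hypothesize (x_c, y_c, w, h) anchors at every feature position.

    One anchor per aspect ratio, all with equal area base_size**2,
    centered on the corresponding image-space location.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            xc, yc = (j + 0.5) * stride, (i + 0.5) * stride
            for r in aspect_ratios:
                # w/h = 1/r while preserving area base_size**2.
                w = base_size * np.sqrt(1.0 / r)
                h = base_size * np.sqrt(r)
                anchors.append((xc, yc, w, h))
    return np.array(anchors)  # (feat_h * feat_w * len(aspect_ratios), 4)
```

Each anchor is then scored and refined by the region proposal head, after which the survivors become proposal boxes.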
RetinaNet
[18] is a one-stage object detector, which also employs bounding boxes as its intermediate representation. Due to its one-stage nature, it usually requires denser anchor hypotheses, i.e., 9 anchor boxes at each feature map position. The final bounding box outputs are also obtained by applying a localization refinement head network.
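The denser hypothesis set per position arises from crossing aspect ratios with intermediate octave scales. A sketch of the 9 per-position anchor shapes, assuming the commonly used 3 ratios x 3 scales setting (parameter values are assumptions, not taken from this paper):

```python
import numpy as np

def anchors_per_position(base_size=32,
                         ratios=(0.5, 1.0, 2.0),
                         scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return the (w, h) of the 9 anchors at one feature map position:
    3 aspect ratios crossed with 3 intra-octave scales."""
    shapes = []
    for s in scales:
        size = base_size * s
        for r in ratios:
            # Equal area size**2 per scale, aspect ratio w/h = 1/r.
            shapes.append((size * np.sqrt(1.0 / r), size * np.sqrt(r)))
    return np.array(shapes)  # (9, 2)
```

Compared with the 3 anchors per position above, this denser tiling compensates for the absence of a proposal refinement stage.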