Architecture. Following the STM practice [18], we take res4 features with stride 16 from the base ResNets as our backbone features and discard res5. A $3\times3$ convolutional layer without non-linearity is used as a projection head from the backbone feature to either the key space ($C^k$ dimensional) or the value space ($C^v$ dimensional). We set $C^v$ to be 512 following STM and discuss the choice of $C^k$ in Section 4.1.
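To make this concrete, here is a minimal PyTorch sketch of the two projection heads. The module names, the 1024-channel res4 width (standard for ResNet-50), and the default $C^k = 64$ are our assumptions; the choice of $C^k$ is discussed in Section 4.1.

```python
import torch.nn as nn

class KeyValueProjection(nn.Module):
    """Sketch: project stride-16 res4 backbone features to key/value spaces."""
    def __init__(self, in_dim=1024, c_k=64, c_v=512):  # widths assumed
        super().__init__()
        # 3x3 convolutions without non-linearity, as described above.
        self.key_proj = nn.Conv2d(in_dim, c_k, kernel_size=3, padding=1)
        self.value_proj = nn.Conv2d(in_dim, c_v, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: (B, in_dim, H, W) at stride 16
        return self.key_proj(feat), self.value_proj(feat)
```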
Feature reuse. As seen from Figure 1, both the key encoder and the value encoder are processing the same frame, albeit with different inputs. It is natural to reuse features from the key encoder (with fewer inputs and a deeper network) at the value encoder. To avoid bloating the feature dimensions and for simplicity, we concatenate the last-layer features from both encoders (before the projection head) and process them with two ResBlocks [62] and a CBAM block³ [64] as the final value output.
3.2 Memory Reading and Decoding
Given $T$ memory frames and a query frame, the feature extraction step generates the following: memory key $k^M \in \mathbb{R}^{C^k \times THW}$, memory value $v^M \in \mathbb{R}^{C^v \times THW}$, and query key $k^Q \in \mathbb{R}^{C^k \times HW}$, where $H$ and $W$ are (stride 16) spatial dimensions. Then, for any similarity measure $c: \mathbb{R}^{C^k} \times \mathbb{R}^{C^k} \to \mathbb{R}$, we can compute the pairwise affinity matrix $S$ and the softmax-normalized affinity matrix $W$, where $S, W \in \mathbb{R}^{THW \times HW}$, with:
$$S_{ij} = c(k^M_i, k^Q_j), \qquad W_{ij} = \frac{\exp(S_{ij})}{\sum_n \exp(S_{nj})}, \tag{1}$$
where $k_i$ denotes the feature vector at the $i$-th position. The similarities are normalized by $\sqrt{C^k}$ as in standard practice [18, 33]; this normalization is not shown for brevity. In STM [18], the dot product is used as $c$.
Memory reading regularization like KMN [22] or top-k filtering [21] can be applied at this step.
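As a sketch, Equation 1 with the dot product as $c$ (including the $\sqrt{C^k}$ scaling) reduces to a single matrix product followed by a softmax over the memory axis; variable names are ours:

```python
import torch

def affinity(k_mem, k_qry):
    """Eq. 1 sketch with dot-product similarity.

    k_mem: (C_k, T*H*W) flattened memory keys
    k_qry: (C_k, H*W)   flattened query keys
    Returns W of shape (T*H*W, H*W); each column sums to one.
    """
    c_k = k_mem.shape[0]
    S = k_mem.t() @ k_qry / (c_k ** 0.5)  # pairwise similarities
    return torch.softmax(S, dim=0)        # normalize over memory positions
```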
With the normalized affinity matrix $W$, the aggregated readout feature $v^Q \in \mathbb{R}^{C^v \times HW}$ for the query frame can be computed as a weighted sum of the memory features with an efficient matrix multiplication:
$$v^Q = v^M W, \tag{2}$$
which is then passed to the decoder for mask generation.
In the case of multi-object segmentation, only Equation 2 has to be repeated, as $W$ is defined between image features only and is thus the same for different objects. In the case of STM [18], $W$ must be recomputed for each object instead. A detailed running time analysis can be found in Section 6.2.
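In the same sketch notation, Equation 2 and its reuse across objects look as follows (the per-object value tensors are hypothetical):

```python
def readout(v_mem, W):
    """Eq. 2 sketch: weighted sum of memory values, v^Q = v^M W.

    v_mem: (C_v, T*H*W), W: (T*H*W, H*W) -> v_qry: (C_v, H*W)
    """
    return v_mem @ W

# Multi-object case: W depends only on image features, so it is
# computed once per query frame and shared across all objects.
# W = affinity(k_mem, k_qry)                         # once
# readouts = [readout(v, W) for v in object_values]  # repeated per object
```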
Decoder. Our decoder structure stays close to that of STM [18], as it is not the focus of this paper. Features are processed and gradually upsampled by a factor of two at each stage, with higher-resolution features from the key encoder incorporated using skip-connections. The final layer of the decoder produces a stride 4 mask, which is bilinearly upsampled to the original resolution. In the case of multiple objects, soft aggregation [18] of the output masks is used.
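A minimal sketch of one such upsampling stage follows; module names and channel widths are our assumptions, and the actual STM-style decoder has more machinery:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    """Sketch: 2x upsample and merge a higher-resolution skip feature."""
    def __init__(self, skip_dim, in_dim, out_dim):
        super().__init__()
        self.skip_conv = nn.Conv2d(skip_dim, out_dim, 3, padding=1)
        self.out_conv = nn.Conv2d(in_dim, out_dim, 3, padding=1)

    def forward(self, skip_feat, x):
        # Upsample decoder features by 2x, then add the projected
        # key-encoder skip feature from the matching resolution.
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)
        return torch.relu(self.out_conv(x) + self.skip_conv(skip_feat))
```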
3.3 Memory Management
So far, we have assumed the existence of a memory bank of size $T$. Here, we describe the construction of the memory bank. For each memory frame, we store two items: a memory key and a memory value. Note that all memory frames (except the first one) were once query frames. The memory key is simply reused from the query key described in Section 3.1, without extra computation. The memory value is computed after mask generation of that frame, independently for each object, as the value encoder takes both the image and the object mask as inputs.
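This bookkeeping can be sketched as a simple list-based store (names are ours):

```python
import torch

class MemoryBank:
    """Sketch: per-frame (key, value) storage for T memory frames."""
    def __init__(self):
        self.keys = []    # each (C_k, H*W): the reused query key
        self.values = []  # each (num_objects, C_v, H*W): per-object values

    def add(self, query_key, value):
        # Key is reused from the query key (no extra computation);
        # value is encoded per object after that frame's mask is generated.
        self.keys.append(query_key)
        self.values.append(value)

    def get(self):
        # Concatenate along the spatial axis: (C_k, T*H*W), (N, C_v, T*H*W).
        return torch.cat(self.keys, dim=-1), torch.cat(self.values, dim=-1)
```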
STM [18] considers every fifth query frame as a memory frame, and the immediately previous frame as a temporary memory frame, to ensure accurate matching. In the case of STCN, we find that it is unnecessary, and in fact harmful, to include the last frame as temporary memory. This is a direct consequence of using shared key encoders: 1) key features are sufficiently robust to match well without the need for close-range (temporal) propagation, and 2) the temporary memory key would otherwise be too similar to that of the query, since the image context usually changes smoothly and we do not have the encoding noise that results from distinct encoders, leading to drifting.⁴ This modification also reduces the number of calls to the value encoder, contributing to a significant speedup.
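Put together, the STCN memory schedule is a simple rule. This sketch reuses the hypothetical helpers above; frames, first_frame_mask, key_encoder, value_encoder, and decoder stand in for the inputs and networks of Section 3.1 and are our assumptions.

```python
MEMORY_EVERY = 5  # every fifth frame, following the STM convention

bank = MemoryBank()
mask = first_frame_mask  # given ground-truth mask for frame 0
for t, frame in enumerate(frames):
    query_key = key_encoder(frame)         # hypothetical call
    if t > 0:
        k_mem, v_mem = bank.get()
        W = affinity(k_mem, query_key)
        mask = decoder(readout(v_mem, W))  # hypothetical call
    # No temporary memory of the previous frame, unlike STM.
    if t % MEMORY_EVERY == 0:
        bank.add(query_key, value_encoder(frame, mask))
```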
³ We find this block to be non-essential in a later experiment, but it is kept for consistency.
⁴ This effect is amplified by the use of L2 similarity. See the supplementary material for a full comparison.