aggregate the information captured from different orientations.
To speed up deep learning from voxels, Wang et al. [24] proposed O-CNN to learn global features based on a novel octree data structure. To learn local features from voxels, Han et al. [12] proposed a novel voxelization permutation strategy to eliminate the effect of rotation and orientation ambiguity on the 3D surface. Although voxel-based methods have the advantage of generating 3D shapes, they not only incur heavy computational cost but also require 3D shapes to be aligned. In addition, such methods usually discriminate shapes worse than the view-based methods described below.
C. View-Based Methods
Light Field Descriptor (LFD) [25] is the pioneer view-based
3D descriptor, which employs features of 2D silhouettes
in multiple views of 3D shapes. Instead of aggregating
multi-view information into global features, LFD evaluates the
dissimilarity between two shapes by comparing the 2D features of their corresponding view sets in a greedy way. Using the same strategy, GIFT [5] measures the difference between two shapes by the Hausdorff distance between their corresponding view sets. To bridge 2D sketches and 3D shapes for shape retrieval, it was further proposed to learn barycentric representations of 3D shapes from multiple views [26].
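As a rough illustration, the set-to-set matching behind such retrieval can be sketched as follows; this minimal Python version (with hypothetical names and plain Euclidean distances over per-view features) only mirrors the spirit of the Hausdorff comparison in GIFT, not its exact features or metric.

import numpy as np

def view_set_distance(views_a, views_b):
    # views_a, views_b: (num_views, feat_dim) arrays of per-view features.
    # Pairwise Euclidean distances between every view of A and every view of B.
    diff = views_a[:, None, :] - views_b[None, :, :]
    pairwise = np.linalg.norm(diff, axis=-1)   # (n_a, n_b)
    # Directed distances: each view is matched to its closest view in the other set.
    a_to_b = pairwise.min(axis=1).max()
    b_to_a = pairwise.min(axis=0).max()
    # The Hausdorff distance is the larger of the two directed distances.
    return max(a_to_b, b_to_a)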
DeepPano [6] was proposed to learn features from
PANORAMA views using CNN, where a PANORAMA view
can be regarded as the seamless aggregation of multiple views
captured on a circle. To eliminate the effect of rotation about
the up-oriented direction, row-wise max pooling was intro-
duced in DeepPano. With pose normalization, Sfikas et al. [27]
used CNN to learn 3D features from multiple PANORAMA
views which were stacked together in a consistent order.
Similarly, Sinha et al. [28] proposed to learn 3D features from another hand-crafted representation, the geometry image. In addition, RotationNet [29] was proposed to learn global features by treating pose labels as latent variables, which are optimized to self-align in an unsupervised manner.
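The row-wise max pooling used in DeepPano can be illustrated with a minimal sketch; assuming a panorama feature map whose width axis corresponds to the viewing angle around the upright axis, taking the max along that axis makes the output invariant to circular shifts, i.e., to rotation about that axis (the function name and tensor layout here are hypothetical):

import torch

def row_wise_max_pool(feature_map):
    # feature_map: (batch, channels, height, width) tensor, where width
    # sweeps the viewing angle around the upright axis. A rotation of the
    # shape circularly shifts the columns, and the max over columns is
    # unaffected by any such shift.
    return feature_map.max(dim=-1).values   # (batch, channels, height)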
Recently, Su et al. [3] proposed Multi-View CNN to learn
3D global features from multiple views. To describe a 3D
shape by multiple views, the content information within
multiple views is aggregated into the global feature through
max pooling. Max pooling is also employed to aggregate multiple views when learning local features for shape segmentation or correspondence [4]. To employ more content
information in each view, Savva et al. [30] concatenated all
view features for hierarchical abstraction in the CNN-based
model. By decomposing a view sequence into a set of view
pairs, Johns et al. [31] classified each view pair independently,
and then, learned an object classifier by weighting the contri-
bution of each view pair, which allowed 3D shape recognition
over arbitrary camera trajectories. To perform pooling more
efficiently, Wang et al. [8] proposed dominant set clustering to cluster the views taken from each shape, where pooling is performed within each cluster.
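The view pooling shared by these methods reduces to an element-wise max across per-view CNN features; a minimal sketch (dimensions hypothetical) makes both its order invariance and its information loss visible, since only one view can contribute each output element:

import torch

def max_pool_views(view_features):
    # view_features: (num_views, feat_dim) tensor, one row of CNN
    # features per rendered view. The element-wise max over the view
    # axis yields a single global descriptor that is unaffected by the
    # order (and hence the rotation) of the views, but discards the
    # non-maximal content of every view and all inter-view structure.
    return view_features.max(dim=0).values   # (feat_dim,)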
Although pooling resolves the effect of rotation of 3D
shapes, it still suffers from two kinds of information loss,
i.e., the content information of almost all views and the
spatial information among the views. The spatial information
between pairwise views is also disregarded by the view pair
decomposition [31]. Savva et al. [30] compensated for these two kinds of loss by concatenating all views; however, concatenation is sensitive to the position of the first view.
To resolve the aforementioned issues, SeqViews2SeqLabels
is proposed to learn 3D features via aggregating sequential
views with an RNN. The RNN-based aggregation not only preserves the content information of all views and the spatial information among the views, but also enables learning the semantics of the view sequence, which makes the features robust to the position of the first view.
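The core of such aggregation can be sketched by feeding per-view CNN features to a recurrent encoder in capture order; the layer sizes and names below are hypothetical, and the actual SeqViews2SeqLabels encoder is described in Section III:

import torch.nn as nn

class SeqViewEncoder(nn.Module):
    # Minimal sketch of RNN-based view aggregation. Unlike max pooling,
    # the recurrent state accumulates the content of every view and the
    # sequential (spatial) relations among them.
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, view_features):
        # view_features: (batch, num_views, feat_dim), in capture order.
        outputs, last_hidden = self.gru(view_features)
        # last_hidden: (1, batch, hidden_dim); its final state serves as
        # the aggregated shape feature in this sketch.
        return last_hidden.squeeze(0)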
D. CNN-RNN Based and RNN-RNN Based Models
SeqViews2SeqLabels is similar to CNN-RNN based and
RNN-RNN based models. Different from multiple views, Miyagi and Aono [32] employed multiple voxel slices to learn 3D global features. They used a CNN to extract the feature of each voxel slice, and then used an RNN for slice aggregation, where a softmax layer was employed to conduct 3D shape classification. Using a two-layer RNN, Le et al. [33]
proposed a CNN-RNN model to segment 3D shapes, where
multiple edge images were predicted to estimate the different
parts on a 3D shape. In addition, RNN-RNN based models,
especially seq2seq models, were originally proposed for text
understanding. Due to their powerful learning ability, they
have been successfully employed for image and speech under-
standing, such as scene text recognition [34], [35], image
caption generation [36] and speech recognition [37]. The
models in [34]–[36] were proposed to recognize what is in a single image. For example, [34] and [35] focused on recognizing the characters in an image, while [36] focused on recognizing the concepts in an image. Different from these tasks, SeqViews2SeqLabels recognizes what a sequence of multiple views represents. This difference makes the involved attention
play different roles. In our method, we use attention to highlight the views with characteristics distinctive of each shape class and to suppress the views with ambiguous appearance. Thus, our attention weights are computed at the image level.
In the methods of [35] and [36], attention is used to highlight
the parts with a specified meaning in an image, although mul-
tiple feature maps are involved. Thus, their attention weights
are computed at the part level. To represent the characteristics of each shape class at each step of the decoder, we propose a novel attention mechanism that differs from the ones employed in [35] and [37].
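The distinction between image-level and part-level attention can be made concrete with a small sketch; this is not the attention mechanism of Section III but a generic soft-attention scorer (all names and the scoring layer are hypothetical), shown only to illustrate that each weight here covers a whole view rather than a spatial location inside one image:

import torch.nn.functional as F

def image_level_attention(view_features, decoder_state, proj):
    # view_features: (num_views, feat_dim), one feature per whole view.
    # decoder_state: (hidden_dim,), the current decoder hidden state.
    # proj: a torch.nn.Linear(feat_dim, hidden_dim) scoring layer.
    scores = proj(view_features) @ decoder_state   # (num_views,)
    weights = F.softmax(scores, dim=0)             # one weight per view
    context = weights @ view_features              # (feat_dim,)
    return context, weights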
III. SEQVIEWS2SEQLABELS
In this section, SeqViews2SeqLabels is introduced in detail.
First, we provide an overview and then describe the key
elements, including capturing sequential views, view feature
extraction, the encoder-RNN, the decoder-RNN, and the atten-
tion mechanism in the subsequent five subsections.