As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [45] (Figure 2, right), which is a slight simplification of the one described in [10]. Letting $\sigma(x) = (1 + e^{-x})^{-1}$ be the sigmoid non-linearity which squashes real-valued inputs to a $[0, 1]$ range, and letting $\phi(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ be the hyperbolic tangent non-linearity, similarly squashing its inputs to a $[-1, 1]$ range, the LSTM updates for timestep $t$ given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ are:
\begin{align*}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{align*}
where $\odot$ denotes elementwise multiplication.
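To make the gating concrete, here is a minimal NumPy sketch of one LSTM update implementing the equations above. The parameter names ($W_{xi}$, $b_i$, etc.) mirror the equations; the `params` dictionary, array shapes, and function signature are illustrative assumptions rather than the configuration used in this paper.

```python
# A minimal NumPy sketch of a single LSTM timestep implementing the
# updates above. Parameter names mirror the equations; shapes and the
# `params` dictionary are illustrative assumptions, not the paper's code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One timestep: returns the new hidden state h_t and memory cell c_t."""
    p = params
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])  # output gate
    g_t = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # input modulation
    c_t = f_t * c_prev + i_t * g_t   # memory cell: gated old state plus gated input
    h_t = o_t * np.tanh(c_t)         # hidden state passed to the next timestep
    return h_t, c_t
```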
[Figure 2 appears here: left panel, the basic RNN unit (inputs $x_t$ and $h_{t-1}$, a single $\phi$ nonlinearity, output $z_t = h_t$); right panel, the LSTM unit with its memory cell, input gate, forget gate, output gate, and input modulation gate.]
Figure 2: A diagram of a basic RNN cell (left) and an LSTM memory cell (right) used in this paper (from [45], a slight simplification of the architecture described in [9], which was derived from the LSTM initially proposed in [12]).
In addition to a hidden unit $h_t \in \mathbb{R}^N$, the LSTM includes an input gate $i_t \in \mathbb{R}^N$, forget gate $f_t \in \mathbb{R}^N$, output gate $o_t \in \mathbb{R}^N$, input modulation gate $g_t \in \mathbb{R}^N$, and memory cell $c_t \in \mathbb{R}^N$. The memory cell unit $c_t$ is a sum of two terms: the previous memory cell unit $c_{t-1}$, modulated by $f_t$, and $g_t$, a function of the current input and previous hidden state, modulated by the input gate $i_t$.
Because $i_t$ and $f_t$ are sigmoidal, their values lie within the range $[0, 1]$, and $i_t$ and $f_t$ can be thought of as knobs that the LSTM learns to use to selectively forget its previous memory or consider its current input. Likewise, the output gate $o_t$ learns how much of the memory cell to transfer to the hidden state. These additional cells enable the LSTM to learn extremely complex and long-term temporal dynamics that the RNN is not capable of learning. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state of the LSTM in layer $l - 1$ as the input to the LSTM in layer $l$.
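Stacking can be sketched in the same style; the snippet below reuses the hypothetical `lstm_step` function from the sketch above and simply feeds each layer's hidden state to the next layer at the same timestep. The list-based state layout is an assumption for illustration.

```python
# A sketch of stacked LSTMs for additional depth, reusing the hypothetical
# lstm_step sketch above: the hidden state of layer l-1 at timestep t is
# the input to layer l at the same timestep.
def stacked_lstm_step(x_t, h_prev_layers, c_prev_layers, params_layers):
    """States and parameters are given as lists, one entry per layer."""
    h_layers, c_layers = [], []
    layer_input = x_t
    for h_prev, c_prev, params in zip(h_prev_layers, c_prev_layers, params_layers):
        h_t, c_t = lstm_step(layer_input, h_prev, c_prev, params)
        h_layers.append(h_t)
        c_layers.append(c_t)
        layer_input = h_t  # feed this layer's hidden state into the next layer
    return h_layers, c_layers
```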
Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [10] and machine translation [38, 5]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [10] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text transcriptions. [38] and [5] translate sentences from English to French with a multi-layer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows sequences of different lengths to be mapped to each other. Like [10], the sequence-to-sequence architecture for machine translation circumvents the need for language models.
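For illustration only, the encoder-decoder idea can be sketched as follows, again reusing the hypothetical `lstm_step` helper from above. The `readout` function, start token, and fixed maximum length are placeholder assumptions standing in for the learned output layer and stopping mechanism used in practice; this is not the setup of [38] or [5].

```python
# A rough sketch of an encoder-decoder scheme: an encoding LSTM consumes
# the source sequence into a hidden state, and a decoding LSTM unrolls
# from that state to emit a target sequence of possibly different length.
import numpy as np

def encode(source_seq, enc_params, n_hidden):
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    for x_t in source_seq:                 # consume the source sentence
        h, c = lstm_step(x_t, h, c, enc_params)
    return h, c                            # fixed-size summary of the source

def decode(h, c, dec_params, readout, start_token, max_len):
    y, outputs = start_token, []
    for _ in range(max_len):               # emit target tokens one at a time
        h, c = lstm_step(y, h, c, dec_params)
        y = readout(h)                     # hypothetical map from hidden state to token
        outputs.append(y)
    return outputs
```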
The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed-length inputs or outputs, allowing simple modeling of sequential data of varying lengths, such as text or video. We next describe a unified framework to combine LSTMs with deep convolutional networks to create a model which is both spatially and temporally deep.
3. Long-term Recurrent Convolutional Network (LRCN) model
This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. Our LRCN model works by passing each visual input $v_t$ (an image in isolation, or a frame from a video) through a feature transformation $\phi_V(v_t)$ parametrized by $V$ to produce a fixed-length vector representation $\phi_t \in \mathbb{R}^d$. Having computed the feature-space representation of the visual input sequence $\langle \phi_1, \phi_2, \dots, \phi_T \rangle$, the sequence model then takes over.
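The per-timestep feature extraction can be sketched as follows; `extract_visual_features` is a hypothetical placeholder for the CNN transformation $\phi_V$, and the feature dimension $d$ is an assumption, not the network or dimensionality used in this paper.

```python
# Schematic sketch of the LRCN front end: each visual input v_t is mapped
# to a fixed-length feature vector phi_t, and the resulting sequence
# <phi_1, ..., phi_T> is handed to the recurrent sequence model.
import numpy as np

def extract_visual_features(frame, d=4096):
    # Hypothetical stand-in for the CNN transformation phi_V(v_t): any
    # function producing a fixed-length d-dimensional vector works here.
    flat = np.asarray(frame, dtype=np.float64).ravel()
    phi = np.zeros(d)
    phi[:min(d, flat.size)] = flat[:d]
    return phi

def lrcn_features(frames, d=4096):
    """Map a sequence of frames <v_1, ..., v_T> to <phi_1, ..., phi_T>."""
    return [extract_visual_features(v, d) for v in frames]
```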
In its most general form, a sequence model parametrized by $W$ maps an input $x_t$ and a previous timestep hidden state $h_{t-1}$ to an output $z_t$ and updated hidden state $h_t$. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: $h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$, then $h_2 = f_W(x_2, h_1)$, etc., up to $h_T$. Some of our models stack multiple LSTMs atop one another as described in Section 2.
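The sequential nature of inference amounts to the loop below; `sequence_step` is a hypothetical stand-in for $f_W$ (it could, for instance, wrap the `lstm_step` sketch from Section 2), and the initial hidden state is taken to be zero.

```python
# Sequential inference through the sequence model: hidden states must be
# computed in order, h_1 = f_W(x_1, 0), then h_2 = f_W(x_2, h_1), up to h_T.
def run_sequence_model(inputs, sequence_step, h0):
    """`sequence_step(x_t, h_prev) -> (z_t, h_t)` is a stand-in for f_W."""
    h = h0                    # h_0 = 0 in the formulation above
    outputs = []
    for x_t in inputs:
        z_t, h = sequence_step(x_t, h)
        outputs.append(z_t)   # z_t later feeds the per-timestep softmax
    return outputs, h
```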
The final step in predicting a distribution $P(y_t)$ at timestep $t$ is to take a softmax over the outputs $z_t$ of the sequential model, producing a distribution over the (in our case, finite and discrete) space $C$ of possible per-timestep