discussed so far, assuming a fixed $|V|$, $d$, $d_h$:
$C_{\text{tied}} < C_{\text{bilinear}} \leq C_{\text{dual}} \leq C_{\text{base}}$,   (6)
where $C_{\text{tied}}$, $C_{\text{base}}$, $C_{\text{bilinear}}$ and $C_{\text{dual}}$ respectively correspond to the number of dedicated parameters of an output layer with (Eq. 2) and without (Eq. 1) weight tying, using the bilinear mapping (Eq. 3) and the dual nonlinear mapping (Eq. 5), which are assumed to be nonzero except for $C_{\text{tied}}$.
Given this analysis, we identify and aim to address the
following limitations of the previously proposed output layer
parameterisations for language generation:
(a) Shallow modeling of the label space. Output labels are mapped into the joint space with a single (possibly nonlinear) projection. Its power can only be increased by increasing the dimensionality of the joint space.
(b) Tendency to overfit. Increasing the dimensionality of the joint space, and thus the power of the output classifier, can lead to undesirable effects such as overfitting in certain language generation tasks, which limits its applicability to arbitrary domains.
3. Deep Residual Output Layers
To address the aforementioned limitations, we propose a deep residual output layer architecture for neural language generation which performs deep modeling of the structure of the output space while preserving acquired information and avoiding overfitting. Our formulation adopts the general form and the basic principles of the previous output layer parameterisations from Section 2.3 which aim to capture the output structure explicitly, namely (i) learning rich output structure, (ii) controlling the output layer capacity independently of the dimensionality of the vocabulary, the encoder and the word embedding, and, lastly, (iii) avoiding costly label-set-size-dependent parameterisations.
3.1. Overview
A general overview of the proposed architecture for neural language generation is displayed in Fig. 1. We base our output layer formulation on the general form of the dual nonlinear mapping of Eq. 4:
$p(y_t \mid y_1^{t-1}) \propto \exp\big(g_{\text{out}}(E)\, g_{\text{in}}(h_t) + b\big)$.   (7)
The input network $g_{\text{in}}(\cdot)$ takes as input a sequence of words, represented by their input word embeddings $E$, which have been encoded in a context representation $h_t$ for the given time step $t$. The output or label network $g_{\text{out}}(\cdot)$ takes as input the word(s) describing each possible output label and encodes them in a label embedding $E^{(k)}$, where $k$ is the depth of the label encoder network. Next, we define these two proposed networks, and then we discuss how the model is trained and how it relates to previous output layers.
[Figure: the input text $w_1, w_2, \ldots, w_T$ is encoded by $g_{\text{in}}$ into the context representation $h_t$, the output vocabulary $w_1, w_2, \ldots, w_{|V|}$ is encoded by $g_{\text{out}}$ into the label embeddings $E^{(k)}$, and the two are combined with the bias $b$ to produce the output $y$.]
Figure 1. General overview of the proposed architecture.
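To make the factorisation in Eq. 7 concrete, the NumPy sketch below computes the output distribution from $E$, $h_t$ and $b$. The dimensions, the random values, and the identity placeholders standing in for $g_{\text{in}}$ and $g_{\text{out}}$ are illustrative assumptions, not the trained networks defined in the following sections.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Hypothetical sizes: vocabulary size |V| and embedding width d.
V, d = 10000, 256
E = np.random.randn(V, d)    # input word embedding matrix (one row per word)
h_t = np.random.randn(d)     # context representation at time step t (from the encoder)
b = np.zeros(V)              # output bias

# Placeholder mappings: identities stand in for the learned g_in and g_out networks.
g_in = lambda h: h           # maps the context h_t into the joint space
g_out = lambda M: M          # maps the label embeddings into the joint space (Sec. 3.2)

# Eq. 7: p(y_t | y_1^{t-1}) ∝ exp(g_out(E) g_in(h_t) + b)
logits = g_out(E) @ g_in(h_t) + b
p = softmax(logits)          # distribution over the |V| output labels
```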
3.2. Label Encoder Network
For language generation tasks, the output labels are each a word in the vocabulary $V$. We assume that these labels are represented with their associated word embedding, which is a row in $E$. In general, there may be additional information about each label, such as dictionary entries, cross-lingual resources, or contextual information, in which case we can add an initial encoder for these descriptions which outputs a label embedding matrix $E_0 \in \mathbb{R}^{|V| \times d}$. In this paper we make the simplifying assumption that $E_0 = E$ and leave the investigation of additional label information to future work.
3.2.1. LEARNING OUTPUT STRUCTURE
To obtain a label representation which is able to encode rich output space structure, we define the $g_{\text{out}}(\cdot)$ function to be a deep neural network with $k$ layers which takes the label embedding $E$ as input and outputs its deep label mapping at the last layer, $g_{\text{out}}(E) = E^{(k)}$, as follows:
$E^{(k)} = f^{(k)}_{\text{out}}(E^{(k-1)})$,   (8)
where $k$ is the depth of the network and each function $f^{(i)}_{\text{out}}(\cdot)$ at the $i$-th layer is a nonlinear projection of the following form:
$f^{(i)}_{\text{out}}(E^{(i-1)}) = \sigma\big(E^{(i-1)} U^{(i)} + b^{(i)}_u\big)$,   (9)
where $\sigma(\cdot)$ is a nonlinear activation function such as ReLU or Tanh, and the matrix $U^{(i)} \in \mathbb{R}^{d \times d_j}$ and the bias $b^{(i)}_u \in \mathbb{R}^{d_j}$ define the linear projection of the encoded outputs at the $i$-th layer. Note that when we restrict the above label network to a depth of one layer, the projection is equivalent to the label mapping from previous work in Eq. 5.
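As a minimal sketch of Eqs. 8 and 9, the snippet below stacks $k$ nonlinear projections over the label embeddings. The sizes, the tanh activation, and the random initialisation are placeholder assumptions, and for simplicity every layer keeps the embedding width $d$ (i.e., $d_j = d$).

```python
import numpy as np

def f_out(E_prev, U_i, b_i, sigma=np.tanh):
    # One layer of the label network (Eq. 9): a nonlinear projection of the labels.
    return sigma(E_prev @ U_i + b_i)

# Hypothetical sizes; every layer keeps the embedding width d.
V, d, k = 10000, 256, 3
E = np.random.randn(V, d)                   # label (word) embeddings, E^(0) = E
params = [(np.random.randn(d, d) * 0.01,    # U^(i)
           np.zeros(d))                     # b_u^(i)
          for _ in range(k)]

# Eq. 8: compose the k projections to obtain the deep label mapping g_out(E) = E^(k).
E_k = E
for U_i, b_i in params:
    E_k = f_out(E_k, U_i, b_i)
```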
3.2.2. PRESERVING INFORMATION
The multiple layers of projections in Eq. 8 force the relation-
ship between word embeddings
E
and label embeddings
E
(k)
to be highly nonlinear. To preserve useful informa-
tion from the original word embeddings and to facilitate
the learning of the label network we add a skip connection
directly to the input embedding. Optionally, for very deep
label networks, we also add a residual connection to previ-
ous layers as in (He et al., 2016). With these additions the
projection at the k
th
layer becomes:
$E^{(k)} = f^{(k)}_{\text{out}}(E^{(k-1)}) + E^{(k-1)} + E$   (10)