Given the hypothesis that natural language is high-rank, it is clear that the Softmax bottleneck limits
the expressiveness of the models. In practice, the embedding dimension d is usually set at the scale
of $10^2$, while the rank of A can possibly be as high as M (at the scale of $10^5$), which is orders of
magnitude larger than d. Softmax is effectively learning a low-rank approximation to A, and our
experiments suggest that such an approximation loses the ability to model context dependency, both
qualitatively and quantitatively (cf. Section 3).
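To see the rank constraint numerically, the following minimal sketch (with made-up sizes N, M, and d; not from the paper) builds a random Softmax model and checks the rank of its log-probability matrix, which cannot exceed d + 1 because it is a rank-d logit matrix minus a rank-1 row-wise normalizer.

```python
# Minimal numerical sketch (illustrative shapes, not from the paper): the
# log-probability matrix of a single Softmax has rank at most d + 1,
# no matter how many contexts N or tokens M there are.
import torch

N, M, d = 500, 1000, 32                     # hypothetical contexts, vocab size, embedding dim
H = torch.randn(N, d)                       # context vectors h_c
W = torch.randn(M, d)                       # word embeddings w_x
A_hat = torch.log_softmax(H @ W.T, dim=-1)  # log P_theta(x | c), shape (N, M)

print(torch.linalg.matrix_rank(A_hat).item())   # at most d + 1 (here typically 33)
```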
2.3 EASY FIXES?
Identifying the Softmax bottleneck immediately suggests some possible “easy fixes”. First, as con-
sidered by much prior work, one can employ a non-parametric model, namely an Ngram model
(Kneser & Ney, 1995). Ngram models are not constrained by any parametric form, so they can
universally approximate any natural language given enough parameters. Second, it is possible to increase
the dimension d (e.g., to match M) so that the model can express a high-rank matrix A.
However, these two methods increase the number of parameters dramatically, compared to using
a low-dimensional Softmax. More specifically, an Ngram needs (N × M) parameters in order to
express A, where N is potentially unbounded. Similarly, a high-dimensional Softmax requires (M ×
M) parameters for the word embeddings. Increasing the number of model parameters easily leads
to overfitting. In past work, Kneser & Ney (1995) used back-off to alleviate overfitting. Moreover,
as deep learning models were tuned by extensive hyper-parameter search, increasing the dimension
d beyond several hundred is not helpful³ (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017).
Clearly, there is a tradeoff between expressiveness and generalization in language modeling. Naively
increasing the expressiveness hurts generalization. Below, we introduce an alternative approach that
increases the expressiveness without exploding the parameter space.
2.4 MIXTURE OF SOFTMAXES: A HIGH-RANK LANGUAGE MODEL
We propose a high-rank language model called Mixture of Softmaxes (MoS) to alleviate the Softmax
bottleneck issue. MoS formulates the conditional distribution as
$$
P_\theta(x \mid c) = \sum_{k=1}^{K} \pi_{c,k} \, \frac{\exp\!\left(\mathbf{h}_{c,k}^{\top} \mathbf{w}_x\right)}{\sum_{x'} \exp\!\left(\mathbf{h}_{c,k}^{\top} \mathbf{w}_{x'}\right)}; \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_{c,k} = 1
$$
where $\pi_{c,k}$ is the prior or mixture weight of the $k$-th component, and $\mathbf{h}_{c,k}$ is the $k$-th context vector associated with context c. In other words, MoS computes K Softmax distributions and uses a
weighted average of them as the next-token probability distribution. Similar to prior work on re-
current language modeling (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017), we first
apply a stack of recurrent layers on top of X to obtain a sequence of hidden states $(\mathbf{g}_1, \cdots, \mathbf{g}_T)$.
The prior and the context vector for context $c_t$ are parameterized as
$$
\pi_{c_t,k} = \frac{\exp\!\left(\mathbf{w}_{\pi,k}^{\top} \mathbf{g}_t\right)}{\sum_{k'=1}^{K} \exp\!\left(\mathbf{w}_{\pi,k'}^{\top} \mathbf{g}_t\right)}
\quad \text{and} \quad
\mathbf{h}_{c_t,k} = \tanh\!\left(\mathbf{W}_{h,k}\, \mathbf{g}_t\right)
$$
where $\mathbf{w}_{\pi,k}$ and $\mathbf{W}_{h,k}$ are model parameters.
Our method is simple and easy to implement, and has the following advantages:
• Improved expressiveness (compared to Softmax). MoS is theoretically more (or at least equally)
expressive than Softmax given the same dimension d. This can be seen from the fact that
MoS with K = 1 reduces to Softmax. More importantly, MoS effectively approximates A by
$$
\hat{\mathbf{A}}_{\text{MoS}} = \log \sum_{k=1}^{K} \Pi_{k} \exp\!\left(\mathbf{H}_{\theta,k} \mathbf{W}_{\theta}^{\top}\right)
$$
where $\Pi_{k}$ is an $(N \times N)$ diagonal matrix with elements being the prior $\pi_{c,k}$. Because $\hat{\mathbf{A}}_{\text{MoS}}$ is
a nonlinear function (log-sum-exp) of the context vectors and the word embeddings, $\hat{\mathbf{A}}_{\text{MoS}}$ can
be arbitrarily high-rank. As a result, MoS does not suffer from the rank limitation of Softmax; a
numerical sketch of this argument follows below.
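Continuing the earlier sketch (same made-up sizes, again not from the paper), one can check numerically that the log of a weighted sum of K Softmaxes is no longer capped at rank d + 1.

```python
# Continuation of the earlier sketch (illustrative shapes): with K components,
# the log of the mixture is a nonlinear (log-sum-exp) function of H and W,
# so its numerical rank can far exceed d + 1.
import torch

N, M, d, K = 500, 1000, 32, 4
W = torch.randn(M, d)                                   # shared word embeddings
H = torch.randn(K, N, d)                                # one context vector per component
pi = torch.softmax(torch.randn(N, K), dim=-1)           # mixture weights per context

component_probs = torch.softmax(H @ W.T, dim=-1)        # (K, N, M)
A_mos = torch.log((pi.T.unsqueeze(-1) * component_probs).sum(dim=0))  # (N, M)

print(torch.linalg.matrix_rank(A_mos).item())           # typically far above d + 1
```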
³ This is also confirmed by our preliminary experiments.