Given the hypothesis that natural language is high-rank, it is clear that the Softmax bottleneck limits
the expressiveness of the models. In practice, the embedding dimension d is usually set at the scale
of $10^2$, while the rank of A can possibly be as high as M (at the scale of $10^5$), which is orders of
magnitude larger than d. Softmax is effectively learning a low-rank approximation to A, and our
experiments suggest that such an approximation loses the ability to model context dependency, both
qualitatively and quantitatively (cf. Section 3).
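To see the rank constraint numerically, the following minimal sketch (with made-up sizes N, M, and d; not from the paper) builds a random Softmax model and checks the rank of its log-probability matrix, which cannot exceed d + 1 because it is a rank-d logit matrix minus a rank-1 row-wise normalizer.

```python
# Minimal numerical sketch (illustrative shapes, not from the paper): the
# log-probability matrix of a single Softmax has rank at most d + 1,
# no matter how many contexts N or tokens M there are.
import torch

N, M, d = 500, 1000, 32                     # hypothetical contexts, vocab size, embedding dim
H = torch.randn(N, d)                       # context vectors h_c
W = torch.randn(M, d)                       # word embeddings w_x
A_hat = torch.log_softmax(H @ W.T, dim=-1)  # log P_theta(x | c), shape (N, M)

print(torch.linalg.matrix_rank(A_hat).item())   # at most d + 1 (here typically 33)
```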
2.3 EASY FIXES?
Identifying the Softmax bottleneck immediately suggests some possible “easy fixes”. First, as con-
sidered by much prior work, one can employ a non-parametric model, namely an Ngram model
(Kneser & Ney, 1995). Ngram models are not constrained by any parametric form, so they can
universally approximate any natural language given enough parameters. Second, it is possible to increase
the dimension d (e.g., to match M) so that the model can express a high-rank matrix A.
However, these two methods increase the number of parameters dramatically, compared to using
a low-dimensional Softmax. More specifically, an Ngram needs (N × M) parameters in order to
express A, where N is potentially unbounded. Similarly, a high-dimensional Softmax requires (M ×
M) parameters for the word embeddings. Increasing the number of model parameters easily leads
to overfitting. In past work, Kneser & Ney (1995) used back-off to alleviate overfitting. Moreover,
as deep learning models were tuned by extensive hyper-parameter search, increasing the dimension
d beyond several hundred is not helpful³ (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017).
Clearly, there is a tradeoff between expressiveness and generalization in language modeling. Naively
increasing the expressiveness hurts generalization. Below, we introduce an alternative approach that
increases the expressiveness without exploding the parameter space.
2.4 MIXTURE OF SOFTMAXES: A HIGH-RANK LANGUAGE MODEL
We propose a high-rank language model called Mixture of Softmaxes (MoS) to alleviate the Softmax
bottleneck issue. MoS formulates the conditional distribution as
$$
P_\theta(x \mid c) = \sum_{k=1}^{K} \pi_{c,k} \, \frac{\exp\!\left(\mathbf{h}_{c,k}^{\top} \mathbf{w}_x\right)}{\sum_{x'} \exp\!\left(\mathbf{h}_{c,k}^{\top} \mathbf{w}_{x'}\right)}; \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_{c,k} = 1
$$
where $\pi_{c,k}$ is the prior or mixture weight of the $k$-th component, and $\mathbf{h}_{c,k}$ is the $k$-th context vector associated with context c. In other words, MoS computes K Softmax distributions and uses a
weighted average of them as the next-token probability distribution. Similar to prior work on re-
current language modeling (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017), we first
apply a stack of recurrent layers on top of X to obtain a sequence of hidden states $(\mathbf{g}_1, \cdots, \mathbf{g}_T)$.
The prior and the context vector for context $c_t$ are parameterized as
$$
\pi_{c_t,k} = \frac{\exp\!\left(\mathbf{w}_{\pi,k}^{\top} \mathbf{g}_t\right)}{\sum_{k'=1}^{K} \exp\!\left(\mathbf{w}_{\pi,k'}^{\top} \mathbf{g}_t\right)}
\quad \text{and} \quad
\mathbf{h}_{c_t,k} = \tanh\!\left(\mathbf{W}_{h,k}\, \mathbf{g}_t\right)
$$
where $\mathbf{w}_{\pi,k}$ and $\mathbf{W}_{h,k}$ are model parameters.
Our method is simple and easy to implement, and has the following advantages:
• Improved expressiveness (compared to Softmax). MoS is theoretically more (or at least equally)
expressive than Softmax given the same dimension d. This can be seen from the fact that
MoS with K = 1 reduces to Softmax. More importantly, MoS effectively approximates A by
$$
\hat{\mathbf{A}}_{\text{MoS}} = \log \sum_{k=1}^{K} \Pi_{k} \exp\!\left(\mathbf{H}_{\theta,k} \mathbf{W}_{\theta}^{\top}\right)
$$
where $\Pi_{k}$ is an $(N \times N)$ diagonal matrix with elements being the prior $\pi_{c,k}$. Because $\hat{\mathbf{A}}_{\text{MoS}}$ is
a nonlinear function (log-sum-exp) of the context vectors and the word embeddings, $\hat{\mathbf{A}}_{\text{MoS}}$ can
be arbitrarily high-rank. As a result, MoS does not suffer from the rank limitation of Softmax; a
numerical sketch of this argument follows below.
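Continuing the earlier sketch (same made-up sizes, again not from the paper), one can check numerically that the log of a weighted sum of K Softmaxes is no longer capped at rank d + 1.

```python
# Continuation of the earlier sketch (illustrative shapes): with K components,
# the log of the mixture is a nonlinear (log-sum-exp) function of H and W,
# so its numerical rank can far exceed d + 1.
import torch

N, M, d, K = 500, 1000, 32, 4
W = torch.randn(M, d)                                   # shared word embeddings
H = torch.randn(K, N, d)                                # one context vector per component
pi = torch.softmax(torch.randn(N, K), dim=-1)           # mixture weights per context

component_probs = torch.softmax(H @ W.T, dim=-1)        # (K, N, M)
A_mos = torch.log((pi.T.unsqueeze(-1) * component_probs).sum(dim=0))  # (N, M)

print(torch.linalg.matrix_rank(A_mos).item())           # typically far above d + 1
```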
³ This is also confirmed by our preliminary experiments.