We also wanted to try more radical approaches. For instance, we tried interpolating
together vwyx with vxyw and wxyv (along with the baseline vwxy).
This model puts each of the four preceding words in the last (most important)
position for one component. This model does not work as well as the previous
two, leading us to conclude that the y word is by far the most important. We
also tried a model with vwyx, vywx, yvwx, which puts the y word in each
possible position in the backoff model. This was overall the worst model,
reconfirming the intuition that the y word is critical. However, as we saw by adding
vwx to vwy and vxy, having a component with the x position final is also
important. This will also be the case for trigrams.
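To make the combination concrete, the following is a minimal sketch of an interpolated skipping model, assuming simple additive smoothing in place of the Interpolated Kneser-Ney smoothing used in our experiments. The names (SkipComponent, interpolate) and the toy corpus are hypothetical, and the sketch identifies a component only by which of the four preceding words it conditions on; it does not capture the difference between components such as vwyx and vwxy, which differ only in backoff order.

    from collections import defaultdict

    # A single skipping component: it conditions the next word on a chosen
    # subset of the four preceding words, indexed 0 = v, 1 = w, 2 = x, 3 = y.
    class SkipComponent:
        def __init__(self, positions, alpha=0.1):
            self.positions = positions          # e.g. [0, 1, 3] for vwy
            self.alpha = alpha                  # additive smoothing constant
            self.counts = defaultdict(lambda: defaultdict(int))
            self.totals = defaultdict(int)
            self.vocab = set()

        def train(self, words):
            for i in range(4, len(words)):
                history = tuple(words[i - 4:i])
                context = tuple(history[p] for p in self.positions)
                self.counts[context][words[i]] += 1
                self.totals[context] += 1
                self.vocab.add(words[i])

        def prob(self, history, word):
            context = tuple(history[p] for p in self.positions)
            v = max(len(self.vocab), 1)
            return ((self.counts[context][word] + self.alpha) /
                    (self.totals[context] + self.alpha * v))

    # Linear interpolation of the component probabilities.
    def interpolate(components, weights, history, word):
        return sum(lam * c.prob(history, word)
                   for lam, c in zip(weights, components))

    # Example: the baseline vwxy interpolated with vwy (skip x), vxy (skip w)
    # and vwx (skip y).
    corpus = "the cat sat on the mat and the cat sat on the rug".split()
    components = [SkipComponent([0, 1, 2, 3]),   # vwxy
                  SkipComponent([0, 1, 3]),      # vwy
                  SkipComponent([0, 2, 3]),      # vxy
                  SkipComponent([0, 1, 2])]      # vwx
    for c in components:
        c.train(corpus)
    weights = [0.4, 0.2, 0.2, 0.2]  # in practice, optimized on held-out data
    print(interpolate(components, weights, ("and", "the", "cat", "sat"), "on"))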
Finally, we wanted to get a sort of upper bound on how well 5-gram models
could work. For this, we interpolated together vwyx, vxyw, wxyv, vywx, yvwx,
xvwy and wvxy. This model was chosen as one that would include as many
pairs and triples of combinations of words as possible. The result is a marginal
gain of less than 0.01 bits over the best previous model.
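In each of these skipping experiments, the combined model is a linear interpolation of its component distributions. Writing z for the word being predicted and v, w, x, y for the four preceding words, a model built from components P_1, ..., P_k has the form

    P(z \mid vwxy) = \sum_{j=1}^{k} \lambda_j \, P_j(z \mid \mathrm{context}_j), \qquad \lambda_j \ge 0, \quad \sum_{j=1}^{k} \lambda_j = 1,

where context_j is the (possibly reordered or reduced) set of preceding words used by component j, and the weights \lambda_j are, as usual, optimized on held-out data.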
We do not find these results particularly encouraging. In particular, when
compared to the sentence mixture results that will be presented later, there
seems to be less potential to be gained from skipping models. Also, while
sentence mixture models appear to lead to larger gains the more data that
is used, skipping models appear to get their maximal gain around 10,000,000
words. Presumably, at the largest data sizes, the 5-gram model is becoming
well trained, and there are fewer instances where a skipping model is useful but
the 5-gram is not.
We also examined trigram-like models. These results are shown in Figure
4. The baseline for comparison was a trigram model; we also show the relative
improvement of a 5-gram model over the trigram, and the relative improvement
of the skipping 5-gram with vwy, vxy and vwx. For the
trigram skipping models, each component never depended on more than two of
the previous words. We tried 5 experiments of this form. First, based on the
intuition that pairs using the 1-back word (y) are most useful, we interpolated
xy, wy, vy, uy and ty models. This did not work particularly well, except
at the largest sizes. Presumably at those sizes, a few appropriate instances
of the 1-back word had always been seen. Next, we tried using all pairs of
words through the 4-gram level: xy, wy and wx. Considering its simplicity, this
worked very well. We tried similar models using all 5-gram pairs, all 6-gram
pairs and all 7-gram pairs; this last model contained 15 different pairs. However,
the improvement over 4-gram pairs was still marginal, especially considering the
large number of increased parameters.
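Under the same simplified setup as the earlier sketch, these trigram-like models interpolate components that each condition on only two of the preceding words. The snippet below is purely illustrative (the names are hypothetical): it enumerates the pair contexts described above over the six preceding words t, u, v, w, x, y, and checks that the all-pairs-through-7-gram model indeed has 15 components.

    from itertools import combinations

    # The six preceding words, nearest last: t, u, v, w, x, y (positions 0..5).
    HISTORY = ["t", "u", "v", "w", "x", "y"]

    # "xy, wy, vy, uy and ty": every pair that keeps the 1-back word y.
    one_back_pairs = [(h, "y") for h in HISTORY[:-1]]

    # Pairs through the 4-gram level: xy, wy and wx.
    four_gram_pairs = [("x", "y"), ("w", "y"), ("w", "x")]

    # All pairs through the 7-gram level: every two-word subset of the history.
    all_pairs = list(combinations(HISTORY, 2))
    assert len(all_pairs) == 15  # the 15 different pairs quoted above

    # Each pair would back one SkipComponent-style model, interpolated as before.
    print(one_back_pairs)
    print(all_pairs)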
The trigram skipping results are, relative to their baseline, much better
than the 5-gram skipping results. They do not appear to have plateaued when
more data is used and they are much more comparable to sentence mixture
models in terms of the improvement they get. Furthermore, they lead to more

Footnote: After some experimentation, this turned out to be due to a technical smoothing issue, namely our use of Interpolated Kneser-Ney smoothing with a single discount, even though we know that using multiple discounts is better. When using multiple discounts, the problem goes away.