We also wanted to try more radical approaches. For instance, we tried interpolating
together vwyx with vxyw and wxyv (along with the baseline vwxy).
This model puts each of the four preceding words in the last (most important)
position for one component. This model does not work as well as the previous
two, leading us to conclude that the y word is by far the most important. We
also tried a model with vwyx, vywx, yvwx, which puts the y word in each
possible position in the backoff model. This was overall the worst model,
reconfirming the intuition that the y word is critical. However, as we saw by adding
vwx to vwy and vxy, having a component with the x position final is also
important. This will also be the case for trigrams.
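To make the combination concrete, the following is a minimal sketch of an interpolated skipping model, assuming simple additive smoothing in place of the Interpolated Kneser-Ney smoothing used in our experiments. The names (SkipComponent, interpolate) and the toy corpus are hypothetical, and the sketch identifies a component only by which of the four preceding words it conditions on; it does not capture the difference between components such as vwyx and vwxy, which differ only in backoff order.

    from collections import defaultdict

    # A single skipping component: it conditions the next word on a chosen
    # subset of the four preceding words, indexed 0 = v, 1 = w, 2 = x, 3 = y.
    class SkipComponent:
        def __init__(self, positions, alpha=0.1):
            self.positions = positions          # e.g. [0, 1, 3] for vwy
            self.alpha = alpha                  # additive smoothing constant
            self.counts = defaultdict(lambda: defaultdict(int))
            self.totals = defaultdict(int)
            self.vocab = set()

        def train(self, words):
            for i in range(4, len(words)):
                history = tuple(words[i - 4:i])
                context = tuple(history[p] for p in self.positions)
                self.counts[context][words[i]] += 1
                self.totals[context] += 1
                self.vocab.add(words[i])

        def prob(self, history, word):
            context = tuple(history[p] for p in self.positions)
            v = max(len(self.vocab), 1)
            return ((self.counts[context][word] + self.alpha) /
                    (self.totals[context] + self.alpha * v))

    # Linear interpolation of the component probabilities.
    def interpolate(components, weights, history, word):
        return sum(lam * c.prob(history, word)
                   for lam, c in zip(weights, components))

    # Example: the baseline vwxy interpolated with vwy (skip x), vxy (skip w)
    # and vwx (skip y).
    corpus = "the cat sat on the mat and the cat sat on the rug".split()
    components = [SkipComponent([0, 1, 2, 3]),   # vwxy
                  SkipComponent([0, 1, 3]),      # vwy
                  SkipComponent([0, 2, 3]),      # vxy
                  SkipComponent([0, 1, 2])]      # vwx
    for c in components:
        c.train(corpus)
    weights = [0.4, 0.2, 0.2, 0.2]  # in practice, optimized on held-out data
    print(interpolate(components, weights, ("and", "the", "cat", "sat"), "on"))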
Finally, we wanted to get a sort of upper bound on how well 5-gram models
could work. For this, we interpolated together vwyx, vxyw, wxyv, vywx, yvwx,
xvwy and wvxy. This model was chosen as one that would include as many
pairs and triples of combinations of words as possible. The result is a marginal
gain of less than 0.01 bits over the best previous model.
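In each of these skipping experiments, the combined model is a linear interpolation of its component distributions. Writing z for the word being predicted and v, w, x, y for the four preceding words, a model built from components P_1, ..., P_k has the form

    P(z \mid vwxy) = \sum_{j=1}^{k} \lambda_j \, P_j(z \mid \mathrm{context}_j), \qquad \lambda_j \ge 0, \quad \sum_{j=1}^{k} \lambda_j = 1,

where context_j is the (possibly reordered or reduced) set of preceding words used by component j, and the weights \lambda_j are, as usual, optimized on held-out data.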
We do not find these results particularly encouraging. In particular, when
compared to the sentence mixture results that will be presented later, there
seems to be less potential to be gained from skipping models. Also, while
sentence mixture models appear to lead to larger gains the more data that
is used, skipping models appear to get their maximal gain around 10,000,000
words. Presumably, at the largest data sizes, the 5-gram model is becoming
well trained, and there are fewer instances where a skipping model is useful but
the 5-gram is not.
We also examined trigram-like models. These results are shown in Figure
4. The baseline for comparison was a trigram model; we also show the relative
improvement of a 5-gram model over the trigram, and the relative improvement
of the skipping 5-gram with vwy, vxy and vwx. For the
trigram skipping models, each component never depended on more than two of
the previous words. We tried 5 experiments of this form. First, based on the
intuition that pairs using the 1-back word (y) are most useful, we interpolated
xy, wy, vy, uy and ty models. This did not work particularly well, except
at the largest sizes. Presumably at those sizes, a few appropriate instances
of the 1-back word had always been seen. Next, we tried using all pairs of
words through the 4-gram level: xy, wy and wx. Considering its simplicity, this
worked very well. We tried similar models using all 5-gram pairs, all 6-gram
pairs and all 7-gram pairs; this last model contained 15 different pairs. However,
the improvement over 4-gram pairs was still marginal, especially considering the
large number of increased parameters.
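Under the same simplified setup as the earlier sketch, these trigram-like models interpolate components that each condition on only two of the preceding words. The snippet below is purely illustrative (the names are hypothetical): it enumerates the pair contexts described above over the six preceding words t, u, v, w, x, y, and checks that the all-pairs-through-7-gram model indeed has 15 components.

    from itertools import combinations

    # The six preceding words, nearest last: t, u, v, w, x, y (positions 0..5).
    HISTORY = ["t", "u", "v", "w", "x", "y"]

    # "xy, wy, vy, uy and ty": every pair that keeps the 1-back word y.
    one_back_pairs = [(h, "y") for h in HISTORY[:-1]]

    # Pairs through the 4-gram level: xy, wy and wx.
    four_gram_pairs = [("x", "y"), ("w", "y"), ("w", "x")]

    # All pairs through the 7-gram level: every two-word subset of the history.
    all_pairs = list(combinations(HISTORY, 2))
    assert len(all_pairs) == 15  # the 15 different pairs quoted above

    # Each pair would back one SkipComponent-style model, interpolated as before.
    print(one_back_pairs)
    print(all_pairs)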
The trigram skipping results are, relative to their baseline, much better
than the 5-gram skipping results. They do not appear to have plateaued when
more data is used and they are much more comparable to sentence mixture
models in terms of the improvement they get. Furthermore, they lead to more

Footnote: After some experimentation, this turned out to be due to a technical smoothing issue, namely our use of Interpolated Kneser-Ney smoothing with a single discount, even though we know that using multiple discounts is better. When using multiple discounts, the problem goes away.