Faster and Smaller N-Gram Language Models
Adam Pauls and Dan Klein
Computer Science Division
University of California, Berkeley
{adpauls,klein}@cs.berkeley.edu
Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
1 Introduction
For modern statistical machine translation systems, language models must be both fast and compact. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical. As always, trade-offs exist between time, space, and accuracy, with many recent papers considering small-but-approximate noisy LMs (Chazelle et al., 2004; Guthrie and Hepple, 2010) or small-but-slow compressed LMs (Germann et al., 2009).
In this paper, we present several lossless methods for compactly but efficiently storing large LMs in memory. As in much previous work (Whittaker and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). After presenting a bit-conscious basic system that typifies such approaches, we improve on it in several ways. First, we show how the last word of each entry can be implicitly encoded, almost entirely eliminating its storage requirements. Second, we show that the deltas between adjacent entries can be efficiently encoded with simple variable-length encodings. Third, we investigate block-based schemes that minimize the amount of compressed-stream scanning during lookup.
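
To make these encodings concrete, the following sketch (ours, not the paper's actual implementation; identifiers such as packKey and writeDelta are illustrative) packs each n-gram key as a context offset and a last-word index, sorts the keys, and writes the deltas between adjacent keys with a simple variable-length byte code:

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative sketch: n-gram keys are packed as (contextOffset, lastWord),
// sorted, and the deltas between adjacent keys are written with a
// variable-length byte encoding (7 payload bits per byte).
public class PackedNgramSketch {

    // Hypothetical vocabulary size; a real system reads this from the LM.
    static final long VOCAB_SIZE = 1L << 20;

    // Pack an n-gram as contextOffset * |V| + lastWord, as in a tabular trie.
    static long packKey(long contextOffset, int lastWord) {
        return contextOffset * VOCAB_SIZE + lastWord;
    }

    // Write a non-negative delta, low 7 bits per byte, high bit = continuation.
    static void writeDelta(ByteArrayOutputStream out, long delta) {
        while (delta >= 0x80) {
            out.write((int) (delta & 0x7F) | 0x80);
            delta >>>= 7;
        }
        out.write((int) delta);
    }

    public static void main(String[] args) {
        // Toy keys: (contextOffset, lastWord) pairs for a handful of n-grams.
        long[] keys = { packKey(0, 17), packKey(0, 42), packKey(3, 5), packKey(3, 9) };
        Arrays.sort(keys);

        // Encode the first key absolutely, then only the gaps between neighbours.
        ByteArrayOutputStream encoded = new ByteArrayOutputStream();
        long previous = 0;
        for (long key : keys) {
            writeDelta(encoded, key - previous);
            previous = key;
        }
        System.out.println(keys.length + " keys in " + encoded.size() + " bytes");
    }
}

Because sorted keys that share a context differ only in their last word, adjacent deltas are small and the variable-length code spends few bytes on them; block-based variants bound how much of such a compressed stream must be scanned per lookup.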
To speed up our language models, we present two approaches. The first is a front-end cache. Caching itself is certainly not new to language modeling, but because well-tuned LMs are essentially lookup tables to begin with, naive cache designs only speed up slower systems. We present a direct-addressing cache with a fast key identity check that speeds up our systems (or existing fast systems like the widely used, speed-focused SRILM) by up to 300%.
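
As a rough illustration of the cache design we have in mind (a sketch with our own names, not the exact implementation), a direct-addressed cache stores the full packed key next to the cached value, so a single equality test verifies a hit and a collision simply overwrites the slot:

// Illustrative sketch of a direct-addressed LM cache: each packed n-gram key
// hashes to exactly one slot, and the stored key is compared for identity so
// that collisions overwrite the slot instead of chaining.
public class DirectAddressedCache {
    private final long[] keys;
    private final float[] values;

    // Capacity must be a power of two so that masking can replace modulo.
    public DirectAddressedCache(int capacity) {
        keys = new long[capacity];
        values = new float[capacity];
        java.util.Arrays.fill(keys, -1L); // -1 marks an empty slot (packed keys are non-negative)
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // cheap multiplicative hash
        return (int) (h ^ (h >>> 32)) & (keys.length - 1);
    }

    // Returns the cached probability, or Float.NaN on a miss.
    public float get(long key) {
        int i = slot(key);
        return keys[i] == key ? values[i] : Float.NaN;
    }

    // Overwrites whatever currently occupies the slot.
    public void put(long key, float value) {
        int i = slot(key);
        keys[i] = key;
        values[i] = value;
    }
}

The identity check keys[i] == key makes a hit cost only one array probe and one comparison, which is why such a cache can help even when the underlying LM is itself a fast lookup table.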
Our second speed-up comes from a more fundamental change to the language modeling interface. Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probability and the context encoding for the suffix of the original query. This setup substantially accelerates the scrolling queries issued by decoders, and also exploits language model state equivalence (Li and Khudanpur, 2008).
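
To illustrate the shape of such an interface (names are ours, not the paper's API), a scrolling query takes the encoded context plus the next word and returns both the score and the encoding of the suffix, which the decoder keeps as the context of its next query:

// Illustrative sketch of a "scrolling" LM interface: a query supplies the
// previous context as an opaque encoding plus the next word, and receives
// the log-probability together with the encoding of the query's suffix.
public interface ScrollingLanguageModel {

    // Immutable result of a single scrolling query.
    final class Result {
        public final float logProb;
        public final long suffixContextOffset; // encodes the suffix of the queried n-gram

        public Result(float logProb, long suffixContextOffset) {
            this.logProb = logProb;
            this.suffixContextOffset = suffixContextOffset;
        }
    }

    // Scores the n-gram formed by the encoded context followed by word,
    // without re-looking up the context words individually.
    Result scoreAndScroll(long contextOffset, int word);
}

A decoder extending a hypothesis calls scoreAndScroll once per word, stores the returned suffixContextOffset as its new state, and can recombine hypotheses whose context encodings are equal, which is where state equivalence is exploited.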
Overall, we are able to store the 4 billion n-grams of the Google Web1T (Brants and Franz, 2006) cor-