One Billion Word Benchmark for Measuring Progress in
Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043, USA
Phillipp Koehn
University of Edinburgh
10 Crichton Street, Room 4.19
Edinburgh, EH8 9AB, UK
Tony Robinson
Cantab Research Ltd
St Johns Innovation Centre
Cowley Road, Cambridge, CB4 0WS, UK
Abstract
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to a 35% reduction in perplexity, or a 10% reduction in cross-entropy (bits), over that baseline.
The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
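For reference, perplexity and cross-entropy are related by PPL = 2^H, with H measured in bits, so the two reductions quoted above are consistent:

\[
H_{\text{base}} = \log_2 67.6 \approx 6.08\ \text{bits}, \qquad
H_{\text{combined}} = \log_2(0.65 \times 67.6) \approx 5.46\ \text{bits}, \qquad
\frac{6.08 - 5.46}{6.08} \approx 10\%.
\]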
1 Introduction
Statistical language modeling has been applied to a wide range of applications and domains with great success. To name a few, automatic speech recognition, machine translation, spelling correction, touch-screen “soft” keyboards and many natural language processing applications depend on the quality of language models (LMs).
The performance of LMs is determined mostly by several factors: the amount of training data, the quality and match of the training data to the test data, and the choice of modeling technique for estimation from the data. It is widely accepted that the amount of data, and the ability of a given estimation algorithm to accommodate large amounts of training data, are very important in providing a solution that competes successfully with the entrenched n-gram LMs. At the same time, scaling up a novel algorithm to a large amount of data involves a large amount of work, and provides a significant barrier to entry for new modeling techniques. By choosing one billion words as the amount of training data, we hope to strike a balance between the relevance of the benchmark in the world of abundant data, and the ease with which any researcher can evaluate a given modeling approach.
This follows the work of Goodman (2001a), who explored performance of various language modeling techniques when applied to large data sets. One of the key contributions of our work is that the experiments presented in this paper can be reproduced by virtually anybody with an interest in LM, as we use a data set that is freely available on the web.
Another contribution is that we provide strong baseline results with the currently very popular neural network LM (Bengio et al., 2003). This should allow researchers who work on competitive techniques to quickly compare their results to the current state of the art.
The paper is organized as follows: Section 2 describes how the training data was obtained; Section 3 provides a short overview of the language modeling techniques evaluated; finally, Section 4 presents the results obtained, and Section 5 concludes the paper.