Advances and Challenges in Natural Language Processing


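"Advances in Natural Language Processing" surveys recent progress in the field of natural language processing, covering topics that range from basic technologies to real-world applications.

Natural language processing (NLP) is a branch of computer science whose goal is to enable computers to understand, interpret, and generate human language. As the technology has matured, NLP has shifted from its early focus on analyzing linguistic structure and developing basic technologies (such as machine translation, speech recognition, and speech synthesis) to widespread use in practical applications. Over the past two decades, NLP has made notable progress and has become an active area of both scientific research and practical technology. Four key factors have driven this development:

1. Data explosion: the internet and other digital platforms have produced unprecedented volumes of text, providing rich corpora for training more complex models.
2. Growth in computing power: performance gains driven by Moore's law have made it possible to work with larger models and algorithms, improving the accuracy and efficiency of NLP.
3. The rise of deep learning: deep neural networks, in particular recurrent neural networks (RNNs), long short-term memory (LSTM) networks, the Transformer, and pretrained models such as BERT, have greatly improved performance on NLP tasks, especially in understanding and generating natural language.
4. Open-source tools and frameworks: tools and frameworks such as TensorFlow, PyTorch, and NLTK have lowered the barrier to NLP research and applications and accelerated the growth of the field.

NLP applications have already entered everyday life. Examples include Apple's Siri and Microsoft's Skype Translator, which provide an intelligent spoken dialogue system and real-time speech translation, respectively. NLP is also used to mine social media for information about health or finance and to identify public sentiment and emotion toward products and services.

Despite these achievements, NLP still faces many challenges. First, the complexity and diversity of language mean that models must handle multiple languages, dialects, and accents. Second, understanding context and metaphor remains a major difficulty for current models. Third, privacy and ethical concerns are increasingly prominent, because processing personal data must be secure and compliant. Finally, the interpretability and transparency of models need to improve so that users and developers can better understand how they work. Progress in NLP continues to push the boundaries of artificial intelligence, but new challenges keep emerging, and researchers will need to keep innovating and exploring in order to achieve more intelligent and more humanlike interaction.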

REVIEW

Advances in natural language processing

Julia Hirschberg¹* and Christopher D. Manning²,³

Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today’s researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.

Over the past 20 years, computational linguistics has grown into both an exciting area of scientific research and a practical technology that is increasingly being incorporated into consumer products (for example, in applications such as Apple’s Siri and Skype Translator). Four key factors enabled these developments: (i) a vast increase in computing power, (ii) the availability of very large amounts of linguistic data, (iii) the development of highly successful machine learning (ML) methods, and (iv) a much richer understanding of the structure of human language and its deployment in social contexts. In this Review, we describe some current application areas of interest in language research. These efforts illustrate computational approaches to big data, based on current cutting-edge methodologies that combine statistical analysis and ML with knowledge of language.

Computational linguistics, also known as natural language processing (NLP), is the subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content. Computational linguistic systems can have multiple purposes: The goal can be aiding human-human communication, such as in machine translation (MT); aiding human-machine communication, such as with conversational agents; or benefiting both humans and machines by analyzing and learning from the enormous quantity of human language content that is now available online.

During the first several decades of work in computational linguistics, scientists attempted to write down for computers the vocabularies and rules of human languages. This proved a difficult task, owing to the variability, ambiguity, and context-dependent interpretation of human languages. For instance, a star can be either an astronomical object or a person, and “star” can be a noun or a verb. In another example, two interpretations are possible for the headline “Teacher strikes idle kids,” depending on the noun, verb, and adjective assignments of the words in the sentence, as well as grammatical structure. Beginning in the 1980s, but more widely in the 1990s, NLP was transformed by researchers starting to build models over large quantities of empirical language data. Statistical or corpus (“body of words”)–based NLP was one of the first notable successes of the use of big data, long before the power of ML was more generally recognized or the term “big data” even introduced.
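The headline example above can be made concrete with a small sketch. The Python snippet below is illustrative only: the tag names and the tiny `LEXICON` are hypothetical, written by hand for this one headline rather than taken from any real tagger or dictionary. It enumerates every tag assignment the lexicon allows, which shows how a purely rule-based system is left with competing analyses.

```python
from itertools import product

# Hypothetical hand-written lexicon: each word maps to the tags it may take.
LEXICON = {
    "teacher": ["NOUN"],
    "strikes": ["NOUN", "VERB"],
    "idle": ["ADJ", "VERB"],
    "kids": ["NOUN"],
}


def tag_sequences(sentence):
    """Return every tag sequence that LEXICON permits for the sentence."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [tuple(tags) for tags in product(*options)]


sequences = tag_sequences("Teacher strikes idle kids")
for tags in sequences:
    print(" ".join(tags))
```

Of the four candidate sequences, NOUN VERB ADJ NOUN matches the reading in which a teacher hits idle children, and NOUN NOUN VERB NOUN matches the reading in which teacher strikes leave children idle. The lexicon alone cannot choose between them; that choice depends on context, which is the kind of evidence that models built over large quantities of language data can estimate.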

A central finding of this statistical approach to NLP has been that simple methods using words, part-of-speech (POS) sequences (such as whether a word is a noun, verb, or preposition), or simple templates can often achieve notable results when trained on large quantities of data. Many text and sentiment classifiers are still based solely on the different sets of words (“bag of words”) that documents contain, without regard to sentence and discourse structure or meaning. Achieving improvements over these simple baselines can be quite difficult. Nevertheless, the best-performing systems now use sophisticated ML approaches and a rich understanding of linguistic structure. High-performance tools that identify syntactic and semantic information as well as information about discourse context are now available. One example is Stanford CoreNLP (1), which provides a standard NLP preprocessing pipeline that includes POS tagging (with tags such as noun, verb, and preposition); identification of named entities, such as people, places, and organizations; parsing of sentences into their grammatical structures; and identifying co-references between noun phrase mentions (Fig. 1).
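The “bag of words” representation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the representation used by any particular classifier: the function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order and sentence structure."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a == b)  # True: both sentences contain the same words
```

The two sentences mean different things but produce identical bags, which is what “without regard to sentence and discourse structure or meaning” amounts to in practice. A classifier built on such counts sees only which words occur and how often.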

Historically, two developments enabled the initial transformation of NLP into a big data field. The first was the early availability to researchers of linguistic data in digital form, particularly through the Linguistic Data Consortium (LDC) (2), established in 1992. Today, large amounts of digital text can easily be downloaded from the Web. Available as linguistically annotated data are large speech and text corpora annotated with POS tags, syntactic parses, semantic labels, annotations of named entities (persons, places, organizations), dialogue acts (statement, question, request), emotions and positive or negative sentiment, and discourse structure (topic or rhetorical structure). Second, performance improvements in NLP were spurred on by shared task competitions. Originally, these competitions were largely funded and organized by the U.S. Department of Defense, but they were later organized by the research community itself, such as the CoNLL Shared Tasks (3). These tasks were a precursor of modern ML predictive modeling and analytics competitions, such as on Kaggle (4), in which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models.

A major limitation of NLP today is the fact that most NLP resources and systems are available only for high-resource languages (HRLs), such as English, French, Spanish, German, and Chinese. In contrast, many low-resource languages (LRLs), such as Bengali, Indonesian, Punjabi, Cebuano, and Swahili, spoken and written by millions of people, have no such resources or systems available. A future challenge for the language community is how to develop resources and tools for hundreds or thousands of languages, not just a few.

Machine translation

Proficiency in languages was traditionally a hallmark of a learned person. Although the social standing of this human skill has declined in the modern age of science and machines, translation between human languages remains crucially important, and MT is perhaps the most substantial way in which computers could aid human-human communication. Moreover, the ability of computers to translate between human languages remains a consummate test of machine intelligence: Correct translation requires not only the ability to analyze and generate sentences in human languages but also a humanlike understanding of world knowledge and context, despite the ambiguities of languages. For example, the French word “bordel” straightforwardly means “brothel”; but if someone says “My room is un bordel,” then a translating machine has to know enough to suspect that this person is probably not running a brothel in his or her room but rather is saying “My room is a complete mess.”

Machine translation was one of the first non-numeric applications of computers and was studied intensively starting in the late 1950s. However, the hand-built grammar-based systems of early decades achieved very limited success. The field was transformed in the early 1990s when researchers at IBM acquired a large quantity of English and French sentences that were translations of each other (known as parallel text), produced as the proceedings of the bilingual Canadian Parliament. These data allowed them to collect statistics of word translations and word sequences and to build a probabilistic model of MT (5).
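To give a feel for how word-translation statistics can be collected from parallel text, here is a minimal Python sketch. It is an assumption-laden illustration, not the system described in (5): it uses expectation-maximization updates in the style of what is usually called IBM Model 1, omits the NULL word and any model of word order, and runs on a three-sentence toy corpus invented for this example. The names `train_word_translation`, `best_french_word`, and `TOY_CORPUS` are hypothetical.

```python
from collections import defaultdict

# Invented toy parallel corpus: (French tokens, English tokens).
TOY_CORPUS = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]


def train_word_translation(pairs, iterations=10):
    """Estimate t(french | english) from sentence pairs with EM."""
    french_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over French words.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (french, english) links
        total = defaultdict(float)  # expected count of each English word
        for fs, es in pairs:
            for f in fs:
                # E-step: share one unit of f among the English words present.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_french_word(t, english_word):
    """Return the French word with the highest t(french | english_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


table = train_word_translation(TOY_CORPUS)
for english in ["the", "house", "flower", "blue"]:
    print(english, "->", best_french_word(table, english))
```

No sentence pair says which word translates which, yet the co-occurrence pattern across sentences is enough for the estimates to drift toward the intuitive pairings. With real parallel text the same idea operates over millions of sentence pairs and is combined with statistics about word sequences.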

Following a quiet period in the late 1990s, the new millennium brought the potent combination of ample online text, including considerable quantities of parallel text, much more abundant and inexpensive computing, and a new idea for building statistical phrase-based MT systems

SCIENCE sciencemag.org 17 JULY 2015 • VOL 349 ISSUE 6245 261

¹Department of Computer Science, Columbia University, New York, NY 10027, USA. ²Department of Linguistics, Stanford University, Stanford, CA 94305-2150, USA. ³Department of Computer Science, Stanford University, Stanford, CA 94305-9020, USA.

*Corresponding author. E-mail: julia@cs.columbia.edu

Downloaded from www.sciencemag.org on July 16, 2015
