自然语言处理中的数据增强技术：复述、噪声与抽样

需积分: 5 38 浏览量更新于2024-07-09 收藏 2.24MB PDF 举报

"哈工大发表的《自然语言处理数据增强方法》综述论文，155页，详细探讨了复述、噪声和抽样三大数据增强策略在NLP领域的应用和挑战。" 在自然语言处理（NLP）领域，数据增强已经成为提升模型性能和泛化能力的关键技术。这篇由Bohan Li, Yutai Hou和Wanxiang Che共同撰写的综述论文，全面阐述了如何利用数据增强来应对数据稀缺问题。论文首先定义了数据增强的基本概念，即通过现有数据合成新数据，以提升模型的鲁棒性和泛化性能。一、复述（Paraphrasing）复述是通过对句子的词汇、短语或结构进行改写，同时保持原文的语义。这种方法旨在创造语法正确且意义不变的新句子，增加了训练数据的多样性。复述可以通过同义词替换、句法重排、语义保留的翻译等多种方式实现。例如，使用预训练的语义模型进行文本生成，或者运用规则和模板来构造新句子。二、噪声（Noising）噪声注入是在保持标签不变的情况下，向输入数据添加离散或连续的干扰。这些干扰可能包括拼写错误、词汇替换、插入或删除随机字符等，其目的是模拟真实世界中的噪声环境，让模型在处理不完美数据时更具适应性。不过，添加噪声需要谨慎，以避免对语义造成过大影响。三、抽样（Sampling）抽样方法则是根据现有数据分布生成新的样本，目标是增加数据的多样性和覆盖范围。这可以是基于概率的采样，如在文本生成中使用自回归模型，或者在对话系统中模拟用户行为。抽样策略需要确保新生成的样本既具有代表性，又不与原始数据过于相似，以促进模型学习更广泛的模式。论文进一步分析了这些方法在各种NLP任务中的应用，如情感分析、机器翻译、问答系统和对话生成等，并讨论了数据增强面临的挑战。这些挑战包括如何保持语义一致性、避免过度拟合、处理长文本和结构化数据，以及评估增强数据的有效性。关键词：数据增强、自然语言处理 2010MSC分类：00-01, 99-00 通讯作者邮箱：bhli@ir.hit.edu.cn (Bohan Li), ythou@ir.hit.edu.cn (Yutai Hou), car@ir.hit.edu.cn (Wanxiang Che) 这篇预印本论文已提交至《Journal of L》出版，为NLP研究者和从业者提供了关于数据增强的深入理解和实践指导。

Figure 5: Paraphrasing by using semantic embeddings.

r is determined by a geometric distribution with parameter p in which P [r] ∼ p

The index s of the synonym chosen given a word is also determined by a another

geometric distribution in which P [s] ∼ p

. This method ensures synonyms that

are more similar to the original word are selected with a greater probability.

Some methods [41, 42, 43] apply a similar method.

A widely used text augmentation method called EDA (Easy Data Aug-

mentation Techniques) [6] also replaces the original words with their synonyms

using WordNet: they randomly choose n words from the sentence that are not

stop words, and replace each of these words with one of its synonyms chosen

at random, instead of following the geometric distribution.

Zhang et al. [44]

apply a similar method in extreme multi-label classiﬁcation.

In addition to synonyms, Coulombe et al. [7] propose to use hypernyms

to replace the original words. They also recommend the types of words that

are candidates for lexical substitution in order of increasing diﬃculty: adverbs,

adjectives, nouns and verbs. Zuo et al. [45] use WordNet and VerbNet [46] to

retrieve synonyms, hypernyms, and words of the same category.



Thesauruses

Advantage(s):

1. Easy to use.

Limitation(s):

1. The scope and part-of-speech of replacement words are limited.

2. This method cannot solve the problem of ambiguity.

3. The sentence semantics may be aﬀected if too many replacements occur.

2.1.2. Semantic Embeddings

This method overcomes the limitation of the replacement range and word

part-of-speech in the Thesaurus-based method. It uses pre-trained word vectors,

such as Glove, Word2Vec, FastText, etc., and replaces them with the word

closest to the original word in the vector space, as shown in Figure 5.

In the Twitter message classiﬁcation task, Wang et al. [8] pioneer to use

both word embeddings and frame embeddings instead of discrete words.

for word embeddings, each original word in the tweet is replaced with one of

the k-nearest-neighbor words using cosine similarity. For example, “Being late

is terrible” becomes “Being behind are bad”. As for frame semantic embeddings,

n is proportional to the length of the sentence.

The frame embeddings refer to the continuous embeddings of semantic frames [47].

Figure 6: Paraphrasing by using language models.

the authors semantically parse 3.8 million tweets and build a continuous bag-of-

frame model to represent each semantic frame using Word2Vec [48]. The same

data augmentation approach as words is then applied to semantic frames.

Compared to Wang et al. [8], Liu et al. [49] only use word embeddings to

retrieve synonyms. In the meanwhile, they edit the retrieving result with a

thesaurus for balance. RamirezEchavarria et al. [50] create the dictionary of

embeddings for selection.



Semantic Embeddings

Advantage(s):

1. Easy to use.

2. Higher replacement hit rate and wider replacement range.

Limitation(s):

1. This method cannot solve the problem of ambiguity.

2. The sentence semantics may be aﬀected if too many replacements occur.

2.1.3. Language Models

The pre-trained language model has become the mainstream model in recent

years due to its excellent performance. Masked language models (MLMs) such

as BERT and BoBERTa have obtained the ability to predict the masked words

in the text based on the context through pre-training, which can be used for text

data augmentation (as shown in Figure 6). Moreover, this method alleviates the

problem of ambiguity since MLMs consider the whole context.

Jiao et al. [9] use data augmentation to obtain task-speciﬁc distillation train-

ing data. They apply the tokenizer of BERT to tokenize words into multiple

word pieces and form a candidate set for each word piece. Both word embed-

dings and masked language models are used for word replacement. Speciﬁcally,

if a word piece is not a complete word (“est” for example), the candidate set is

made up of its K-nearest-neighbor words by Glove. If the word piece is a com-

plete word, the authors replace it with [MASK] and employ BERT to predict K

Words to form a candidate set. Finally, a probability of 0.4 is used to determine

whether each word piece is replaced by a random word in the candidate set.

Regina et al. [10], Tapia-Téllez et al. [51], Lowell et al. [52] and Palomino et

al. [53] apply a similar method. They mask multiple words in a sentence and

generate new sentences by ﬁlling these masks to generate more varied sentences.

In addition, RNNs are also used for replacing the original word based on the

context ([54, 55]).

剩余41页未读，继续阅读

努力+努力=幸运

粉丝: 17
资源: 136

自然语言处理中的数据增强技术：复述、噪声与抽样

《深度多任务学习》综述论文

【计算机视觉】针对图像的DataAugmentation（数据扩充）介绍，常用方法和Ten。。。 计算机视觉.pdf

Data Augmentation.ipynb

L22 Data Augmentation数据增强

Data Augmentation for ML-driven Data Preparation and Integration

TensorFlow.keras数据增强Data Augmentation

Data-Augmentation-in-YOLOv3:带有数据扩充策略的另一个Yolo实现

YOLOv8 and Natural Language Processing Integration: A Study on Image and Text Information Fusion ...

Data Augmentation Techniques in YOLOv10: The Secret Weapon for Enhancing Model Generalization

【Data Augmentation】: The Application of GANs in Data Augmentation: The Secret to Enhancing Machine...

最新资源

【计算机视觉】针对图像的DataAugmentation（数据扩充）介绍，常用方法和Ten。。。计算机视觉.pdf