大规模监督数据生成方法：短文本情感分析新途径

86 浏览量更新于2024-07-15 收藏 1.14MB PDF 举报

"这篇研究论文提出了一种新颖的方法来生成大量监督数据，用于短文本情感分析。该方法针对自然语言处理中的复杂性、语义结构和标记数据的相对稀缺性，特别是对于短文本处理的挑战，提出了深度学习模型训练所需的大规模数据生成策略。通过多粒度文本导向的数据增强技术，生成人工训练数据，并与生成对抗网络（GAN）进行了比较。此外，论文还提出了一种混合神经网络模型架构（LSCNN），在应用了数据增强技术后，该模型的表现超过了多种单一神经网络模型。" 文章内容深入探讨了在短文本情感分析中面临的困难，主要在于语言结构的复杂性、语义的多层次性和标注数据的不足。由于这些因素，情感分析被视为自然语言处理中的一个难题。特别是对于短文本，由于信息量有限，情感判断更为困难。为了克服这些问题，作者提出了多粒度文本导向的数据增强技术。这种方法旨在通过生成模拟真实世界的、带有标签的大量训练数据，来弥补真实标注数据的不足。数据增强技术可以丰富模型的输入多样性，帮助模型更好地理解和学习文本的多维度特征，从而降低数据稀疏性和防止过拟合。同时，论文中还介绍了一种名为LSCNN（可能是Local-Global Convolutional Neural Network的缩写）的新型混合神经网络模型。LSCNN设计的目标是结合局部和全局的上下文信息，以提高对短文本情感分析的准确性。在应用了数据增强技术之后，LSCNN模型的性能得到了显著提升，超越了其他单一的神经网络模型。此外，论文还提及了将生成对抗网络（GAN）作为比较基准。GAN是一种强大的生成模型，能够生成逼真的新样本，但可能在处理文本数据时面临挑战。通过对比，论文表明所提出的多粒度数据增强技术和LSCNN模型在特定的短文本情感分析任务上具有优势。这篇研究论文为解决短文本情感分析中的数据稀缺问题提供了一个创新的解决方案，通过数据增强技术和混合神经网络模型的结合，提升了模型的性能和泛化能力。这为后续的自然语言处理研究和实际应用提供了有价值的参考。

Multimed Tools Appl

words or phrases are introduce as noise to the sample sentences. The position and num-

ber of meaningless words to be inserted are automatically and randomly. Given a text

or sentence S(w

...w

), we randomly insert m(m ∈

[

1, 10

]

)meaningless words to



(

,</s >,w

...< /s >,w

)

,where‘</s>’ denotes the noise words that are

inserted into the sentence or text.

The proposed word-level data augmentation method such as substitution, translation,

and insertion operations are running automatically and randomly. Random processing may

introduce more diversity of text. The detail of the algorithm is described as Algorithm 1.

Algorithm 1 Data augmentation for word-level

Require: Text String ThesaurusHashMap Word2Vec

1: function MAIN(text thesaurus dropWord dropDic)

2: POS ADVADJ NOUN

3: result null

4: words filterByPos PosTag text POS thesaurus

5: permutationgetPermutationwords dropWords

6: for 0 words.length do

7: synonyms.add(thesaurus.get(words[i]))

8: synonyms.add(word2vec.getNeighbors(words[i]))

9: end for

10: combinationgetCombination synonymsdropDic

11: for 0 permutation.length do

12: temp text.replace permutation combination

13: result.add temp

14: end for

15: TRANSLATE sentence flips result

16: INSERT sentence inserts result

17: return result

18: end function

19:

20:

function TRANSLATE(sentence num list)

21: for 0 num do

22: tempsentence

23: list.add ‘ num temp

24: end for

25: return list

26: end function

27:

28:

function INSERT(sentence num list)

29: for 0 num do

30: words sentence.split

31: words.add random.words.size

32: list.add array2Str words

33: words.clear

34: end for

35: return list

36: end function

剩余20页未读，继续阅读

weixin_38707217

粉丝: 3
资源: 903

大规模监督数据生成方法：短文本情感分析新途径

Semi Supervised Learning for Few Shot Image to Image Translation

A novel scheme to generate 40-GHz CSRZ pulse trains using a 10-GHz dual-parallel Mach-Zehnder modulator

PhotoWall:Just sort out the photos and generate the photo wall with one click; Realize the alignment of a large number of photos of any scale without stretching and masking the photos; Completed by JS CSS only, independent of node or server.只需整理好照片，一键生成照

A new method to generate flattened Gaussian beam by incoherent combination of cosh Gaussian beams

Novel WDM-PON architecture for simultaneous transmission of unicast data and multicast services

MATLAB Sparse Matrix Representation: Efficient Processing of Large-Scale Sparse Data, 2 Key ...

Generate a stream of binary data

最新资源