Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification
Peng Wang a,*, Bo Xu a, Jiaming Xu a, Guanhua Tian a, Cheng-Lin Liu a,b, Hongwei Hao a
a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, PR China
b National Laboratory of Pattern Recognition (NLPR), Beijing 100190, PR China
Article info
Article history:
Received 4 May 2015
Received in revised form 22 June 2015
Accepted 30 September 2015
Communicated by Jinhui Tang
Available online 9 October 2015
Keywords:
Short text
Classification
Clustering
Convolutional neural network
Semantic units
Word embeddings
Abstract
Text classification can help users to effectively handle and exploit useful information hidden in large-scale documents. However, the sparsity of data and the semantic sensitivity to context often hinder the classification performance of short texts. To overcome these weaknesses, we propose a unified framework that expands short texts based on word embedding clustering and convolutional neural network (CNN). Empirically, semantically related words are usually close to each other in embedding spaces. Thus, we first discover semantic cliques via fast clustering. Then, by using additive composition over word embeddings from context with variable window width, the representations of multi-scale semantic units¹ in short texts are computed. In embedding spaces, the restricted nearest word embeddings (NWEs)² of the semantic units are chosen to constitute expanded matrices, where the semantic cliques are used as supervision information. Finally, for a short text, the projected matrix³ and the expanded matrices are combined and fed into the CNN in parallel. Experimental results on two open benchmarks validate the effectiveness of the proposed method.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
The classification of short texts, such as search snippets, microblogs, product reviews, and short messages, plays an important role in user intent understanding, question answering and intelligent information retrieval [1]. Since short texts do not provide enough contextual information, the data sparsity problem is easily encountered [2]. Thus, general methods based on the bag-of-words (BoW) model cannot be directly applied to short texts [1], because the BoW model ignores word order and the semantic relations between words. How to acquire effective representations of short texts to enhance categorization performance has been an active research issue [2,3].
Conventional text classification methods often expand short texts using latent semantics learned by latent Dirichlet allocation (LDA) [4] and its extensions. Phan et al. [3] presented a general framework to expand short and sparse texts by appending topic names discovered by LDA over Wikipedia. Sahami and Heilman [5] enriched text representations with web search results, using the short text segment as a query. Furthermore, Yan et al. [6] presented a variant of LDA, dubbed the biterm topic model (BTM), especially for short text modeling to alleviate the data sparsity problem. However, these methods still treat a text as a BoW. Therefore, they are not effective in capturing fine-grained semantics for short text modeling.
More recently, deep learning based methods have drawn much attention in the field of natural language processing (NLP), and they have mainly evolved along two branches. One is to learn word embeddings by training language models [7–10]; the other is to perform semantic composition to obtain phrase- or sentence-level representations [11,12]. Word embeddings, also known as distributed representations, typically represent words with dense, low-dimensional, real-valued vectors, where each dimension encodes a different aspect of a word. In embedding spaces, semantically close words are likely to cluster together and form semantic cliques. Moreover, embedding spaces exhibit a linear structure, such that word embeddings can be meaningfully combined by simple vector addition [9].
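To make this linear structure concrete, the following minimal numpy sketch composes a 2-gram by vector addition and looks up its nearest word embedding by cosine similarity. The vocabulary, dimensionality, and vector values are toy assumptions for illustration, not embeddings used in this paper.

```python
import numpy as np

# Toy 4-dimensional embeddings (hand-picked illustrative values; real
# embeddings would come from a model such as word2vec trained on a corpus).
embeddings = {
    "search": np.array([0.9, 0.1, 0.0, 0.2]),
    "engine": np.array([0.8, 0.2, 0.1, 0.1]),
    "google": np.array([0.85, 0.15, 0.05, 0.15]),
    "banana": np.array([0.0, 0.9, 0.8, 0.1]),
}

def compose(words):
    """Additive composition: the embedding of an n-gram is the sum
    of the embeddings of its constituent words."""
    return np.sum([embeddings[w] for w in words], axis=0)

def nearest(vec, k=1):
    """Rank vocabulary words by cosine similarity to a composed vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(embeddings, key=lambda w: -cos(embeddings[w], vec))[:k]

unit = compose(["search", "engine"])  # a 2-gram "semantic unit"
print(nearest(unit, k=2))             # ['google', 'search']
```

With real pretrained embeddings, this same additive composition is what later produces the multi-scale semantic units whose nearest word embeddings expand the short text.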
In this paper, we aim to obtain semantic representations of short texts and overcome the weaknesses of conventional methods. Similar to Li et al. [13], where cluster indicators learned by non-negative spectral clustering provide label information for structural learning, we develop a novel method to model short texts using word embedding clustering and a convolutional neural network (CNN). For concision, we abbreviate our method to
* Corresponding author.
¹ Semantic units are defined as n-grams that carry the dominant meaning of a text. As n varies, multi-scale contextual information can be exploited.
² To exclude outliers, a Euclidean distance threshold between semantic cliques and semantic units is preset and used as the restriction condition.
³ The projected matrix is obtained by table lookup and encodes unigram-level features.
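As a rough end-to-end illustration of the expansion pipeline sketched in the abstract and footnotes, the code below clusters a toy embedding table into semantic cliques, composes multi-scale semantic units by additive composition over sliding windows, and selects restricted nearest word embeddings under a preset Euclidean threshold, with the cliques supervising which candidates are considered. scikit-learn's KMeans stands in for the fast clustering algorithm used in the paper, and the dimensions, vocabulary, and threshold value are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy vocabulary with random embeddings standing in for pretrained ones.
vocab = ["search", "engine", "google", "web", "banana", "apple", "fruit"]
E = rng.normal(size=(len(vocab), 8))             # |V| x d embedding table
word2id = {w: i for i, w in enumerate(vocab)}

# Step 1: discover "semantic cliques" by clustering the embedding space
# (KMeans is a stand-in for the paper's fast clustering algorithm).
cliques = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(E)
centroids = np.array([E[cliques == c].mean(axis=0) for c in range(2)])

# Step 2: multi-scale semantic units via additive composition over
# sliding windows of width n = 1..max_n.
def semantic_units(tokens, max_n=3):
    units = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ids = [word2id[t] for t in tokens[i:i + n]]
            units.append(E[ids].sum(axis=0))
    return units

# Step 3: restricted NWEs. The clique whose centroid is closest to a unit
# supervises the candidate set; a candidate word is kept only if its
# Euclidean distance to the unit is below the preset threshold.
def expand(tokens, threshold=3.0):
    expanded = []
    for u in semantic_units(tokens):
        c = np.argmin(np.linalg.norm(centroids - u, axis=1))
        cand = np.where(cliques == c)[0]
        dists = np.linalg.norm(E[cand] - u, axis=1)
        expanded += [E[j] for j in cand[dists < threshold]]
    return np.array(expanded)                    # an "expanded matrix"

# Projected matrix: plain table lookup of unigram embeddings (footnote 3).
text = ["search", "engine", "web"]
projected = E[[word2id[t] for t in text]]
expanded = expand(text)
```

The resulting projected and expanded matrices correspond to the inputs that are fed into the CNN in parallel; the CNN itself is omitted from this sketch.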