A New Method for Short Text Classification: Fusing Lexical Category and Semantic Features


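This paper examines how to classify short texts effectively. Short texts are severely sparse and high-dimensional, which makes traditional classification methods struggle. To address these properties, the authors propose a classification strategy that combines lexical features with semantic features.

First, the authors build a term dictionary by selecting the most representative words of each category as features. The aim is to make the text representation more precise so that the classification model captures the key information. The selection may involve statistical analysis or domain expertise, so that the chosen words accurately reflect the topic or category of a text.

Next, the authors use Latent Dirichlet Allocation (LDA) to extract the optimal topic distribution from a background knowledge repository. LDA is an unsupervised machine learning technique commonly used for topic modeling. It automatically discovers the hidden topic structure of documents and assigns each text a probability distribution that describes how strongly the text expresses each topic. The goal is to capture deeper semantic information and so improve classification accuracy.

With the lexical features and the optimal topic distribution in hand, the authors combine the two feature types into a new feature vector for each short text. The fusion captures several aspects of a text: not only word frequency and word choice, but also topical content and latent semantic associations.

The experimental results show that this method, based on improved lexical category and semantic features, significantly improves classification performance and quality. Compared with traditional methods, it copes better with sparsity and high dimensionality, which supports practical applications of short text classification.

Keywords: short text classification, Latent Dirichlet Allocation, lexical features, semantic features, optimal topic distribution.

The work matters for the efficiency and accuracy of text mining, and it offers later researchers new ideas and techniques for similar problems.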
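
The fusion step described in the summary above can be pictured as joining a lexical vector and a topic distribution into one feature vector. The following is a minimal sketch, not the authors' implementation: plain concatenation is an assumption, and the names `lexical_vector`, `fuse_features`, the term dictionary, and the topic probabilities are all hypothetical values chosen for illustration.

```python
from collections import Counter

def lexical_vector(tokens, term_dictionary):
    """Count how often each dictionary term occurs in a tokenized short text."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in term_dictionary]

def fuse_features(tokens, term_dictionary, topic_distribution):
    """Concatenate lexical counts with a topic distribution (assumed precomputed)."""
    return lexical_vector(tokens, term_dictionary) + list(topic_distribution)

# Hypothetical inputs: dictionary terms and topic probabilities are illustrative only.
term_dictionary = ["goal", "match", "stock", "market"]
tokens = ["late", "goal", "wins", "match"]
topic_distribution = [0.7, 0.2, 0.1]  # e.g. inferred by an LDA model with 3 topics

fused = fuse_features(tokens, term_dictionary, topic_distribution)
```

The resulting vector has one entry per dictionary term followed by one entry per topic, so a downstream classifier sees both kinds of evidence side by side.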

2.1 Expected Cross Entropy

Expected cross entropy (ECE) is a feature selection measure based on information theory that considers both word frequency and the relationship between a word and a category. The larger the ECE value, the more informative the feature is for classification. The ECE value of a word $t_i$ is usually calculated as follows:

$$\mathrm{ECE}(t_i) = P(t_i) \sum_{j} P(C_j \mid t_i) \log \frac{P(C_j \mid t_i)}{P(C_j)} \tag{1}$$

where $t_i$ represents a word, $C_j$ represents category $j$, and $P(C_j \mid t_i)$ is the probability that a text belongs to category $C_j$ given that it contains the word $t_i$.
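
To make the ECE measure concrete, here is a small sketch that scores every word in a labeled toy corpus. It assumes document-level estimates, which the text does not specify: $P(t)$ is the fraction of texts containing the word, $P(C)$ the fraction of texts in the category, and $P(C \mid t)$ the fraction of texts containing the word that belong to the category. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(docs, labels):
    """Return {word: ECE} using document-level probability estimates (an assumption)."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    doc_freq = Counter()          # number of texts containing each word
    joint = defaultdict(Counter)  # joint[word][label] = texts with that word and label
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            doc_freq[word] += 1
            joint[word][label] += 1

    scores = {}
    for word, df in doc_freq.items():
        p_t = df / n_docs
        total = 0.0
        # Categories with P(C | t) = 0 contribute nothing, so only seen labels are visited.
        for label, n_tc in joint[word].items():
            p_c_given_t = n_tc / df
            p_c = class_counts[label] / n_docs
            total += p_c_given_t * math.log(p_c_given_t / p_c)
        scores[word] = p_t * total
    return scores

# Toy corpus, illustrative only.
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["market", "team"]]
labels = ["sports", "sports", "finance", "finance"]
scores = expected_cross_entropy(docs, labels)
```

In this toy corpus "goal" appears only in sports texts and scores highest among the sports words, while "team" is spread evenly over both categories and scores zero, which matches the intuition that a larger ECE marks a more informative feature.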

2.2 Correlation Weight

Correlation weight (COW) is an efficient way of term weighting for short text, which considers the correlation of terms within a short text [8]. Concretely, the conditional probability is used to model the probability that terms appear together in a short text, and the probabilistic correlation of terms is defined in a symmetric way as follows:

$$\mathrm{cor}(t_i, t_j) = p(t_i \mid t_j) \times p(t_j \mid t_i) \tag{2}$$

which represents the probabilistic degree to which words $t_i$ and $t_j$ occur in the same short texts.
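
The symmetric correlation can be sketched as the product of two conditional probabilities estimated from co-occurrence counts. The estimator below is an assumption (the text does not say how the conditional probabilities are obtained), and the function names and toy corpus are hypothetical.

```python
def conditional_probability(word, given, docs):
    """Estimate p(word | given) as the share of texts containing `given` that also contain `word`."""
    with_given = [set(tokens) for tokens in docs if given in tokens]
    if not with_given:
        return 0.0
    return sum(word in tokens for tokens in with_given) / len(with_given)

def cor(t_i, t_j, docs):
    """Symmetric probabilistic correlation: p(t_i | t_j) * p(t_j | t_i)."""
    return conditional_probability(t_i, t_j, docs) * conditional_probability(t_j, t_i, docs)

# Toy corpus, illustrative only.
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["market", "team"]]
goal_match = cor("goal", "match", docs)
goal_stock = cor("goal", "stock", docs)
```

Multiplying both directions keeps the measure symmetric: a pair only scores highly when each word is a good predictor of the other.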

The correlation weight denotes the reliability and importance of the word $t_i$ in the short text. A higher correlation weight implies a higher probability that, if word $t_i$ appears in the short text, other words $t_j$ will also appear in it. In other words, the more words $t_j$ that show high correlation with word $t_i$, the higher the probability that word $t_i$ is relevant to the short text. Given a short text $d_k$ with an initial weight $w_i$ (term frequency) for each word $t_i$, the correlation weight of word $t_i$ in the short text $d_k$ is defined as:

$$\mathrm{COW}(t_i) = w_i \times \sum_{j=1}^{|d_k|} \mathrm{cor}(t_i, t_j) \tag{3}$$

where $|d_k|$ is the total number of words in the short text $d_k$.
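
The weighting step can be sketched as follows. The exact form of equation (3) is only partly legible in this copy, so the code assumes that the initial weight (term frequency) is multiplied by the sum of correlations with every word position in the text. The function `correlation_weights`, the `toy_cor` lookup, and its values are hypothetical; any correlation function with the same two-argument shape could be passed in.

```python
from collections import Counter

def correlation_weights(tokens, cor):
    """Return {word: COW} for one short text.

    Assumed form: COW(t_i) = w_i * sum of cor(t_i, t_j) over all positions j in the text,
    where w_i is the term frequency of t_i.
    """
    term_freq = Counter(tokens)
    weights = {}
    for word, w_i in term_freq.items():
        weights[word] = w_i * sum(cor(word, other) for other in tokens)
    return weights

# Hypothetical correlation values for illustration; unlisted pairs count as uncorrelated.
table = {frozenset(["goal", "match"]): 0.5, frozenset(["goal", "team"]): 0.25}

def toy_cor(a, b):
    """Look up a symmetric correlation; a word is fully correlated with itself."""
    if a == b:
        return 1.0
    return table.get(frozenset([a, b]), 0.0)

weights = correlation_weights(["goal", "match", "team"], toy_cor)
```

Here "goal" ends up with the largest weight because it correlates with both other words in the text, which mirrors the claim that a word supported by many correlated neighbors is more likely to be relevant.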

2.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA), first introduced by Yang et al. [9], is a generative probabilistic model of a corpus that can be used to estimate multinomial observations by unsupervised learning. It can model and discover the underlying topic structure of any kind of discrete data, of which text is a typical example. The basic idea is that each document is characterized as a multinomial distribution over latent topics, while each topic is represented by a multinomial distribution over words.
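
The generative view described above can be illustrated by sampling a toy document: draw a topic mixture for the document, then for each position draw a topic and a word from that topic. This is a sketch of the standard LDA generative story only, not of inference and not of the authors' setup. The Dirichlet prior `alpha`, the two topics, and the vocabulary are hypothetical values.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocabulary, length, rng):
    """Generate one document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # the document's distribution over topics
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical model: two topics over a four-word vocabulary.
rng = random.Random(0)
vocabulary = ["goal", "match", "stock", "market"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "sports" topic
    [0.0, 0.0, 0.5, 0.5],  # a "finance" topic
]
theta, words = generate_document(topic_word, [0.5, 0.5], vocabulary, 6, rng)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.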

Effectively Classifying Short Texts via Improved Lexical Category 165
