Short Text Feature Selection: An Effective Method Based on Co-occurrence Distance and Strong Classification Features


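This paper explores a novel method for short text feature selection that combines term co-occurrence distance with strong classification features. Because short texts are information-dense yet lexically sparse, effective feature selection is essential to model performance. Traditional methods often rely on statistics such as term frequency or TF-IDF, which may not fully capture the semantic relations between words.

First, the authors introduce the concept of term co-occurrence distance, an indicator of how often and in what pattern a word appears near its neighboring words in a text. Computing the co-occurrence distance between the terms of each document reveals how strongly they are related, and a correlation weight is assigned to each term on that basis. This helps separate terms that frequently appear together in context and are strongly related in meaning from terms that occur in isolation or are only weakly related.

Second, to strengthen the discriminative power of the selected features, the authors propose an improved expected cross entropy (IECE). This measure is designed to capture how strongly a word indicates a class, that is, how much the word contributes to the classification task when it appears in a particular class. IECE quantifies the strong association between words and classes, so the weight of each term in each class can be determined more accurately.

In practice, all terms of each class are sorted by weight, and the top k terms are selected as feature terms. This strategy ensures that the selected features reflect the class characteristics of the text as far as possible, which improves the performance of short text classification models.

The experiments validate the method: compared with traditional feature selection strategies, the approach based on co-occurrence distance and strong classification features significantly reduces noisy features while preserving informative ones, improving the efficiency and accuracy of short text feature selection. This has practical value for short text tasks such as sentiment analysis and topic classification.

In summary, the paper proposes a short text feature selection method that, by combining co-occurrence distance with strong classification features, both strengthens the association between features and text classes and makes feature selection more targeted, supporting downstream text mining and machine learning applications. Future research could further explore how to optimize the computation of co-occurrence distance and refine IECE to suit more diverse text types and application scenarios.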
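The summary above does not give the paper's formula for the co-occurrence distance correlation, so the following is only a minimal sketch under an assumed rule: two different terms in the same document are correlated by the inverse of the distance between their positions, and a term's weight is the sum of its correlations with all other terms. The function name `cooccurrence_weights` and the `1/d` decay are illustrative assumptions, not the authors' definition.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(docs):
    """Return a hypothetical correlation weight for every term.

    Assumption (not from the paper): two different terms in one document
    are correlated by 1/d, where d is the distance between their positions.
    """
    pair_corr = defaultdict(float)
    for tokens in docs:
        # Visit every pair of positions (i < j) inside the document.
        for (i, a), (j, b) in combinations(enumerate(tokens), 2):
            if a == b:
                continue  # a term is not correlated with itself
            pair = tuple(sorted((a, b)))
            pair_corr[pair] += 1.0 / (j - i)

    # A term's weight is the sum of its correlations with all other terms.
    weights = defaultdict(float)
    for (a, b), corr in pair_corr.items():
        weights[a] += corr
        weights[b] += corr
    return dict(weights)


# Toy corpus of tokenized short texts (made-up data).
docs = [["cheap", "flight", "deal"], ["flight", "delay", "news"]]
weights = cooccurrence_weights(docs)
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(term, weight)
```

In this toy corpus, "flight" sits next to or near every other term in both documents, so it receives the largest weight; a term that only appears far from its neighbors would score lower.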

Leveraging Term Co-occurrence Distance and Strong Classification Features for Short Text Feature Selection

Huifang Ma, Yuying Xing, Shuang Wang, and Miao Li

College of Computer Science and Engineering, Northwest Normal University, Lanzhou, China

mahuifang@yeah.net

Abstract. In this paper, a short text feature selection method based on term co-occurrence distance and strong classification features is presented. On the one hand, the co-occurrence distance between terms in each document is used to determine the co-occurrence distance correlation, from which a correlation weight for each term is defined. On the other hand, an improved expected cross entropy is defined to obtain the weight of a term in a particular class with strong class indication. All terms of each class are sorted in descending order of their weights, and the top-k terms are selected as feature terms. Experiments show that our method can improve the effectiveness of short text feature selection.

Keywords: Short text · Co-occurrence distance · Strong classification feature · Expected cross entropy · Feature selection
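The abstract does not spell out the improved expected cross entropy, so the sketch below uses one summand of the standard expected cross entropy, P(t) · P(c|t) · log(P(c|t) / P(c)), as a stand-in class-specific weight, and then applies the per-class descending sort and top-k selection described above. The function names, the document-level probability estimates, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def class_term_scores(docs, labels):
    """Score each (class, term) pair with one summand of standard ECE.

    score(t, c) = P(t) * P(c|t) * log(P(c|t) / P(c)),
    with probabilities estimated from document counts.
    This is the standard measure, not the paper's improved variant.
    """
    n_docs = len(docs)
    class_count = Counter(labels)
    term_count = Counter()
    term_class_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_count[term] += 1
            term_class_count[term][label] += 1

    scores = defaultdict(dict)
    for term, df in term_count.items():
        p_t = df / n_docs
        for label, joint in term_class_count[term].items():
            p_c_given_t = joint / df
            p_c = class_count[label] / n_docs
            scores[label][term] = p_t * p_c_given_t * math.log(p_c_given_t / p_c)
    return scores


def top_k_terms(scores, k):
    """Sort the terms of each class by descending weight and keep the top k."""
    return {
        label: sorted(term_scores, key=term_scores.get, reverse=True)[:k]
        for label, term_scores in scores.items()
    }


# Toy labeled corpus (made-up data).
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]
scores = class_term_scores(docs, labels)
selected = top_k_terms(scores, k=2)
print(selected)
```

Here "team" occurs equally often in both classes, so its score is zero in each and it is never selected, which matches the intuition that an evenly distributed term carries little class indication.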

1 Introduction

In recent years, with the rapid growth of the Web and social media, more and more information exists in the form of short texts, and its volume is growing explosively. Various approaches have been put forward in the past years to reduce dimensionality. More specifically, there are two main methods of dimensionality reduction [1]. One is feature selection, which chooses a subset of the original features so that the feature space is optimally reduced according to a certain criterion. The other is feature extraction, which constructs a set of new features from the original features. They are used either in isolation or in combination.

Term weighting has proved to be an effective way to improve the expressiveness of short text classification. There are two kinds of traditional weighting methods: unsupervised methods, such as term frequency (TF) and term frequency-inverse document frequency (TF*IDF) [2], and supervised methods, such as information gain (IG), expected cross entropy (ECE), tf*χ², and so on. From the point of view of co-occurrence between terms, two terms are considered related if they frequently co-occur in the entire corpus. Because a short text contains few words, the co-occurrence distance between two terms can also influence their semantic relation. From the perspective of different classes, if a term is distributed more evenly across the classes, it hardly makes any contribution

G. Li et al. (Eds.): KSEM 2017, LNAI 10412, pp. 67–75, 2017.

DOI: 10.1007/978-3-319-63558-3_6
