High-Quality Feature Extension Modes Improve Chinese Short-Text Classification

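This paper proposes a new method for Chinese short-text classification that centers on extracting and applying high-quality feature extension modes. A feature extension mode is treated as a set of terms that co-occur in the training data, and its quality is judged by its confidence, category homoplasy, and relevancy strength. The paper presents an algorithm that extracts high-quality feature extension modes from the training data, then shows how to apply them to short-text classification by adding new features or adjusting the weights of the initial features to strengthen the text representation. The experimental results indicate that the method improves Chinese short-text classification and outperforms conventional text classification techniques.

Short texts are terse and express concepts weakly, which makes Chinese short-text classification difficult. The paper's answer is to use high-quality feature extension modes. A feature extension mode is a structure built on term co-occurrence in the training data, and it can expose latent semantic information in a text. The authors propose three measures of the quality of a feature extension mode:

1. **Confidence**: measures how stable and reliable a feature extension mode is in the training data, i.e., the difference between how often the mode occurs and the probability of it occurring by chance.
2. **Category homoplasy**: evaluates whether the mode tends to occur in texts of the same category, which helps ensure that the extracted modes can distinguish between text categories.
3. **Relevancy strength**: measures how relevant the feature extension mode is to the target classification task, so that the mode can actually influence the classification decision.

To extract these high-quality feature extension modes from large amounts of data, the paper designs an algorithm. The algorithm may involve finding frequent itemsets, computing the measures above, and applying thresholds to filter out low-quality modes.

Next, the paper presents a Chinese short-text classification algorithm that uses the feature extension modes. The original features of a short text are extended in two ways: by adding new features generated from the feature extension modes, and by adjusting the weights of the initial features according to their relationship with the feature extension modes. The approach takes into account how non-feature terms relate to the feature extension modes, and so captures the semantics of the text more fully.

The experimental results show that using high-quality feature extension modes improves accuracy on Chinese short-text classification and performs better than conventional text classification methods. This suggests that, for information-sparse short texts, both term co-occurrence and the quality of the co-occurrence modes matter.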

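The extraction steps described above (count co-occurring terms, score each candidate, filter by thresholds) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the paper's exact formulas are not given here, so the measure definitions are assumptions. Confidence is taken as the weaker of the two conditional co-occurrence rates, category homoplasy as the share of co-occurrences that fall in the dominant category, and a raw co-occurrence count (`min_support`) stands in crudely for relevancy strength. The function name, parameters, and threshold values are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def extract_modes(docs, min_support=2, min_confidence=0.6, min_homogeneity=0.8):
    """Return term pairs that pass three illustrative quality filters.

    docs: list of (terms, label) pairs, where terms is an iterable of tokens.
    All measure definitions below are assumptions made for illustration.
    """
    term_df = Counter()                 # number of documents containing each term
    pair_labels = defaultdict(Counter)  # pair -> label -> co-occurrence count
    for terms, label in docs:
        unique = sorted(set(terms))
        term_df.update(unique)
        for pair in combinations(unique, 2):
            pair_labels[pair][label] += 1

    modes = []
    for (a, b), by_label in pair_labels.items():
        support = sum(by_label.values())
        if support < min_support:       # assumed stand-in for relevancy strength
            continue
        # Assumed confidence: the weaker of P(b | a) and P(a | b).
        confidence = support / max(term_df[a], term_df[b])
        # Assumed homogeneity: share of co-occurrences in the dominant category.
        label, top = by_label.most_common(1)[0]
        homogeneity = top / support
        if confidence >= min_confidence and homogeneity >= min_homogeneity:
            modes.append({
                "terms": (a, b),
                "label": label,
                "support": support,
                "confidence": confidence,
                "homogeneity": homogeneity,
            })
    # Strongest candidates first.
    return sorted(modes, key=lambda m: m["confidence"], reverse=True)


if __name__ == "__main__":
    sample = [
        (["free", "prize", "click"], "spam"),
        (["free", "prize", "claim"], "spam"),
        (["meeting", "tomorrow", "afternoon"], "normal"),
        (["meeting", "tomorrow", "free"], "normal"),
    ]
    for mode in extract_modes(sample):
        print(mode)
```

Pairs are used here only to keep the sketch short; a frequent-itemset miner would be needed for larger term sets.
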
Utilizing High-quality Feature Extension Mode to Classify Chinese Short-text

Xinghua Fan

College of Computer Science and Technology

Chongqing University of Posts and Telecommunications, Chongqing 400065, China

fanxh@cqupt.edu.cn

Hongge Hu

College of Computer Science and Technology

Chongqing University of Posts and Telecommunications, Chongqing 400065, China

huge120806@163.com

Abstract—This paper presents a method for classifying Chinese short-texts, which have a weak concept signal, in which high-quality feature extension modes are extracted and used effectively. In the method, a feature extension mode is considered to be a set of terms that co-occur in the training data, and three measures that decide whether it is high-quality, i.e., confidence, category homoplasy and relevancy strength, are presented. Then, an algorithm that extracts high-quality feature extension modes from the training data is designed. Next, a Chinese short-text classification algorithm utilizing feature extension modes is presented, in which a short-text is extended by adding new features or modifying the weights of initial features, according to the relationship between non-feature terms and feature extension modes. The experiments show that (1) high-quality feature extension modes help to improve Chinese short-text classification; and (2) the proposed method obtains higher classification performance than conventional text classification methods.
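The extension step in the abstract (add new features or modify the weights of initial features, driven by non-feature terms) can be illustrated with a short sketch. The rule used here is an assumed reading rather than the paper's procedure: each non-feature term in the text looks up its partners in the mode library, and every partner that is a selected feature either enters the vector as a new feature or has its existing weight raised. The function name, the dictionary layout of `modes`, and the `boost` value are hypothetical.

```python
def extend_short_text(terms, feature_set, modes, boost=0.5):
    """Build a weighted feature vector for one short text.

    terms: tokens of the short text.
    feature_set: terms selected as classification features.
    modes: dict mapping a term to the set of terms it co-occurs with
           (an assumed layout for the feature extension mode library).
    boost: hypothetical weight contributed by each matching mode.
    """
    vector = {}
    # Initial representation: term frequency over the selected features.
    for t in terms:
        if t in feature_set:
            vector[t] = vector.get(t, 0.0) + 1.0

    # Extension: non-feature terms contribute through their mode partners.
    for t in terms:
        if t in feature_set:
            continue
        for partner in sorted(modes.get(t, ())):
            if partner not in feature_set:
                continue
            # Adds a new feature, or raises the weight of an initial one.
            vector[partner] = vector.get(partner, 0.0) + boost
    return vector


if __name__ == "__main__":
    library = {"winner": {"prize", "free"}, "click": {"link"}}
    print(extend_short_text(["prize", "winner", "click"], {"prize", "free"}, library))
```

A single code path covers both cases named in the abstract: a partner absent from the text becomes a new feature, while a partner already present has its weight adjusted.
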

Index Terms—Chinese short-text classification, co-occurrence relationship, high-quality feature extension mode, feature extension

I. INTRODUCTION

With the rapid development of information technology, the forms of information transmission have been continually enriched. Short-text, which generally has no more than 160 characters and is the major form of information such as SMS, online chatting and netizen comments, has become an important channel for the dissemination of public information. But with the explosive growth of short-text, the propagation of all kinds of pornography, violence, rumor, reactionary remarks, fraud and illegal advertising carried by short-text on the network will inevitably become a hidden danger to social stability. So we must effectively monitor, intercept and filter such harmful information. Short-text classification technology [1-5], which assigns a short-text to pre-defined classes based on content analysis, is an effective way to address the above-mentioned problems.

Because short texts have inherent defects such as short length, weak concept signal and high ambiguity, short-text categorization is a very challenging task. So far, there are two approaches to handling the difficulties in short-text classification: one makes use of extra information in external resources, such as HowNet or a background corpus, to assist classification [3-5]; the other mines the internal implied information to help categorization [6-7]. Although the first approach can increase the amount of information in a short-text, it depends strongly on the construction of and access to external resources; besides, it is hard to ensure the homogeneity of the additional information and the internal information. Following the second approach, Zelikovitz [6-7] tried to exploit the internal relations of terms by singular value decomposition and achieved a certain effect, but the method demands too much of the computer's processing capacity when handling large amounts of data.

In this paper, we focus on the first approach and implement it by utilizing the co-occurrence relationships hidden in the training data, which is used as the background corpus, to classify Chinese short-text. As important extra information, a co-occurrence relationship, which is a set of terms that co-occur in the background corpus, is helpful for short-text classification [5]. A process such as that in [5], which introduced simple co-occurrence relationships without considering their quality, may bring noise into short-text classification and make it difficult to obtain a satisfactory classification result. If we call the collection of co-occurrence relationships among terms used for short-text classification the feature extension mode library, it is clear that the core issues in improving classification performance are to build a high-quality feature extension mode library and to find a sound method of utilizing the high-quality feature extension modes. This requires solving the following problems. (1) What measures can determine whether a feature extension mode, i.e., a co-occurrence relationship, is of high quality? (2) How to utilize the measures to extract

JOURNAL OF NETWORKS, VOL. 5, NO. 12, DECEMBER 2010 1417

doi:10.4304/jnw.5.12.1417-1425
