Two-level hierarchical combination method for text classification
Wen Li a,b,*, Duoqian Miao a, Weili Wang a,b

a Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
b Information Engineering School, Nanchang University, Nanchang 330031, China
Keywords:
Text classification
Combination method
Variable precision rough sets
Support vector machine
k nearest neighbor

Abstract
Text classification has been recognized as one of the key techniques for organizing digital data. The intuition that every algorithm is biased toward certain data, and that a high-performance classifier can therefore be built by combining different algorithms, has long motivated research on classifier combination. In this paper, we propose a two-level hierarchical algorithm that systematically combines the strengths of the support vector machine (SVM) and k nearest neighbor (KNN) techniques based on variable precision rough sets (VPRS) to improve the precision of text classification. First, an extension of the regular SVM named the variable precision rough SVM (VPRSVM), which partitions the feature space into three kinds of approximation regions, is presented. Second, a modified KNN algorithm named the restrictive k nearest neighbor (RKNN) classifier is put forward to reclassify texts in the boundary region effectively and efficiently. The proposed algorithm overcomes the drawbacks of the SVM's sensitivity to noise and the KNN's low efficiency. Experimental results compared with those of traditional algorithms indicate that the proposed method improves the overall performance significantly.
1. Introduction
Text classification (TC), also known as text categorization, aims at automating the process of assigning documents to a set of previously fixed categories and has long been an active research topic. Many popular algorithms have been applied to text categorization. The No Free Lunch (NFL) theorems (Wolpert & Macready, 1997) have shown that no learning algorithm can be universally superior and that every algorithm is biased toward certain data. When the data fit the underlying classification strategy well, the system accuracy can be very high, and vice versa (Tan, Cheng, & Ghanem, 2005). Among the many well-known algorithms, the support vector machine (SVM) (Joachims, 1998) and k nearest neighbor (kNN) (Cover & Hart, 1967) are widely used because of their excellent learning performance both in theory and in practice. But despite their advantages, they also have weaknesses and limitations.
The SVM is well founded in terms of computational learning theory and very open to theoretical understanding. The final classifier obtained by the SVM depends only on a small portion of the training samples, i.e. the support vectors, which is good for implementation. However, this also makes the SVM sensitive to noises and outliers, and the patterns that are wrongly classified tend to lie near the separating hyperplane (Zhang & Wang, 2008).
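To make this dependence on support vectors concrete, the minimal Python sketch below (an illustration, not code from this paper) evaluates the standard kernel SVM decision function f(x) = sum_i alpha_i y_i K(x_i, x) + b; only the support vectors carry nonzero coefficients, so a few mislabeled points that become support vectors near the hyperplane can shift the whole boundary.

import numpy as np

def svm_decision(x, support_vectors, dual_coefs, bias, kernel):
    # f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b: only the support vectors
    # (points with nonzero dual coefficients) contribute to the decision.
    return sum(coef * kernel(sv, x)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias

# Toy usage with a linear kernel and two hypothetical support vectors.
linear = lambda a, b: float(np.dot(a, b))
svs = np.array([[1.0, 2.0], [-1.0, -0.5]])
coefs = np.array([0.8, -0.8])   # each entry is alpha_i * y_i
print(svm_decision(np.array([0.5, 1.0]), svs, coefs, bias=0.1, kernel=linear))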
KNN is a well-known statistical approach in pattern recognition. It is also known as one of the top-performing methods on the benchmark Reuters corpus (Yang & Liu, 1999). As an instance-based learning algorithm, KNN simply stores all of the training examples as the classifier and delays learning until the prediction phase. When the amount of training data is huge, considerable time is therefore spent during the classification process of KNN. Besides, the performance of KNN may be affected by noisy data (Srisawat, Phienthrakul, & Kijsirikul, 2006).
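This cost structure is visible directly in a plain kNN implementation (again a Python illustration under simplifying assumptions, with random vectors standing in for document features): there is no training step, and every prediction must scan the entire stored training set.

import numpy as np
from collections import Counter

def knn_predict(query, train_X, train_y, k=5):
    # No training phase: every prediction computes the distance from the
    # query to every stored example, which is why classification is slow
    # on large corpora.
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.integers(0, 3, size=100)
print(knn_predict(rng.normal(size=20), X, y, k=5))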
Researchers have long pursued the promise of harnessing multiple text classifiers to synthesize a more accurate classification procedure by combining the outputs of the contributing classifiers (Bennett, Dumais, & Horvitz, 2005). In this paper, we present a hybrid algorithm based on variable precision rough sets (VPRS) that combines the respective strengths of SVM and KNN in order to improve classification accuracy. The proposed method is a two-stage algorithm. First, by introducing VPRS theory into the support vector machine, a variable precision rough SVM (VPRSVM) is presented. The transformed feature space is partitioned by VPRSVM, where lower and upper approximations of each category are defined. Second, based on an analysis of the characteristics of boundary-region texts, a modified KNN algorithm, namely the restrictive k nearest neighbor (RKNN) classifier, is put forward; it is built on a reduced set of candidate classes and only needs to classify testing documents in the boundary region, which makes it both effective and efficient.
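The overall flow can be sketched as follows (a hedged Python illustration: the function names, the score callables, and the two thresholds standing in for the VPRS approximation regions are assumptions made here for clarity; the precise VPRSVM definitions are given later in the paper).

def two_level_classify(doc, category_scores, rknn, lower=0.8, upper=0.2):
    # Stage 1: score the document against every category's SVM. Scores at or
    # above `lower` place it in that category's lower approximation (accept
    # directly); scores between `upper` and `lower` leave it in the boundary
    # region as a candidate class.
    scores = {c: score_fn(doc) for c, score_fn in category_scores.items()}

    confident = [c for c, s in scores.items() if s >= lower]
    if confident:                              # lower approximation: done
        return max(confident, key=scores.get)

    candidates = [c for c, s in scores.items() if s >= upper]
    if not candidates:                         # nothing plausible: best SVM score
        return max(scores, key=scores.get)

    # Stage 2: the restrictive kNN votes only among the remaining candidates,
    # so only boundary-region documents pay the kNN cost.
    return rknn(doc, candidates)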
Since uncertainties in the labeling are taken into account, our
approach tries to provide a practical mechanism to deal with
real-world noisy text data. Analysis of the different approximation
* Corresponding author at: Department of Computer Science and Technology, Tongji University, Shanghai 201804, China. Tel.: +86 15900799568.
E-mail addresses: jx_wenli@yahoo.com.cn (W. Li), miaoduoqian@163.com (D. Miao), ken.wlwang@gmail.com (W. Wang).