Text classification based on SMO and fuzzy model
Mengqi Pei, Xing Wu*
School of Computer Engineering and Science
Shanghai University
Shanghai, China
xingwu@shu.edu.cn
Abstract—In this article we propose a text classification system using chi-value as the feature selection method and the SMO (sequential minimal optimization) algorithm as the classifier. In addition, we use a fuzzy model of fuzzy concepts to describe documents' classification labels and entropy to calculate the uncertainty of a document's classification result. Experimental results demonstrate that the proposed method reaches 87% or higher text classification accuracy.
Keywords—text classification; SMO; fuzzy model; fuzzy concept; entropy
I. INTRODUCTION
Nowadays, statistical learning methods have become the dominant approach in text categorization, because they involve fewer subjective factors than knowledge engineering methods. In addition, many statistical learning techniques have a solid theoretical foundation, well-defined evaluation standards, and good performance.
Statistical learning methods rely on effective feature extraction to achieve good learning results, so extracting effective features and avoiding noise interference is important for improving machine learning performance. One effective feature extraction method is chi-value, which compares the contribution of a given word between one category and the others. The method is widely adopted: recently, Xiuxia Chen proposed an automatic web music resource crawler system [1] using chi-value as the feature selection method, and Yunfei Qiu proposed an improved chi-value feature selection method [2]. The feature vector is mostly established with the TF-IDF method and can be described by an algebraic model named the Vector Space Model, in which each dimension represents a term and the value of a dimension is set to non-zero if the term appears in the document. After feature extraction, an algorithm is needed to classify the samples. According to Yiming Yang [3], the support vector machine (SVM) method based on the vector space model (VSM) works best in text classification.
SVM is a popular classification technology in data mining due to its simple structure and good classification performance. Huang Yuqing proposed an SVM with a mixed kernel function [4], and Yuanchao Liu proposed an abstract sentence classification method for scientific papers based on a transductive SVM [5]. In 1998, an optimized SVM training method named sequential minimal optimization (SMO) was put forward by Platt John [6]. It has become the fastest quadratic programming optimization algorithm and performs especially well for linear SVMs and sparse data. In this paper, we use SMO as our classification algorithm.
The result of classification is usually described as a crisp set, which dichotomizes testing data into two groups: members and nonmembers. However, many classification concepts do not exhibit this characteristic: an input document does not always belong exclusively to one category. A fuzzy set differs from a crisp set in that its elements have degrees of membership. In this article, we use a fuzzy model of fuzzy concepts based on the fuzzy semantic models proposed by Yingxu Wang [7] to describe our classification results.
Entropy is a measure of the unpredictability of a random variable. In this paper, we use entropy as the metric of the uncertainty of a document's classification confidence.
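As a minimal sketch of this idea (the paper does not give code), the Shannon entropy of a document's class-membership distribution could be computed as follows; the function name and input format are illustrative assumptions:

```python
import math

def entropy(probabilities):
    """Shannon entropy (in bits) of a class-probability distribution.

    A higher value means the classification result is less certain:
    a document assigned entirely to one class has entropy 0, while a
    document split evenly across classes has maximal entropy.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

For example, a document classified with certainty (`[1.0]`) yields entropy 0, while an even split over two categories (`[0.5, 0.5]`) yields entropy 1 bit.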
II. PROPOSED WORK
A. Text classification system based on SMO
In this paper, we build a text classification application with a complete pipeline. We use chi-value to extract features and TF-IDF as the feature weight to describe documents. Each document is represented as a feature vector using the vector space model, and the classification algorithm we chose is SMO, an optimized training algorithm for SVMs.
First, all training documents need to be preprocessed by
segmenting into words and filtering out the stop words.
After preprocessing we obtain the individual words from all documents; we can then select a subset of those words as the keyword set using chi-value, which is defined as follows:
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
where t indicates a term, c indicates a category, N is the number of training documents, A is the number of documents that both contain t and belong to c, B is the number of documents that contain t but do not belong to c, C is the number of documents that do not contain t but belong to c, and D is the number of documents that neither contain t nor belong to c.
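As a minimal sketch of this definition (assuming the four counts A, B, C, and D have already been tallied from the training corpus), the chi-value of a term–category pair could be computed as:

```python
def chi_value(A, B, C, D):
    """Chi-square statistic of a term t with respect to a category c.

    A: documents containing t and belonging to c
    B: documents containing t but not belonging to c
    C: documents not containing t but belonging to c
    D: documents neither containing t nor belonging to c
    """
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    # A zero denominator means t or c is absent from the corpus.
    return numerator / denominator if denominator else 0.0
```

A term that is independent of the category (e.g. A = B = C = D) scores 0, while a term concentrated in one category scores high, which is what makes the statistic usable for ranking candidate keywords.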
Next, the relevant terms from the documents can be represented by the VSM, whose elements are TF-IDF weights. The TF-IDF weight is defined as follows:
W_{ij} = \frac{tf_{ij}}{\max_i tf_i} \times \log\frac{N}{n_i}
where W_{ij} indicates the TF-IDF weight of term i toward category j, tf_{ij} is the term frequency of i in j, and max_i tf_i is the maximum term frequency of i over all categories. N is the total number of training documents and n_i is the number of documents in which term i appears.
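The weight above can be sketched directly from its definition; the function name and argument order are illustrative assumptions, with the counts assumed to be precomputed:

```python
import math

def tfidf_weight(tf_ij, max_tf_i, N, n_i):
    """TF-IDF weight of term i toward category j, per the formula above.

    tf_ij:    term frequency of i in category j
    max_tf_i: maximum term frequency of i over all categories
    N:        total number of training documents
    n_i:      number of documents in which term i appears
    """
    # Normalized term frequency times inverse document frequency.
    return (tf_ij / max_tf_i) * math.log(N / n_i)
```

Rare terms (small n_i) receive a large inverse-document-frequency factor, so they dominate the feature vector, while terms appearing in every document get weight log(N/N) = 0.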
____________________________________
978-1-4799-4419-4 /14/$31.00 ©2014 IEEE