CUSBoost: Cluster-based Under-sampling with
Boosting for Imbalanced Classification
Farshid Rayhan, Sajid Ahmed, Asif Mahbub, Md. Rafsan Jani,
Swakkhar Shatabda, and Dewan Md. Farid
Department of Computer Science & Engineering, United International University, Bangladesh
Email: dewanfarid@cse.uiu.ac.bd
Abstract—Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances often represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling of the majority class and over-sampling of the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with the boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm against state-of-the-art ensemble methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
Keywords—Boosting; Class imbalance; Clustering; Ensemble classifier; Sampling; RUSBoost
I. INTRODUCTION
In machine learning (ML) for data mining (DM) applications, supervised learning (or classification) is the process of identifying new/unknown instances using classifiers (or classification algorithms) built from a group of instances with known class membership (training data) [1], [2], [3], [4]. Real-world datasets are often multi-class, high-dimensional and class-imbalanced, which degrades the classification accuracy of many ML algorithms. Therefore, a number of ensemble classifiers with sampling techniques have been proposed in the last decade for classifying binary-class, low-dimensional imbalanced data [5], [6], [7]. Ensemble classifiers use multiple ML algorithms to improve on the performance of individual classifiers, combining multiple hypotheses to form a more advanced hypothesis [3]. The sampling methods use under-sampling (removing majority class instances) and over-sampling (adding minority class instances) techniques to alter the original class distribution of imbalanced data. Under-sampling methods that randomly sample the majority class might suffer from the loss of potentially useful training instances. On the other hand, over-sampling with replacement does not significantly improve minority class recognition and increases the likelihood of overfitting [8]. The sketch below illustrates these two basic strategies.
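As a minimal, illustrative sketch (not the paper's method), the following Python code shows both strategies side by side; the function names and the NumPy-based implementation are our own assumptions.

# Minimal sketch of the two basic sampling strategies discussed above.
import numpy as np

def random_under_sample(X_maj, X_min, seed=0):
    # Keep only as many majority instances as there are minority ones;
    # the discarded rows are the potential information loss.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    return X_maj[idx]

def random_over_sample(X_maj, X_min, seed=0):
    # Duplicate minority instances (sampling with replacement) until the
    # classes balance; exact copies are what raise the overfitting risk.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_min), size=len(X_maj), replace=True)
    return X_min[idx]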
In real-world class imbalanced datasets, the minority class instances are outnumbered by the majority class instances. However, the minority class instances often represent the concept of greater interest [9]. Traditional ML for DM algorithms, such as the decision tree (DT) [1], [3], the naïve Bayes (NB) classifier [2], and k-nearest neighbors (kNN) [1], build classification models that maximise the overall classification rate but ignore the minority class. The most widely adopted methods for dealing with class imbalance problems are sampling techniques, ensemble methods, and cost-sensitive learning methods. The sampling techniques (under-sampling and over-sampling) either remove majority class instances from the imbalanced data or add minority class instances to it to obtain balanced data. Ensemble methods such as Bagging and Boosting are also widely used for classifying imbalanced data; typically, these methods apply a sampling technique in each iteration. Cost-sensitive learning is also applied to class imbalance problems: it assigns different misclassification costs to different classes, usually a high cost to the minority class and a low cost to the majority class. However, the classification results of cost-sensitive learning methods are not stable, as it is difficult to obtain accurate misclassification costs, and different misclassification costs may lead to different induced models. A minimal illustration of cost-sensitive learning via class weights follows.
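The sketch below assigns a higher weight to the minority class using scikit-learn; the 10:1 cost ratio and the synthetic dataset are illustrative assumptions, not values from the paper.

# Cost-sensitive learning via class weights (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 95% majority (0), 5% minority (1).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Penalise minority-class errors 10 times more than majority-class
# errors; in practice the true misclassification costs are unknown,
# which is the source of the instability discussed above.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)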
The methods for dealing with class imbalance problems are divided into two categories: (a) external methods and (b) internal methods. External methods, also known as data balancing methods, preprocess the imbalanced data to obtain balanced data. Internal methods modify existing learning algorithms to reduce their sensitivity to class imbalance when learning from imbalanced data. In this paper, we present a new clustering-based under-sampling approach with boosting (AdaBoost), called the CUSBoost algorithm. We divide the imbalanced dataset into two parts: majority class instances and minority class instances. Then, we cluster the majority class instances into several clusters using the k-means clustering algorithm and select majority class instances from each cluster to form a balanced dataset, in which the numbers of majority and minority class instances are almost equal. Clustering groups the majority class instances in such a way that instances in the same cluster are more similar to each other than to those in other clusters. So, instead of randomly removing majority class instances, we use this clustering technique to select the majority class instances to retain; a sketch of this sampling step is given below.
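The following sketch illustrates the clustering-based under-sampling step, assuming scikit-learn's KMeans; the function name, the number of clusters k, and the proportional per-cluster selection are our illustrative assumptions rather than details fixed by the paper.

# Illustrative sketch of clustering-based under-sampling of the
# majority class (assumptions noted in the text above).
import numpy as np
from sklearn.cluster import KMeans

def cluster_under_sample(X_maj, n_keep, k=5, seed=0):
    # Group the majority class into k clusters so that similar
    # instances fall into the same cluster.
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_maj)
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        # Draw from every cluster in proportion to its size, so all
        # regions of the majority class remain represented.
        n_c = max(1, round(len(members) * n_keep / len(X_maj)))
        picked.extend(rng.choice(members, size=min(n_c, len(members)),
                                 replace=False))
    return X_maj[np.array(picked)]

# Usage: keep about as many majority instances as there are minority
# instances, then train on the (almost) balanced set:
# X_bal = np.vstack([cluster_under_sample(X_maj, len(X_min)), X_min])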
CUSBoost combines sampling and boosting to form an efficient and effective algorithm for class imbalance learning. We tested the performance of the CUSBoost algorithm