Handling Imbalanced Datasets with Threshold SMOTE and Attribute Bagging

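"This research paper explores how to apply the threshold SMOTE algorithm together with attribute bagging to handle imbalanced datasets. The authors propose three hypernetwork-based models, namely the ensemble of cost-sensitive hypernetworks (EN-CS-HN), the ensemble of cost-sensitive hypernetworks with under-sampling (EN-CS-HN-UNDE), and the ensemble of cost-sensitive hypernetworks with the synthetic minority over-sampling technique (EN-CS-HN-SMOTE), to address the bias that traditional machine learning algorithms show on class-imbalanced problems. Experiments on ten imbalanced datasets verify the effectiveness of these methods."

Imbalanced datasets are a common challenge in practical machine learning tasks. A dataset is imbalanced when the number of samples differs greatly between classes, for example when the positive class in a binary problem has far fewer samples than the negative class. During training, the model then leans towards predicting the more numerous class (the majority class) and neglects the minority class, which lowers its predictive accuracy and generalization ability.

A hypernetwork is a probabilistic graphical model inspired by biomolecular networks that can discover higher-order correlations among multiple attributes. Like many traditional machine learning algorithms, however, hypernetworks tend to favor the majority class on imbalanced datasets and therefore predict the minority class poorly.

To address this problem, the paper proposes combining the threshold SMOTE algorithm with attribute bagging. Threshold SMOTE is a variant of SMOTE (Synthetic Minority Over-sampling Technique) that increases the number of minority samples by creating synthetic samples while aiming to avoid overfitting. Attribute bagging builds subsets by randomly selecting part of the features, trains one model on each subset, and finally combines these models to improve stability and generalization.

All three hypernetwork models proposed in the paper incorporate cost-sensitive learning, meaning that the model adjusts its weights during training according to the cost of misclassifying each class, which reduces the neglect of the minority class. EN-CS-HN introduces the cost-sensitive mechanism; EN-CS-HN-UNDE adds under-sampling, removing majority samples to balance the data distribution; EN-CS-HN-SMOTE uses SMOTE over-sampling to augment the minority class.

The three methods were evaluated on ten imbalanced datasets. The experimental results indicate that the models markedly improve recognition of the minority class while maintaining overall classification performance, which supports the effectiveness of the proposed methods and is of both theoretical and practical significance for imbalanced-data problems in real applications.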

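The attribute-bagging idea summarized above (random feature subsets, one model per subset, a combined vote) can be sketched as follows. This is only a minimal illustration and not the paper's implementation: the nearest-centroid base learner, all function names, and the toy data are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def fit_centroids(X, y, features):
    """Nearest-centroid base learner restricted to a feature subset (assumed for illustration)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, f in enumerate(features):
            acc[i] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroids(centroids, features, row):
    """Return the label whose centroid is closest in the selected feature subspace."""
    def squared_distance(centroid):
        return sum((row[f] - centroid[i]) ** 2 for i, f in enumerate(features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def attribute_bagging_fit(X, y, n_models=5, subset_size=2, seed=0):
    """Train one base learner per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_models):
        features = sorted(rng.sample(range(n_features), subset_size))
        ensemble.append((features, fit_centroids(X, y, features)))
    return ensemble


def attribute_bagging_predict(ensemble, row):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroids(c, f, row) for f, c in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three features, two well-separated classes.
X = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1], [0.2, 0.1, 0.0],
     [1.0, 0.9, 1.1], [0.9, 1.0, 1.0]]
y = [0, 0, 0, 1, 1]

ensemble = attribute_bagging_fit(X, y)
print(attribute_bagging_predict(ensemble, [0.05, 0.05, 0.1]))
print(attribute_bagging_predict(ensemble, [1.0, 1.0, 1.0]))
```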
Ensemble of Cost-Sensitive Hypernetworks for Class-Imbalance Learning

Jin Wang, Ping-li Huang, Kai-wei Sun, Bao-lin Cao, Rui Zhao

Chongqing Key Laboratory of Computational Intelligence

Chongqing University of Posts and Telecommunications

Chongqing 400065, PR China

wangjin@cqupt.edu.cn

Abstract—A hypernetwork is a probabilistic graphical model of learning and memory inspired by biomolecular networks, which is very useful for discovering higher-order correlations among multiple attributes. However, like many traditional machine learning algorithms, hypernetworks may be biased towards the majority class, thus producing poor predictive accuracy on the minority class when learning from imbalanced datasets. In this paper, three hypernetwork-based models are proposed: ensemble of cost-sensitive hypernetworks (EN-CS-HN), ensemble of cost-sensitive hypernetworks with under-sampling (EN-CS-HN-UNDE), and ensemble of cost-sensitive hypernetworks with synthetic minority over-sampling technique (EN-CS-HN-SMOTE). To examine the performance of the proposed schemes, we conduct experiments on ten imbalanced datasets collected from the UCI Machine Learning Repository, wherein the proposed methods are compared with various state-of-the-art approaches using three metrics: G-Mean, F-Measure, and area under the receiver operating characteristic curve (AUC-ROC). Experimental results show that the proposed methods are able to surpass or match the previously known best algorithms on most of the ten datasets.

Keywords-imbalanced classification; hypernetworks; ensemble learning; cost-sensitive learning; under-sampling; SMOTE

I. INTRODUCTION

Imbalanced data classification is one of the most challenging problems in knowledge discovery and real-world data mining [1]. It refers to the classification of datasets in which some classes have far fewer instances than others. We assume that the positive class is the minority class and the negative class is the majority class. Class imbalance has a serious impact on classifier performance. When learning from imbalanced datasets, traditional machine learning algorithms usually achieve high classification accuracy on the negative class while obtaining poor results on the positive class.
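To make the point about misleading accuracy concrete, the sketch below computes plain accuracy alongside G-Mean and F-Measure on a toy prediction. It uses the standard textbook definitions (G-Mean as the geometric mean of the true positive and true negative rates, F-Measure with beta = 1); the excerpt does not state the exact formulas the authors used, so these definitions, the function names, and the data are assumptions.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def f_measure(tp, fp, tn, fn):
    """Harmonic mean of precision and recall (beta = 1 is assumed here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy case: 5 positives, 95 negatives, only one positive detected.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 99

counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[2]) / len(y_true)
print(f"accuracy={accuracy:.2f}  G-Mean={g_mean(*counts):.3f}  F-Measure={f_measure(*counts):.3f}")
```

In this toy case accuracy is 0.96 even though four of the five positive samples are missed, while G-Mean (about 0.45) and F-Measure (about 0.33) expose the weak minority-class performance.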

Over the past few years, several approaches have been proposed for dealing with imbalanced data classification [1, 2]. The existing methods fall into two categories: data-oriented strategies and algorithm-related approaches. At the data level, re-sampling strategies such as under-sampling [3] and over-sampling [4] have been extensively explored. Algorithm-related approaches include ensemble learning [5, 6], cost-sensitive learning [7], and so on. However, none of these methods alone can address the class imbalance problem effectively. For example, the under-sampling strategy may lead to information loss since many potentially useful samples are discarded. The over-sampling strategy has the disadvantages of long training time and overfitting when many synthetic samples are added. In most cases of cost-sensitive learning, the misclassification costs are difficult to define.
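The two re-sampling strategies mentioned above can be illustrated with a short sketch. It shows random under-sampling and a simplified SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. The function names, the parameter `k`, and the toy data are assumptions for illustration; this is neither the paper's procedure nor a full SMOTE implementation.

```python
import random


def random_under_sample(majority, n_keep, rng):
    """Keep a random subset of the majority class; the discarded rest is the information loss."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create synthetic minority points on the segment between a sample and a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 20 majority points and 4 minority points in two dimensions.
rng = random.Random(42)
majority = [[float(i), float(i % 3)] for i in range(20)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5], [12.0, 11.0]]

kept = random_under_sample(majority, len(minority), rng)
synthetic = smote_like(minority, len(majority) - len(minority), k=2, rng=rng)
print(len(kept), "majority samples kept;", len(synthetic), "synthetic minority samples created")
```

Either route yields a balanced training set here (4 vs. 4 after under-sampling, 20 vs. 20 after over-sampling), at the price of discarded majority samples in the first case and a larger, partly synthetic training set in the second.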

Hypernetworks are a bio-inspired probabilistic graphical model based on undirected graphs [8]. Generally speaking, a hypernetwork is a hypergraph whose hyperedges are weighted. Unlike an edge in an ordinary graph, which can connect at most two vertices, a hyperedge in a hypergraph can connect more than two vertices. In this way, higher-order correlations among vertices are explicitly represented in hyperedges. To date, hypernetworks have been successfully used to solve various machine learning problems [8, 9].
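As a rough illustration of the structure just described, a hypernetwork can be held in memory as a collection of weighted hyperedges, each spanning an arbitrary set of vertices. The class and method names below are hypothetical, and the sketch covers only the data structure, not the learning procedure of [8].

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    vertices: frozenset  # may contain more than two vertices
    weight: float


@dataclass
class Hypernetwork:
    hyperedges: list = field(default_factory=list)

    def add(self, vertices, weight):
        """Add a weighted hyperedge over any number of vertices."""
        self.hyperedges.append(Hyperedge(frozenset(vertices), weight))

    def order_counts(self):
        """Map hyperedge order (number of vertices) to how many hyperedges have it."""
        counts = {}
        for edge in self.hyperedges:
            order = len(edge.vertices)
            counts[order] = counts.get(order, 0) + 1
        return counts

    def incident_weight(self, vertex):
        """Total weight of the hyperedges that contain the given vertex."""
        return sum(e.weight for e in self.hyperedges if vertex in e.vertices)


# Hypothetical example: vertices stand for attributes x1..x4.
net = Hypernetwork()
net.add({"x1", "x2"}, 1.0)          # an ordinary edge is the order-2 special case
net.add({"x1", "x2", "x3"}, 2.5)    # an order-3 hyperedge encodes a three-way correlation
net.add({"x2", "x3", "x4"}, 0.5)

print(net.order_counts())
print(net.incident_weight("x2"))
```

Storing vertices as a `frozenset` keeps each hyperedge hashable and order-independent, which matches the undirected nature of the model.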

Hypernetworks assume that the class distribution of a dataset is balanced. During hypernetwork learning, hyperedges that are critical for differentiating classes are copied and added, and hyperedges with poor distinguishing ability are discarded, aiming to extract hyperedges that cover as many samples as possible. However, in the class-imbalance learning setting, most samples in the minority class are usually treated as noise. Therefore, the number of hyperedges corresponding to the majority class significantly exceeds that of hyperedges corresponding to the minority class. As a result, most minority samples are misclassified by a traditional hypernetwork.

In this paper, a modified hypernetwork is proposed to deal with the class imbalance problem in three ways:

1. Building a cost-sensitive hypernetwork model. By assigning a higher misclassification cost to false negatives than to false positives, hypernetworks are driven to focus on the learning of the minority class, which develops the original hypernetworks into a cost-sensitive one.

2. Introducing an ensemble strategy to the cost-sensitive hypernetworks. In many cost-sensitive learning cases, the actual misclassification cost information is generally unavailable. In this paper, a genetic algorithm (GA)-based

This work was partially supported by the National Natural Science Foundation of China (61203308, 61075019), and the Natural Science Foundation Project of CQ CSTC under Grant No. cstc2013jcyjA40063, No. cstc2012jjA40034.

2013 IEEE International Conference on Systems, Man, and Cybernetics

DOI 10.1109/SMC.2013.324
