Borderline-SMOTE: A New Over-Sampling Method for Learning from Imbalanced Data Sets

"这篇论文介绍了在不平衡数据集上进行挖掘的重要性以及其在数据挖掘领域的广泛应用。面对不平衡数据集的问题，文章提出了合成少数类过采样技术（SMOTE）及其两个改进版：borderline-SMOTE1 和 borderline-SMOTE2。这三种方法主要针对处理分类任务中的类别不平衡问题，特别是提升少数类别的样本数量。实验表明，borderline-SMOTE方法在提高少数类别的真正例率（TP rate）和F值方面优于SMOTE和随机过采样方法。" 在数据挖掘领域，不平衡数据集是一个普遍存在的问题，它指的是在分类任务中，一个或多个类别的样本数量远少于其他类别，这可能导致学习模型偏向于多数类，忽视了少数类的信息。例如，在医疗诊断系统中，正常病例可能远多于异常病例，或者在信用卡欺诈检测中，欺诈交易占比极小。这样的不平衡可能导致模型的性能下降，因为它可能会错误地将大多数样本预测为数量较多的类别。 SMOTE（Synthetic Minority Over-sampling Technique）是一种用于解决这个问题的过采样方法。它通过在少数类样本之间创建新的合成样本来平衡数据集。SMOTE算法会找到一个少数类样本的k个最近邻，然后随机选择其中一个近邻与原样本之间的线性组合来生成新的合成样本，这样增加了少数类样本的数量，同时保持了数据分布的局部结构。然而，SMOTE并不总是能够优化边界区域的样本，即那些位于类别决策边界附近的少数类样本。为此，论文提出了borderline-SMOTE方法，分为borderline-SMOTE1和borderline-SMOTE2两种变体。这两种方法更专注于选择边界附近的少数类样本进行过采样，以更好地捕捉类别间的复杂关系，从而改善模型对这些关键样本的识别能力。实验结果表明，borderline-SMOTE方法在处理不平衡数据集时，能够提高模型对少数类的敏感性和精确性，从而提升整体的分类性能。特别是在提高真正例率（TP rate）和F值这两个评价指标上，borderline-SMOTE1和borderline-SMOTE2表现优于传统的SMOTE和简单的随机过采样。总结来说，SMOTE及其衍生的borderline-SMOTE方法是解决数据不平衡问题的有效工具，它们通过对少数类样本的智能增加，提高了机器学习模型在处理不平衡数据集时的分类效果。对于那些需要平衡不同类别权重以确保模型公平性和准确性的应用，这些技术具有极大的价值。

D.S. Huang, X.-P. Zhang, G.-B. Huang (Eds.): ICIC 2005, Part I, LNCS 3644, pp. 878–887, 2005.

Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

Hui Han, Wen-Yuan Wang, and Bing-Huan Mao

Department of Automation, Tsinghua University, Beijing 100084, P. R. China

hanh01@mails.tsinghua.edu.cn

wwy-dau@mail.tsinghua.edu.cn

Department of Statistics, Central University of Finance and Economics, Beijing 100081, P. R. China

maobinghuan@yahoo.com

Abstract. In recent years, mining with imbalanced data sets has received more and more attention in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem. Synthetic minority over-sampling technique (SMOTE) is one of the over-sampling methods addressing this problem. Based on the SMOTE method, this paper presents two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. For the minority class, experiments show that our approaches achieve better TP rate and F-value than SMOTE and random over-sampling methods.
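The abstract states only that minority examples near the borderline are over-sampled. One way to make "near the borderline" concrete is the rule commonly described for Borderline-SMOTE: a minority example is a borderline ("DANGER") example when at least half, but not all, of its m nearest neighbors in the whole training set belong to the majority class. The sketch below assumes that rule; the function name `danger_set`, the parameter `m`, and the use of Euclidean distance are illustrative choices, not details taken from this excerpt.

```python
import math


def danger_set(minority, majority, m=5):
    """Return the minority samples that lie near the class borderline.

    A sample qualifies when m/2 <= (majority neighbours among its m nearest
    neighbours) < m. If all m neighbours are majority, the sample is treated
    as noise and skipped; if fewer than half are, it is considered safe.
    """
    # Label 0 marks minority points, label 1 marks majority points.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    danger = []
    for i, point in enumerate(minority):
        # Minority points come first in `labelled`, so index i is the point itself.
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: math.dist(point, item[0]))
        m_eff = min(m, len(others))
        n_major = sum(label for _, label in others[:m_eff])
        if m_eff / 2 <= n_major < m_eff:
            danger.append(point)
    return danger


minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 0.0), (10.0, 0.0)]
majority = [(5.5, 0.0), (5.0, 0.6), (10.0, 0.5), (10.0, -0.5), (10.5, 0.0), (9.5, 0.0)]
borderline = danger_set(minority, majority, m=3)
```

In this toy data, the four clustered minority points are safe, the point at (10.0, 0.0) is surrounded entirely by majority points and is treated as noise, and only (5.0, 0.0) is selected. Synthetic examples would then be generated from the selected points only.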

1 Introduction

There may be two kinds of imbalance in a data set. One is between-class imbalance, in which some classes have many more examples than others [1]. The other is within-class imbalance, in which some subsets of one class have far fewer examples than other subsets of the same class [2]. By convention, in imbalanced data sets, we call the classes having more examples the majority classes and the ones having fewer examples the minority classes.

The problem of imbalance has received more and more emphasis in recent years. Imbalanced data sets exist in many real-world domains, such as spotting unreliable telecommunication customers [3], detection of oil spills in satellite radar images [4], learning word pronunciations [5], text classification [6], detection of fraudulent telephone calls [7], information retrieval and filtering tasks [8], and so on. In these domains, what we are really interested in is the minority class rather than the majority class. Thus, we need fairly high prediction accuracy for the minority class. However, traditional data mining algorithms behave undesirably on imbalanced data sets, as the distribution of the data is not taken into consideration when these algorithms are designed.
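A small worked example shows why overall accuracy can hide poor minority-class prediction. The sketch below computes accuracy, TP rate, and F-value for a classifier that always predicts the majority class. The helper name `minority_metrics` is hypothetical, and the F-value is computed as the harmonic mean of precision and TP rate (equal weighting), which is an assumption made here rather than a definition taken from this excerpt.

```python
def minority_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, TP rate, F-value) with `positive` as the minority label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # TP rate (recall): share of true minority examples that were found.
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    # Precision: share of predicted minority examples that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tp_rate
    f_value = 2 * precision * tp_rate / denom if denom else 0.0
    return accuracy, tp_rate, f_value


# 5 minority examples (label 1) among 100, and a model that always predicts 0.
y_true = [1] * 5 + [0] * 95
always_majority = [0] * 100
print(minority_metrics(y_true, always_majority))  # (0.95, 0.0, 0.0)
```

The classifier is 95% accurate yet finds none of the minority examples, which is why the minority-class TP rate and F-value are the more informative metrics here.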

The structure of this paper is organized as follows. Section 2 gives a brief introduction to the recent developments in the domains of imbalanced data sets. Section 3
