Complexity Analysis of Imbalanced Datasets and a Novel Recovery Strategy


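This paper, "A data complexity analysis on imbalanced datasets and an alternative imbalance recovering strategy", was written by Cheng G. Weng and Josiah Poon of the School of Information Technologies at the University of Sydney. It addresses the imbalanced dataset problem, which is common in domains such as web page search and scam site detection. The authors propose a new re-sampling approach and demonstrate, through a concrete implementation, that it is effective compared with conventional approaches. They also analyze data complexity to better understand the nature of imbalanced datasets, and this analysis likewise shows promise.

Imbalanced datasets are a common challenge in machine learning and data analysis. In such datasets, one class has far more examples than another, so a model tends to favor the majority class in its predictions and neglect the minority class. In fraud detection, for example, legitimate transactions may far outnumber fraudulent ones, which makes it hard for a learned model to identify the rare fraudulent cases correctly.

The alternative re-sampling strategy proposed in the paper aims to mitigate this imbalance. Re-sampling is a common remedy and includes over-sampling (adding minority-class examples) and under-sampling (removing majority-class examples). Conventional re-sampling has drawbacks, however: over-sampling can amplify noise, and under-sampling can discard important information. The authors' contribution is a new balancing strategy that adjusts how sampling is done so as to preserve the diversity of the data without introducing extra noise or losing key information, thereby improving the model's ability to recognize the minority class.

The paper also analyzes data complexity, which is key to understanding imbalanced datasets. Data complexity analysis helps researchers identify patterns, structure, and potential non-linear relationships in the data, which matters for designing effective classification algorithms. Such analysis gives a deeper understanding of why imbalanced data makes learning difficult and of how to adapt models to these characteristics.

The paper offers a promising solution of practical value for handling imbalanced datasets and provides a theoretical basis for future research. Its contributions are a new re-sampling method and the use of data complexity analysis to improve our understanding of imbalanced datasets. These methods and insights are useful guidance for improving machine learning performance on real-world problems, especially those where the minority class is the important one.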


A data complexity analysis on imbalanced datasets and an alternative imbalance recovering strategy

Cheng G. Weng

The University of Sydney

School of Information Technologies

Sydney NSW 2006, Australia

cheng@it.usyd.edu.au

Josiah Poon

The University of Sydney

School of Information Technologies

Sydney NSW 2006, Australia

josiah@it.usyd.edu.au

Abstract

The imbalanced dataset problem arises in many domains, such as web page search and scam site detection. In this paper, we propose an alternative re-sampling approach to deal with imbalanced datasets. We demonstrate this approach with a concrete implementation, and it has shown promising results when compared to other standard approaches that deal with imbalanced datasets. We have also performed an analysis of the data complexity to help understand imbalanced datasets, which has also been shown to be a promising approach.

1. Introduction

Learning from imbalanced datasets is an important research topic because it has significant economic impact on society. A lot of everyday effort is spent on resolving minority but important situations, e.g. detecting scam sites, detecting intrusions into computer networks, detecting fraud in insurance claims, and identifying cancer-contributing genes.

The problem with learning from an imbalanced dataset is that conventional machine learners try to maximize overall accuracy on the assumption that the future distribution will be the same as that of the training data. This assumption often leads to a learner that performs poorly on the rare class. Many popular machine learning algorithms have been tried to see how well they cope with the imbalanced situation, e.g. C4.5 [3], Support Vector Machine (SVM) [1], kNN [18], and random forests [5], but none of them has been found to be superior to the others.
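To make the accuracy issue concrete, here is a minimal sketch in Python. The class counts (990 negative, 10 positive) and the always-majority "learner" are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (overall accuracy, recall on the positive/rare class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # Predictions made for the examples whose true label is the rare class.
    preds_on_rare = [p for t, p in zip(y_true, y_pred) if t == positive]
    accuracy = correct / len(y_true)
    recall = sum(p == positive for p in preds_on_rare) / len(preds_on_rare)
    return accuracy, recall


# Hypothetical data: 990 negative examples and 10 positive (rare) examples.
y_true = [0] * 990 + [1] * 10
# A degenerate "learner" that always predicts the majority class.
y_pred = [0] * 1000

accuracy, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, rare-class recall={recall:.3f}")
```

Under these assumptions the predictor is right on 99% of the examples while finding none of the rare ones, which is why overall accuracy alone is a poor objective here.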

Imbalance resolving strategies can be categorized into three types: re-sampling, cost-sensitive learning, and adjusting algorithms to bias them toward the rare class. In the re-sampling strategy, either instances are removed from the negative/majority class, termed under-sampling, or we over-sample by replicating the positive/minority examples in the training set to inform the classifier of the importance of these examples from the minority class. The re-sampling can be a random or heuristic-driven process. There are also approaches that introduce artificial points into the space bounded by the existing examples. The main drawback of the under-sampling approach is that valuable information disappears with the removed instances. The downside of an over-sampling approach is the increased computational time needed to process more data, and, if artificial points are introduced, it may lead to greater noise or overfitting. Despite these weaknesses, both re-sampling strategies have been shown to be helpful. In cost-sensitive learning, we apply different weights to the training examples to inform the classifier about the cost of misclassifying different examples, so that the classifier pays more attention to the minority target class. In this paper, we propose an alternative imbalance resolving strategy that extends the training examples to a richer space. This approach overcomes the drawbacks of most re-sampling approaches.
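The standard strategies described above can be sketched in a few lines of Python. This is an illustration of plain random under-sampling, random over-sampling by replication, and per-class weighting for a two-class problem; it is not the alternative strategy the authors propose. The function names, the toy data, and the inverse-frequency weighting formula are assumptions made for this sketch.

```python
import random
from collections import Counter


def random_under_sample(X, y, majority, rng):
    """Randomly drop majority-class examples until the two classes are the same size."""
    maj = [i for i, label in enumerate(y) if label == majority]
    rare = [i for i, label in enumerate(y) if label != majority]
    keep = rng.sample(maj, len(rare)) + rare
    return [X[i] for i in keep], [y[i] for i in keep]


def random_over_sample(X, y, minority, rng):
    """Randomly replicate minority-class examples until the two classes are the same size."""
    rare = [i for i, label in enumerate(y) if label == minority]
    maj = [i for i, label in enumerate(y) if label != minority]
    extra = [rng.choice(rare) for _ in range(len(maj) - len(rare))]
    keep = maj + rare + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def inverse_frequency_weights(y):
    """Per-class weights proportional to 1 / class frequency (an illustrative choice)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * c) for label, c in counts.items()}


# Hypothetical toy data: 95 negative and 5 positive one-dimensional examples.
rng = random.Random(0)
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_under, y_under = random_under_sample(X, y, majority=0, rng=rng)
X_over, y_over = random_over_sample(X, y, minority=1, rng=rng)
weights = inverse_frequency_weights(y)

print(Counter(y_under))
print(Counter(y_over))
print(weights)
```

The sketch makes the trade-offs visible: under-sampling keeps only 10 of the 100 examples, over-sampling grows the training set to 190 examples with many exact duplicates, and weighting leaves the data untouched but requires a learner that accepts per-class or per-example weights.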

Not every imbalanced dataset poses a problem for learning. Therefore, to understand more about imbalanced datasets, we have also investigated the relationship between data complexity and the performance of different imbalance strategies.
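This excerpt does not say which data complexity measures are used. As one example of what such a measure can look like, the sketch below computes Fisher's discriminant ratio, a commonly used measure of class overlap, for a two-class dataset. The choice of measure, the function name, and the toy data are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(X, y):
    """Largest per-feature value of (mu_a - mu_b)^2 / (var_a + var_b) for two classes."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = [x for x, label in zip(X, y) if label == labels[0]]
    b = [x for x, label in zip(X, y) if label == labels[1]]
    best = 0.0
    for j in range(len(X[0])):
        col_a = [x[j] for x in a]
        col_b = [x[j] for x in b]
        denom = pvariance(col_a) + pvariance(col_b)
        if denom > 0:  # skip features that are constant within both classes
            best = max(best, (mean(col_a) - mean(col_b)) ** 2 / denom)
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0, 5], [1, 3], [2, 4], [10, 4], [11, 5], [12, 3]]
y = [0, 0, 0, 1, 1, 1]

f1 = fisher_discriminant_ratio(X, y)
print(f"Fisher's discriminant ratio: {f1:.2f}")
```

A larger value means that at least one feature separates the two classes well, so the dataset is easier in this sense regardless of how imbalanced it is. A value near zero indicates heavy overlap on every individual feature.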

The next section discusses related work on imbalanced datasets, followed by a section on data complexity. We describe our proposed alternative method in section 4, and then our experimental setup. The results and evaluation are presented in section 6. Lastly, we conclude with some future work.

2. Related Work

[7] applied two strategies to solve the data imbalance problem in nosocomial infection, namely, re-sampling and an asymmetrical margin SVM. Their re-sampling strategy

Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings) (WI'06)
