A data complexity analysis on imbalanced datasets and an alternative imbalance
recovering strategy
Cheng G. Weng
The University of Sydney
School of Information Technologies
Sydney NSW 2006, Australia
cheng@it.usyd.edu.au
Josiah Poon
The University of Sydney
School of Information Technologies
Sydney NSW 2006, Australia
josiah@it.usyd.edu.au
Abstract
The imbalanced dataset problem arises in many domains,
such as web page search and scam site detection. In this
paper, we propose an alternative re-sampling approach to deal
with imbalanced datasets. We demonstrate this approach
with a concrete implementation, which has shown promising
results when compared to other standard approaches
that deal with imbalanced datasets. We have also performed
an analysis of the data complexity to help understand
imbalanced datasets, which has also been shown to be a
promising approach.
1. Introduction
Learning from imbalanced datasets is an important research
topic because it has a significant economic impact on
society. Much everyday effort is spent on resolving rare
but important situations, e.g. detecting scam sites,
detecting intrusions into computer networks, detecting fraud
in insurance claims, and identifying cancer-contributing genes.
The problem with learning from an imbalanced dataset is
that conventional machine learners try to maximize
the overall accuracy on the assumption that the future
distribution will be the same as that of the training data.
This assumption often leads to a learner that performs poorly
on the rare class. Many popular machine learning algorithms
have been tried to see how well they cope with the imbalanced
situation, e.g. C4.5 [3], Support Vector Machine (SVM) [1],
kNN [18], random forests [5], but none of them has been found
to be superior to the others.
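To illustrate why maximizing overall accuracy can hide poor performance on the rare class, the following minimal sketch scores a trivial classifier that always predicts the majority class. The 1% positive rate, the label encoding (1 = minority, 0 = majority), and the helper names are assumptions chosen for illustration; they do not come from the experiments in this paper.

```python
def accuracy(y_true, y_pred):
    # Fraction of all instances that are labeled correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    # Fraction of the positive (minority) instances that are recovered.
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


# Hypothetical dataset: 10 minority instances (1) and 990 majority instances (0).
y_true = [1] * 10 + [0] * 990

# A degenerate learner that always predicts the majority class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)
print(majority_accuracy, minority_recall)
```

Under these assumptions the degenerate learner is correct on 990 of 1000 instances, yet it recovers none of the minority instances, which is the failure mode described above.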
Imbalance-resolving strategies can be categorized into
three types: re-sampling, cost-sensitive learning, and adjusting
algorithms to bias them towards the rare class. In the
re-sampling strategy, either instances are removed from the
negative/majority class, termed under-sampling, or we
over-sample by replicating the positive/minority examples in the
training set to inform the classifier of the importance of
these examples from the minority class. The re-sampling
can be a random or a heuristic-driven process. There are
also approaches that introduce artificial points into the space
bounded by the existing examples. The main drawback of the
under-sampling approach is that valuable information
disappears with the removed instances. The downside
of an over-sampling approach is the increased computational
time needed to process more data, and if artificial points
are introduced, it may lead to greater noise or overfitting.
Despite these weaknesses, both re-sampling strategies have
been shown to be helpful. In cost-sensitive learning,
we apply different weights to the training examples to
inform the classifier about the cost of misclassifying
different examples, so that the classifier pays more
attention to the minority target class. In this paper, we propose
an alternative imbalance-resolving strategy that extends the
training examples to a richer space. This approach
overcomes the drawbacks of most re-sampling approaches.
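The standard strategies described above can be sketched as follows. This is a minimal illustration of random under-sampling, random over-sampling by replication, and class weighting for cost-sensitive learning; it is not the alternative strategy proposed in this paper. Balancing the classes to a 1:1 ratio, the inverse-frequency weighting formula, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(majority, minority, rng):
    # Keep a random subset of the majority class the size of the minority class.
    # The discarded majority instances (and their information) are lost.
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    # Replicate randomly chosen minority instances until both classes match in size.
    # Assumes the majority class is at least as large as the minority class.
    shortfall = len(majority) - len(minority)
    replicas = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + replicas


def inverse_frequency_weights(labels):
    # Cost-sensitive alternative: weight each class by n / (k * count),
    # where n is the number of instances and k the number of classes.
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


# Hypothetical dataset: 95 negative/majority and 5 positive/minority instances.
rng = random.Random(0)
majority = [("neg", i) for i in range(95)]
minority = [("pos", i) for i in range(5)]

under_maj, under_min = random_undersample(majority, minority, rng)
over_maj, over_min = random_oversample(majority, minority, rng)
weights = inverse_frequency_weights(["neg"] * 95 + ["pos"] * 5)
print(len(under_maj), len(over_min), weights)
```

Under-sampling leaves 5 instances per class and discards 90 majority instances, while over-sampling grows the training set to 190 instances, which mirrors the information-loss and computational-cost trade-off noted above.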
Not every imbalanced dataset poses a problem for learning;
therefore, to understand more about imbalanced datasets,
we have also investigated the relationship between data
complexity and the performance of different imbalance
strategies.
The next section discusses the related work on imbalanced
datasets, followed by a section on data complexity. We
describe our proposed alternative method in section 4, then
our experimental setup. The results and evaluation are
presented in section 6. Lastly, we conclude with some
future work.
2. Related Work
[7] applied two strategies to solve the data imbalance
problem in nosocomial infection, namely, re-sampling and
an asymmetrical margin SVM. Their re-sampling strategy
Proceedings of the 2006 IEEE/WIC/ACM International Conference
on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06)
0-7695-2747-7/06 $20.00 © 2006