Abstract—Distribution of data stream is always changed in
the real world. This problem is usually defined as concept drift
[1]
. The state-of-the-art decision tree classification method
CVFDT
[2]
can solve the concept drift problem well, but the
efficiency is debased because of its general method of handling
instances in CVFDT without considering the types of concept
drift. In this paper, an algorithm called Efficient CVFDT
(E-CVFDT) is proposed to improve the efficiency of CVFDT.
E-CVFDT introduces cache mechanism and treats the
instances in three kinds of concept drift respectively, i.e.
accidental concept drift, gradual concept drift, instantaneously
concept drift. Besides, in E-CVFDT, the cached instances
which have similar attributes will be sent in batches to
calculate the information gain calculation rather than in
sequence adopted by CVFDT. The experiments are carried out
on the MOA platform. The results show that E-CVFDT
algorithm achieves not only better efficiency but also higher
accuracy than CVFDT algorithm.
I. INTRODUCTION
n recent years, with the rapid development of network and
information technology, the research of mining data
stream is becoming a new hot spot. Many organizations have
huge amount of data, commonly generated as a continuously
a sequence of examples from many different location, such as
Bank, Business, sensor network, and network event logs.
Concept drift data stream always appears in the real world,
such as the emergence of new network attacks, the changes of
user’s interests and the factors of economy change and so on.
The distribution of data stream is time-varying non-static.
Traditional classification method can’t address these drift
problem well. It is necessary to design and development some
fast, efficient and high accurate analysis techniques soon.
There are several methods in machine learning can deal with
concept drift
[8][9][10][11]
by using time windows or weighted
examples.
The paper presents an improved CVFDT method
E-CVFDT to address the concept drift problem. Frist, paper
discusses the three different scenes of concept drift
[4][5][6]
happening. They are accidental concept drift, gradual
concept drift and instantaneously concept drift. In order to
illustrate clearly, paper defines some parameters. Set an old
concept is C1, a new concept is C2, a slide window is W and
a user-supplied threshold is w.
This work was supported by the National Science Foundation (Grant
No.61133016), and the National High Technology Joint Research
Program of China (863 Program, Grant No. 2011AA010706).
In accidental concept drift scene, the C2 appears
probability is very small. The C2 examples may useless or
bad, but the traditional CVFDT method always make them
participate in information gain calculation with other
examples immediately. This causes bad performance
efficiency. To address this problem in this scene, the
proposed method E-CVFDT keeps the C2 examples in an
array, and drops it when it’s useless anymore. In gradual
concept drift scene, the C2 occurs evolving with C1, and they
alternately flow into classification model. After a short time,
C1 disappears and only C2 remains. In order to reduce the
performance efficiency of CVFDT in this scene, E-CVFDT
keeps these C2 examples in W. Until the |W|<w, E-CVFDT
sends these examples to participate in information gain
calculation. In instantaneously concept drift scene, the C1
disappear immediately when the C2 occurring, and there are
only C2 remains. In this scene, the traditional CVFDT
algorithm and E-CVFDT method achieve the similar
efficiency and accuracy in dealing with concept drift.
The paper implements this idea based on MOA
[3]
(Massive
Online Analysis) open source classification platform, and the
experiments results show that the E-CVFDT can achieve
better accuracy and efficiency than traditional CVFDT.
The paper discusses the traditional CVFDT algorithm
which is going to be improved and other decision tree
classification algorithm in Section II and describes the
performance of E-CVFDT in Section III. Finally experiments
this idea in Section IV.
II. R
ELATED WORK
Many researchers have made lots of works in mining data
stream, such as how to improve decision tree to deal with the
concept drift. VFDT is an incremental learning decision tree
algorithm, and CVFDT can deal with the concept drift in the
high speed data streams. They both use Greedy algorithm that
makes sprite decision top-down.
This paper discusses the state-of-the-art method CVFDT
and makes some researches with this algorithm. CVFDT is
the extension of VFDT. They have the similar performance
efficiency, and good accuracy. In order to deal with the
time-varying concept data, it creates a slide window to track
this change. When a new example S arrives, the CVFDT
increases the count number of the node which corresponding
to S and decreases the early example count numbers. If the
data distribution changes, CVFDT start to create a new
E-CVFDT: An Improving CVFDT Method for Concept Drift
Data Stream
Gang Liu, Hong-rong Cheng, Zhi-guang Qin, Qiao Liu, Cai-xia Liu
School of
Computer Science&Engineering,University of Electronic Science and Technology of China, Chengdu
I
315
978-1-4799-3051-7/13/$31.00 ©2013 IEEE