E-CVFDT: An Efficiency-Improved CVFDT Algorithm for Handling Concept Drift


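With the rapid development of network and information technology, the distribution of data in a stream changes over time. This phenomenon, usually called concept drift, is a major challenge for real-world data mining and machine learning. Existing decision tree classifiers such as CVFDT (Concept-adapting Very Fast Decision Tree) [2] cope with concept drift reasonably well, but they handle every instance in the same generic way without regard to the type of drift, which costs efficiency. The paper proposes a new algorithm, E-CVFDT (Efficient CVFDT), that aims to make CVFDT more efficient on data streams with concept drift. E-CVFDT introduces a cache mechanism and handles three types of drift separately: accidental, gradual, and instantaneous concept drift. Where CVFDT processes instances one at a time in arrival order, E-CVFDT sends cached instances with similar attributes to the information gain calculation in batches, which reduces unnecessary computation. Experiments on the MOA (Massive Online Analysis) platform show that E-CVFDT is both more efficient and more accurate than CVFDT.

The key ideas of E-CVFDT are:

1. **Cache mechanism**: instances that appear to belong to a new concept are held in a cache instead of being fed to the tree immediately, so instances that turn out to be useless can be dropped.
2. **Drift-type distinction**: accidental, gradual, and instantaneous concept drift are each handled with a different strategy.
3. **Batch computation**: cached instances with similar attributes are sent to the information gain calculation in batches rather than one by one, which lowers the computational cost.
4. **Validation on MOA**: experiments on the MOA platform report better efficiency and higher accuracy than CVFDT.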



Abstract: The distribution of a data stream changes constantly in the real world. This problem is usually defined as concept drift [1]. The state-of-the-art decision tree classification method CVFDT [2] handles concept drift well, but its efficiency suffers because it processes all instances in the same general way, without considering the type of concept drift. In this paper, an algorithm called Efficient CVFDT (E-CVFDT) is proposed to improve the efficiency of CVFDT. E-CVFDT introduces a cache mechanism and treats instances separately under three kinds of concept drift: accidental concept drift, gradual concept drift, and instantaneous concept drift. In addition, E-CVFDT sends cached instances that have similar attributes to the information gain calculation in batches, rather than in arrival order as CVFDT does. The experiments are carried out on the MOA platform. The results show that E-CVFDT achieves not only better efficiency but also higher accuracy than CVFDT.

I. INTRODUCTION

In recent years, with the rapid development of network and information technology, data stream mining has become a new research hot spot. Many organizations hold huge amounts of data, commonly generated as a continuous sequence of examples from many different locations, such as banks, businesses, sensor networks, and network event logs. Concept-drifting data streams appear constantly in the real world: examples include the emergence of new network attacks, changes in users' interests, and changes in economic factors. The distribution of such a data stream is time-varying and non-stationary. Traditional classification methods cannot address these drift problems well, so fast, efficient, and highly accurate analysis techniques need to be designed and developed. Several machine learning methods deal with concept drift [8][9][10][11] by using time windows or weighted examples.

This paper presents E-CVFDT, an improved CVFDT method, to address the concept drift problem. First, the paper discusses three different scenes in which concept drift occurs [4][5][6]: accidental concept drift, gradual concept drift, and instantaneous concept drift. To describe them clearly, the paper defines some parameters: let C1 be an old concept, C2 a new concept, W a sliding window, and w a user-supplied threshold.

This work was supported by the National Science Foundation (Grant No. 61133016), and the National High Technology Joint Research Program of China (863 Program, Grant No. 2011AA010706).

In the accidental concept drift scene, C2 appears with very small probability. The C2 examples may be useless or even harmful, but traditional CVFDT always makes them participate in the information gain calculation together with the other examples immediately, which hurts efficiency. To address this, the proposed E-CVFDT keeps the C2 examples in an array and drops them once they are no longer useful. In the gradual concept drift scene, C2 emerges alongside C1, and examples of the two concepts flow into the classification model alternately. After a short time, C1 disappears and only C2 remains. To improve on the efficiency of CVFDT in this scene, E-CVFDT keeps these C2 examples in W while |W| < w, and then sends them to participate in the information gain calculation. In the instantaneous concept drift scene, C1 disappears immediately when C2 occurs, and only C2 remains. In this scene, traditional CVFDT and E-CVFDT achieve similar efficiency and accuracy in dealing with concept drift.
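The caching behaviour described for the accidental and gradual scenes can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class name `DriftCache`, the `max_age` rule for deciding that a cached example is no longer useful, the `from_new_concept` flag (the text here does not say how C2 examples are recognised), and the choice to flush once |W| reaches w are all hypothetical. Sorting the batch by attribute values stands in for "grouping examples with similar attributes".

```python
from typing import Callable, Hashable, List, Tuple

Example = Tuple[tuple, Hashable]  # (attribute values, class label)


class DriftCache:
    """Hypothetical buffer W for examples that appear to belong to a new concept C2."""

    def __init__(self, w: int, max_age: int,
                 learn: Callable[[List[Example]], None]) -> None:
        self.w = w              # user-supplied threshold on |W|
        self.max_age = max_age  # assumption: older cached examples count as accidental drift
        self.learn = learn      # callback standing in for the information gain update
        self.window: List[Tuple[int, Example]] = []  # (arrival time, example)
        self.clock = 0
        self.dropped = 0

    def add(self, example: Example, from_new_concept: bool) -> None:
        self.clock += 1
        # Accidental drift: discard cached C2 examples that have waited too long.
        fresh = [(t, ex) for t, ex in self.window
                 if self.clock - t <= self.max_age]
        self.dropped += len(self.window) - len(fresh)
        self.window = fresh

        if not from_new_concept:
            # C1 examples are passed on immediately, one at a time.
            self.learn([example])
            return

        # Gradual drift: hold C2 examples until the window reaches the threshold.
        self.window.append((self.clock, example))
        if len(self.window) >= self.w:
            # Send the cached examples as one batch, ordered by attribute values
            # so that examples with similar attributes end up adjacent.
            batch = sorted((ex for _, ex in self.window), key=lambda ex: ex[0])
            self.window = []
            self.learn(batch)
```

In this sketch an isolated C2 example is dropped after `max_age` further arrivals without ever reaching `learn`, while a steady trickle of C2 examples is released as a single batch once `w` of them have accumulated. In the instantaneous scene every arrival is a C2 example, so the window fills and flushes continuously, which is consistent with the text's remark that the two algorithms then behave similarly.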

The paper implements this idea on the open source classification platform MOA [3] (Massive Online Analysis), and the experimental results show that E-CVFDT achieves better accuracy and efficiency than traditional CVFDT.

Section II discusses the traditional CVFDT algorithm, which this paper sets out to improve, along with other decision tree classification algorithms. Section III describes the performance of E-CVFDT. Finally, Section IV evaluates this idea experimentally.

II. RELATED WORK

Many researchers have worked on mining data streams, for example on how to improve decision trees to deal with concept drift. VFDT is an incremental decision tree learning algorithm, and CVFDT can deal with concept drift in high-speed data streams. Both use a greedy algorithm that makes split decisions top-down.
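As a reminder of what a greedy split decision based on information gain involves, the following minimal sketch computes the information gain of each attribute over a batch of examples and picks the best one. It is a plain batch computation written for illustration: the function names are hypothetical, attributes are assumed to be discrete, and the Hoeffding-bound test that VFDT applies before committing to a split is deliberately left out.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Example = Tuple[tuple, Hashable]  # (discrete attribute values, class label)


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(examples: Sequence[Example], attr: int) -> float:
    """Entropy reduction obtained by splitting the examples on attribute `attr`."""
    labels = [y for _, y in examples]
    groups: Dict[Hashable, List[Hashable]] = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(examples: Sequence[Example]) -> int:
    """Greedy choice: index of the attribute with the highest information gain."""
    n_attrs = len(examples[0][0])
    return max(range(n_attrs), key=lambda a: information_gain(examples, a))
```

For example, on the four examples `((0, 0), "a")`, `((0, 1), "a")`, `((1, 0), "b")`, `((1, 1), "b")`, attribute 0 separates the classes perfectly while attribute 1 carries no information, so `best_split` selects attribute 0.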

This paper discusses the state-of-the-art method CVFDT and studies this algorithm further. CVFDT is an extension of VFDT; the two have similar efficiency and good accuracy. To deal with data whose concept varies over time, CVFDT maintains a sliding window to track the change. When a new example S arrives, CVFDT increments the counts at the nodes corresponding to S and decrements the counts for the oldest examples. If the data distribution changes, CVFDT starts to create a new

E-CVFDT: An Improving CVFDT Method for Concept Drift Data Stream

Gang Liu, Hong-rong Cheng, Zhi-guang Qin, Qiao Liu, Cai-xia Liu

School of Computer Science & Engineering, University of Electronic Science and Technology of China, Chengdu

tany.liu.gang@gmail.com
