Concept Drift Adaptation: Challenges and Strategies in Online Learning

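This document is "771-A Survey on Concept Drift Adaptation.pdf". It addresses concept drift in online supervised learning, where the relation between the input data and the target variable changes over time. The authors are João Gama (University of Porto, Portugal), Indre Žliobaitė (Aalto University, Finland), Albert Bifet (Yahoo! Research Barcelona, Spain), Mykola Pechenizkiy (Eindhoven University of Technology, the Netherlands) and Abdelhamid Bouchachia (Bournemouth University, UK). The survey introduces the adaptive learning process, categorizes strategies for handling concept drift, reviews representative algorithms and techniques, discusses how adaptive algorithms are evaluated, and presents a number of application examples. Its aim is to give researchers, industry analysts and practitioners an integrated view of the state of the art and of benchmarks in concept drift, consolidating research that is currently scattered.

The document covers the following key points:

1. **Concept drift**: in data stream mining and online learning, the statistical relation between the input features and the output target changes over time. A model built on the past data distribution cannot respond well to the new patterns, so its predictive performance degrades.
2. **Online supervised learning**: unlike offline learning, the model continuously receives new examples and is updated immediately, which lets it follow a changing environment. Under concept drift the model must be able to adjust quickly to the new relation.
3. **Adaptive learning process**: a dynamic learning strategy in which the model keeps adjusting its structure and parameters according to feedback from new data, so that the learning system can cope with changes in the data distribution.
4. **Strategies for handling concept drift**: the article categorizes these strategies, which may include early detection, resampling, model retraining and ensemble learning. Each has its own characteristics and use cases.
5. **Representative algorithms and techniques**: the summary names methods such as Adaptive Random Forests, the Drift Detection Method (DDM), the Early Drift Detection Method (EDDM), Hoeffding Trees and Adaptive Learning Machines.
6. **Evaluation methods**: evaluating adaptive algorithms is difficult because both real-time performance and the ability to handle drift must be considered. The article may cover measures such as the Kappa statistic and window-based evaluation.
7. **Application examples**: the document may include cases from domains such as financial transactions, network intrusion detection and social media analysis.
8. **Benchmarks**: to support further research, the article points to benchmark data sets on which different concept drift adaptation algorithms can be tested and compared.

From this survey, readers can gain a comprehensive understanding of how to design and evaluate adaptive machine learning models for continuously changing data.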

1:8 J. Gama et al.

but still abrupt switch to burning. One challenge for learning is that the feedback (the ground truth of mass flow) is not available at all; it can only be approximately estimated by retrospectively inspecting the historical data. An additional challenge is to deal with specific one-sided outliers that can be easily mistaken for changes.

Traditional approaches (such as ADWIN) for explicit change detection based on the monitoring of the raw sensor signal or streaming error of the regressors give reasonable results. They can be improved by considering the peculiarities of the application.
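As a rough illustration of explicit change detection on a streaming error signal, the sketch below implements a simplified adaptive-window detector in the spirit of ADWIN. It is not the ADWIN algorithm itself, which compresses the window into buckets for efficiency: this version stores a bounded window and checks every split point with a Hoeffding-style threshold. The class name `SimpleAdaptiveWindow` and the parameters `delta` and `max_len` are assumptions made for illustration, and the monitored values are assumed to lie in [0, 1].

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Simplified ADWIN-style detector for values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.002, max_len=200):
        self.delta = delta                    # confidence parameter (assumed default)
        self.window = deque(maxlen=max_len)   # bounded window of recent values

    def update(self, value):
        """Add a value; return True if a change was detected and old data dropped."""
        self.window.append(value)
        data = list(self.window)
        n = len(data)
        total = sum(data)
        head_sum = 0.0
        for cut in range(1, n):
            head_sum += data[cut - 1]
            n0, n1 = cut, n - cut
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)   # effective size of the two sub-windows
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps:
                # The two sub-windows differ significantly: forget the older one.
                for _ in range(cut):
                    self.window.popleft()
                return True
        return False
```

Feeding the detector the 0/1 errors (or scaled absolute errors) of a regressor lets the window shrink after a change and grow again while the error stays stable. Application knowledge, such as ignoring known one-sided outliers before the update, would be added in front of `update`.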

2.5.2. Management and strategic planning. The Smart Grid (SG) is an electric system that uses two-way digital information, cyber-secure communication technologies, and computational intelligence in an integrated fashion across heterogeneous and distributed electricity generation, transmission, distribution and consumption to achieve energy efficiency. A key and novel characteristic of SGs is the intelligent layer that analyzes the data produced by smart meters, allowing companies to develop powerful new capabilities in terms of grid management, planning and customer services for energy efficiency. The advent of SGs has changed the way energy is produced, priced and billed. The key aspect of SGs is distributed energy production, namely from renewable energies. The penetration of renewable energies (solar, wind, etc.) is increasing fast, and power forecasting becomes an important factor in defining the operation planning policies to be adopted by a Transmission System Operator.

When observing the literature in wind power prediction [Monteiro et al. 2009], one realizes that most proposals are based on an off-line training mode, building a static model that is then used to produce predictions. This option relies on assumptions of stationarity of the wind electric power model, which must be strongly questioned [Bremnes 2004; Bessa et al. 2009]. Using real data from three distinct wind parks, [Bessa et al. 2009] presents the merits of on-line training against off-line training of neural networks. The authors point out the evolving nature of data and the presence of concept drift in wind pattern behavior.
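The contrast between off-line and on-line training can be sketched on synthetic data. The example below is not the neural-network setup of [Bessa et al. 2009]: it swaps in a one-parameter linear model trained by stochastic gradient descent, and the stream, the slopes and the function names are all assumptions chosen for illustration. Both models are evaluated in a test-then-train fashion after an abrupt change in the input-output relation.

```python
import random


def drifting_stream(n, seed=0):
    """Synthetic stream y = a * x, where the slope a switches from 2.0 to 0.5 halfway."""
    rng = random.Random(seed)
    for t in range(n):
        x = rng.uniform(0.0, 1.0)
        a = 2.0 if t < n // 2 else 0.5
        yield x, a * x


def compare_static_vs_online(n=2000, lr=0.5):
    """Return mean absolute error after the drift for a static and an online model."""
    w_static, w_online = 0.0, 0.0
    err_static, err_online = 0.0, 0.0
    for t, (x, y) in enumerate(drifting_stream(n)):
        if t >= n // 2:
            # Test first: both models predict before seeing the true value.
            err_static += abs(w_static * x - y)
            err_online += abs(w_online * x - y)
        # Online model: one gradient step on the squared error per observation.
        w_online -= lr * (w_online * x - y) * x
        # Static model: trained the same way, then frozen after the first quarter.
        if t < n // 4:
            w_static -= lr * (w_static * x - y) * x
    after_drift = n - n // 2
    return err_static / after_drift, err_online / after_drift
```

In this toy setting the frozen model keeps predicting with the old slope, while the online model recovers within a few dozen observations of the change.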

2.5.3. Personal assistance and information. Text classification has been a popular topic in machine learning for decades. However, interesting applications related to the problem of concept drift appeared relatively recently. Examples of text stream applications include e-mail classification [Carmona-Cejudo et al. 2010], e-mail spam detection [Lindstrom et al. 2010] and sentiment classification [Bifet and Frank 2010]. Sentiment classification is a popular task in social media monitoring, customer feedback analysis and other applications.

The main sources of concept drift in e-mail classification and spam filtering are changing e-mail content and presentation (virtual drift), as well as the adaptive behaviour of spammers trying to overcome spam filters (which may be virtual or real drift). Besides, users may change their attitude towards particular categories of e-mails, starting or stopping to consider them spam (real drift). In sentiment classification the vocabulary used to express positive and negative sentiments may change over time. Since the collection of documents is not static (virtual drift, novelties), the feature space representing the current collection is dynamic, which may require specific updates of the models.
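One way to picture a model over a dynamic feature space is a text classifier whose vocabulary grows as new words arrive. The sketch below is a minimal incremental multinomial Naive Bayes with add-one smoothing; it is an illustration under our own assumptions (the class name `IncrementalNB`, tokenized input, no forgetting of old words), not a method taken from the cited works.

```python
import math
from collections import defaultdict


class IncrementalNB:
    """Multinomial Naive Bayes over a vocabulary that grows with the stream."""

    def __init__(self):
        self.class_docs = defaultdict(int)        # documents seen per class
        self.class_tokens = defaultdict(int)      # tokens seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def learn(self, tokens, label):
        self.class_docs[label] += 1
        for tok in tokens:
            self.vocab.add(tok)                   # new words extend the feature space
            self.word_counts[label][tok] += 1
            self.class_tokens[label] += 1

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab) + 1                   # one extra slot for unseen words
        best, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label].get(tok, 0)
                score += math.log((count + 1) / (self.class_tokens[label] + v))
            if score > best_score:
                best, best_score = label, score
        return best
```

Because the smoothing denominator uses the current vocabulary size, previously unseen words can be scored immediately and start contributing evidence as soon as one labelled example contains them.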

Various adaptive learning strategies have been used in this domain, including individual methods like case-based reasoning, and ensembles, either evolving or with an explicit detection of changes by means of change detectors (Section 3.2).

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2013.

Availability of feedback is a serious challenge in personal assistance and information. The dilemma is that if feedback is easily available, there is no need for automated predictions. In e-mail classification we can hope that from time to time we will receive feedback from the user in case of misclassifications, or we can design an active learning system (e.g. [Zliobaite et al. 2013]), which from time to time asks the user to provide labels on demand. However, when possible we need to aim at automatic ways of obtaining the true labels.
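A label-on-demand policy can be sketched as a rule that asks the user only when the classifier is uncertain and a labelling budget has not been exhausted. This is a generic uncertainty-with-budget rule written for illustration, not the specific strategies of [Zliobaite et al. 2013]; the function names and the `budget` and `threshold` defaults are assumptions.

```python
def should_request_label(confidence, n_requested, n_seen, budget=0.1, threshold=0.7):
    """Decide whether to ask the user for the true label of the current item.

    confidence: classifier's confidence in its prediction, in [0, 1]
    n_requested, n_seen: labels requested so far, items seen so far
    budget: maximum fraction of items the user may be asked about (assumed)
    threshold: ask only when confidence falls below this value (assumed)
    """
    spent = n_requested / n_seen if n_seen else 0.0
    return spent < budget and confidence < threshold


def simulate(confidences, budget=0.1, threshold=0.7):
    """Return the indices at which a label would be requested."""
    asked = []
    for i, c in enumerate(confidences):
        if should_request_label(c, len(asked), i, budget, threshold):
            asked.append(i)
    return asked
```

With a constant low confidence and a 10% budget, the rule spaces its requests out so that the user is asked about roughly one item in ten; with high confidence it never asks.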

Suppose that, to monitor the attitude of people towards a political party, we want to classify the polarity or sentiment of tweets from Twitter. Labelling tweets manually as positive or negative is a laborious and expensive task. However, tweets may have author-provided sentiment indicators: changing sentiment is implicit in the use of various types of emoticons. Hence we may use these to label the training data. Smileys or emoticons are visual cues that are associated with emotional states. They are constructed using the characters available on a standard keyboard, representing a facial expression of emotion. By using emoticons, authors of tweets annotate their own text with an emotional state. Annotated tweets can be used to train a sentiment classifier.
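Emoticon-based labelling can be sketched in a few lines. The emoticon lists below are a small assumed sample, and removing the emoticon from the text before training (so that the classifier has to learn from the remaining words) is our own design choice rather than something stated above; `label_by_emoticon` is a hypothetical helper name.

```python
POSITIVE = (":)", ":-)", ":D", "=)")   # assumed sample of positive emoticons
NEGATIVE = (":(", ":-(")               # assumed sample of negative emoticons


def label_by_emoticon(text):
    """Return (label, cleaned_text); label is None if there is no usable cue."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or a mixed signal: leave the tweet unlabelled.
        return None, text
    label = "positive" if has_pos else "negative"
    cleaned = text
    for e in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(e, "")
    # Collapse the whitespace left behind by the removed emoticons.
    return label, " ".join(cleaned.split())
```

Tweets that come back with a label can be fed to the learner as training examples, while unlabelled ones are only classified.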

Building a content-based filter for adaptive news access presents a rather different perspective on text classification in streaming settings. The goal is to learn incrementally and keep up to date a user model for news story classification. A simple yet effective approach has been proposed in [Billsus and Pazzani 2000]. For each user an adaptive learning system is built, consisting of a simple ensemble with separate models for the short-term and long-term interests of the user. A stable Naive Bayes classifier is used for modelling the long-term interests of a user, and the Nearest Neighbour classifier captures the short-term interests of the user. For the short-term interests model a fixed-size window over the liked news stories is maintained and/or instances are weighted with respect to their age. No explicit change detection is used for monitoring either the short-term or the long-term interests. The true labels of some of the instances come naturally from positive relevance feedback, i.e. a user accessing a particular news item provides the signal that the item is relevant to his or her interests.
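The two-horizon idea can be sketched as follows. This is a loose illustration, not the system of [Billsus and Pazzani 2000]: the Jaccard similarity, the rule of falling back to the long-term model when no recent story is similar enough, the window over all rated stories, and the names `TwoHorizonUserModel`, `window` and `min_sim` are all assumptions. The long-term part is a crude Naive Bayes over word presence counts.

```python
import math
from collections import Counter, deque


def jaccard(a, b):
    """Similarity of two word sets: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0


class TwoHorizonUserModel:
    def __init__(self, window=5, min_sim=0.3):
        self.recent = deque(maxlen=window)     # short-term: last rated stories
        self.min_sim = min_sim                 # minimum similarity to trust a neighbour
        self.counts = {True: Counter(), False: Counter()}   # long-term word counts
        self.docs = {True: 0, False: 0}

    def learn(self, words, liked):
        words = set(words)
        self.recent.append((words, liked))
        self.counts[liked].update(words)
        self.docs[liked] += 1

    def _long_term(self, words):
        vocab = len(set(self.counts[True]) | set(self.counts[False])) + 1
        scores = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            s = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
            for w in words:
                s += math.log((self.counts[label][w] + 1) / (total + vocab))
            scores[label] = s
        return scores[True] > scores[False]

    def predict(self, words):
        words = set(words)
        best = max(self.recent, key=lambda item: jaccard(words, item[0]), default=None)
        if best is not None and jaccard(words, best[0]) >= self.min_sim:
            return best[1]                     # nearest recent story decides
        return self._long_term(words)          # otherwise use the long-term model
```

The window makes the short-term model forget old stories without any change detector, while the long-term counts keep accumulating over the whole stream.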

On the other hand, recommender systems are a broad application area in the personal assistance and information category [Bobadilla et al. 2013; Adomavicius and Tuzhilin 2005]. Interest of the data mining community in the recommender systems domain has been boosted by the NetFlix competition (www.netflixprize.com). One of the lessons learnt by the winning teams was that taking temporal dynamics into account substantially contributes towards building accurate models. Modelling user interests and handling concept drift were the other interesting aspects. In collaborative filtering, modelling of user interests relies primarily on the availability of other ratings already provided by the users. In a realistic application case, the data is highly imbalanced. Some movies are very popular, while most of the movies are not; some users rate many movies, but many others rate only a few. The rating matrix is high-dimensional and extremely sparse, containing only about 1% of non-zero elements. Such properties make most supervised learning techniques inapplicable and motivate the development of advanced collaborative filtering approaches. The sources and the nature of change can be diverse. Both items and users are changing over time. The item-side effects include first of all changing product perception and popularity. Popularity of some movies is expected to follow seasonal patterns. The user-side effects include changing tastes and preferences of users, some of which may be short-term or contextual and therefore likely reoccurring (mood, activity, company), changing perception of the rating scale, a possible change of rater within a household, and similar problems. The winning team developed an ensemble approach including multiple models for handling these various kinds of changes. As suggested in [Koren 2010], popular windowing and instance weighting approaches for handling concept drift are not the best choice for every kind of changing behaviour, simply because in collaborative filtering the relations between ratings are of primary importance for predictive modelling.
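To make the item-side temporal effect concrete, the sketch below fits a baseline in which each item has a separate bias per time bin, so that older ratings are kept rather than discarded by a window. It is a deliberately simple illustration using plain averages, not the model of [Koren 2010] or of the winning teams; the function names and the `(user, item, time, rating)` tuple layout are assumptions.

```python
from collections import defaultdict


def fit_binned_item_bias(ratings, bin_size):
    """Fit a global mean and one bias per (item, time bin).

    ratings: iterable of (user, item, time, rating) with integer times.
    Returns (mu, bias), where bias[(item, bin)] is the mean deviation of the
    item's ratings from the global mean mu within that time bin.
    """
    ratings = list(ratings)
    mu = sum(r for _, _, _, r in ratings) / len(ratings)
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, t, r in ratings:
        key = (item, t // bin_size)
        sums[key] += r - mu
        counts[key] += 1
    return mu, {k: sums[k] / counts[k] for k in sums}


def predict_rating(mu, bias, item, t, bin_size):
    """Predict with the bias of the item's time bin, or the global mean if unseen."""
    return mu + bias.get((item, t // bin_size), 0.0)
```

A movie whose popularity changes between two periods gets a different bias in each bin, while all ratings still contribute to the global mean.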
