大数据挖掘与KDD：机遇与技术挑战

需积分: 9 104 浏览量更新于2024-12-31 收藏 1.8MB PDF 举报

"《数据挖掘与知识发现：承诺与挑战》是Usama Fayyad撰写的一篇文章，发表在1997年的FutureGenerationComputerSystems杂志第13卷，99-115页。该论文探讨了随着数据库规模的急剧增长，传统数据分析和可视化方法已经无法应对时，数据挖掘（Data Mining）和知识发现（Knowledge Discovery in Databases, KDD）的重要性及所面临的挑战。数据挖掘技术源自统计学、模式识别、数据库管理、人工智能等多个领域，其目标是通过复杂的数据处理和分析，从海量数据中发现有价值的信息模式和规律。这些技术包括聚类分析、关联规则学习、分类、预测模型等，它们在市场趋势分析、客户行为理解、医疗诊断、金融风险评估等领域展现出巨大的潜力。文章首先概述了数据挖掘作为一个跨学科研究领域的快速发展，强调了它在商业智能、决策支持和知识管理中的核心作用。接着，作者详细介绍了几种关键的数据挖掘技术，并展示了它们在实际应用中的具体实例，如信用卡欺诈检测、电子商务推荐系统等。同时，文中着重讨论了高性能计算和并行计算在数据挖掘问题中的关键角色。随着数据量的增长，传统的单线程处理方式已难以满足实时性和效率的要求。并行和分布式计算提供了处理大规模数据的强大工具，如分布式数据库、MapReduce等，极大地提升了数据挖掘的性能和可扩展性。然而，尽管数据挖掘带来巨大机遇，但也面临诸多挑战。其中包括数据质量的保证（如噪声、缺失值处理）、模型解释性与可理解性的需求、隐私保护问题，以及如何将理论研究成果转化为实用的商业应用等。此外，随着技术的不断进步，如何保持算法的实时性、适应性和鲁棒性，以及如何处理不断变化的业务环境，都是当前和未来需要解决的关键问题。《数据挖掘与知识发现：承诺与挑战》这篇论文为我们提供了一个全面的视角，揭示了数据挖掘作为一种新兴技术的前景，以及在这个快速发展的领域中面临的诸多挑战。这对于研究人员、技术人员以及企业决策者来说，都是理解并利用数据价值的重要参考文献。"

102

U. Fayyad, I? Stolorz/Future Generation Computer Systems 13 (1997) 99-115

Fig. 3. An overview of the steps comprising the KDD process.

also gained popularity in the database field. The earli-

est uses of the term come from statistics and the usage

in most cases was associated with negative connota-

tions of blind exploration of data without a priori hy-

potheses to verify. However, notable exceptions can

be found. For example, as early as 1978, Learner [ 161

used the term in a positive sense in a demonstration of

how generalized linear regression can be used to solve

problems that were very difficult for humans and for

traditional statistical techniques of that time to solve.

The term KDD was coined at the first KDD workshop

in 1989 [ 191 to emphasize that “knowledge” is the end

product of a data-driven discovery.

In our view KDD refers to the overall process of

discovering useful knowledge from data while data

mining refers to a particular step in this process. Data

mining is the application of specific algorithms for ex-

tracting patterns from data. The additional steps in the

KDD process, such as data preparation, data selection,

data cleaning, incorporating appropriate prior knowl-

edge, and proper interpretation of the results of min-

ing, are essential to ensure that useful knowledge is

derived from the data. Blind application of data min-

ing methods (rightly criticized as “data dredging” in

the statistical literature) can be a dangerous activity

easily leading to discovery of meaningless patterns.

We give an overview of the KDD process in Fig. 3.

Note that in the KDD process, one typically iterates

many times over previous steps and the process is

fairly messy with plenty of experimentation. For ex-

ample, one may select, sample, clean, and reduce data

only to discover after mining that one or several of the

previous steps need to be redone. We have omitted ar-

rows illustrating these potential iterations to keep the

figure simple.

3.1. Basic definitions

We adopt the definitions of KDD and data mining

provided in [7, Chap. 11.

Knowledge discovery in databases is the nontriv-

ial process of identifying valid, novel, potentially

useful, and ultimately understandable patterns in

data.

Here data is a set of facts (e.g., cases in a database)

and pattern is an expression in some language repre-

senting a parsimonious description of a subset of the

data or a model applicable to that subset. Hence, in

our usage here, extracting a pattern also designates

fitting a model to data, finding structure from data, or

in general any high-level description of a set of data.

The term process implies that KDD is comprised of

many steps, which involve data preparation, search for

patterns, knowledge evaluation, and refinement, all re-

peated in multiple iterations. By nontrivial we mean

that some search or inference is involved, i.e., it is not

a straightforward computation of predefined quantities

like computing the average value of a set of numbers.

The discovered patterns should be valid on new data

with some degree of certainty. We also want patterns

to be novel (at least to the system, and preferably to

the user) and potentially useful, i.e., lead to some ben-

efit to the user/task. Finally, the patterns should be

understandable, if not immediately then after some

post-processing.

While it is possible to define quantitative measures

for certainty (e.g., estimated prediction accuracy on

new data) or utility (e.g., gain perhaps in dollars saved

due to better predictions or speed-up in response time

of a system), notions such as novelty and understand-

ability are much more subjective. In certain contexts

understandability can be estimated by simplicity (e.g.,

the number of bits to describe a pattern). An important

notion, called interestingness (e.g., see [20] and refer-

ences within), is usually taken as an overall measure

of pattern value, combining validity, novelty, useful-

ness, and simplicity. Interestingness functions can be

explicitly defined or can be manifested implicitly via

剩余16页未读，继续阅读

咬玉米

粉丝: 0
资源: 1

大数据挖掘与KDD：机遇与技术挑战

Data Mining and Knowledge Discovery Handbook

data-mining-conferences:排名，接受率，截止日期和发布提示

KDD：开源角色扮演游戏主控管理器

KDD:2021年夏季在弗里德里希-亚历山大大学埃尔兰根-纽伦堡分校举行的数据库知识发现讲座的讲义

Data Mining and Bioinformatics.ppt

关联规则的matlab代码-DataMining-ID2222:数据挖掘ID2222

基于逻辑的数据挖掘方法（Data Mining and Knowledge Discovery via Logic-Based Methods）

Data Mining Concepts-Methods and Applications

kdd2015:争取KDDCUP 2015

KDD2015:2015年KDD杯

最新资源