Getting Started with Data Mining: A Walkthrough of the English Original of Concepts and Techniques

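"Data Mining Concepts and Techniques 3rd edition" is an English-language original text on data mining, written by experts to introduce the fundamentals and techniques of this fast-growing field. The book covers the definition, applications, and techniques of data mining, along with related topics such as data preprocessing and data warehousing. Data mining is the process of analyzing large volumes of data to discover valuable patterns or knowledge hidden within them.

Chapter 1, "Introduction," explains why data mining is needed and what it is, what kinds of data can be mined (structured, semi-structured, and unstructured), what kinds of patterns can be discovered (such as association rules, clusters, and classifications), and which technologies are used, such as machine learning, statistical analysis, and artificial intelligence. The chapter also discusses application areas such as marketing, financial risk assessment, and medical research, and raises key issues in data mining, including data quality, privacy protection, and the interpretation of results.

Chapter 2, "Getting to Know Your Data," discusses how to understand and describe data. The authors introduce data objects and attribute types, and show how to compute basic statistical descriptions of data, such as the mean and variance. The chapter stresses the role of visualization tools at this stage in revealing data distributions and trends. It also explains methods for measuring data similarity and dissimilarity, which are essential for clustering algorithms and other pattern recognition techniques.

Chapter 3, "Data Preprocessing," covers an essential step in the data mining workflow: data cleaning to remove noise and inconsistencies, data integration to combine data from multiple sources, data reduction to lower complexity, and data transformation and discretization to improve algorithm performance. The chapter explains in detail why these preprocessing steps matter and how to carry them out.

Chapter 4, "Data Warehousing and Online Analytical Processing," focuses on the basic concepts of the data warehouse, a common setting for data mining. It describes data warehouse modeling in detail, in particular data cubes and OLAP (online analytical processing), as well as the design, usage, and implementation of data warehouses, and it emphasizes their role in decision support systems.

The book gives comprehensive coverage of data mining and suits both beginners and professionals as a reference. By reading it, readers can learn the basic theory, practical techniques, and related tools of data mining, and so discover and use information effectively in the era of big data.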

Foreword

Christos Faloutsos

Carnegie Mellon University

Analyzing large amounts of data is a necessity. Even popular science books, like “Super Crunchers,” give compelling cases where large amounts of data yield discoveries and intuitions that surprise even experts. Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with cybersecurity and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and many more. Storage is inexpensive and getting even cheaper, as are data sensors. Thus, collecting and storing data is easier than ever before.

The problem then becomes how to analyze the data. This is exactly the focus of this Third Edition of the book. Jiawei, Micheline, and Jian give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines).

The exposition is extremely accessible to beginners and advanced readers alike. The book gives the fundamental material first and the more advanced material in follow-up chapters. It also has numerous rhetorical questions, which I found extremely helpful for maintaining focus.

We have used the first two editions as textbooks in data mining courses at Carnegie Mellon and plan to continue to do so with this Third Edition. The new version has significant additions: Notably, it has more than 100 citations to works from 2006 onward, focusing on more recent material such as graphs and social networks, sensor networks, and outlier detection. This book has a new section for visualization, has expanded outlier detection into a whole chapter, and has separate chapters for advanced methods—for example, pattern mining with top-k patterns and more, and clustering methods with biclustering and graph clustering.

Overall, it is an excellent book on classic and modern data mining methods, and it is ideal not only for teaching but also as a reference book.

Foreword to Second Edition

We are deluged by data—scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades.

Six years ago, Jiawei Han's and Micheline Kamber's seminal textbook organized and presented Data Mining. It heralded a golden age of innovation in the field. This revision of their book reflects that progress; more than half of the references and historical notes are to recent work. The field has matured with many new and improved algorithms, and has broadened to include many more datatypes: streams, sequences, graphs, time-series, geospatial, audio, images, and video. We are certainly not at the end of the golden age—indeed research and commercial interest in data mining continues to grow—but we are all fortunate to have this modern compendium.

The book gives quick introductions to database and data mining concepts with particular emphasis on data analysis. It then covers in a chapter-by-chapter tour the concepts and techniques that underlie classification, prediction, association, and clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and with pragmatic rules of thumb about when to apply each technique. The Socratic presentation style is both very readable and very informative. I certainly learned a lot from reading the first edition and got re-educated and updated in reading the second edition.

Jiawei Han and Micheline Kamber have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed on the field. The field is evolving very rapidly, but this book is a quick way to learn the basic ideas, and to understand where the field is today. I found it very informative and stimulating, and believe you will too.

Jim Gray

In his memory

Preface

The computerization of our society has substantially enhanced our capabilities for both generating and collecting data from diverse sources. A tremendous amount of data has flooded almost every aspect of our lives. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge. This has led to the generation of a promising and flourishing frontier in computer science called data mining, and its various applications. Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.

This book explores the concepts and techniques of knowledge discovery and data mining. As a multidisciplinary field, data mining draws on work from areas including statistics, machine learning, pattern recognition, database technology, information retrieval, network science, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We focus on issues relating to the feasibility, usefulness, effectiveness, and scalability of techniques for the discovery of patterns hidden in large data sets. As a result, this book is not intended as an introduction to statistics, machine learning, database systems, or other such areas, although we do provide some background knowledge to facilitate the reader's comprehension of their respective roles in data mining. Rather, the book is a comprehensive introduction to data mining. It is useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines previously listed.

Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium. This book presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this book was the need to build an organized framework for the study of data mining—a challenging task, owing to the extensive multidisciplinary nature of this fast-developing field. We hope that this book will encourage people with different backgrounds and experiences to exchange their views regarding data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.

Organization of the Book

Since the publication of the first two editions of this book, great progress has been made in the field of data mining. Many new data mining methodologies, systems, and applications have been developed, especially for handling new kinds of data, including information networks, graphs, complex structures, and data streams, as well as text, Web, multimedia, time-series, and spatiotemporal data. Such fast development and rich, new technical contents make it difficult to cover the full spectrum of the field in a single book. Instead of continuously expanding the coverage of this book, we have decided to cover the core material in sufficient scope and depth, and leave the handling of complex data types to a separate forthcoming book.

The third edition substantially revises the first two editions of the book, with numerous enhancements and a reorganization of the technical contents. The core technical material, which handles mining on general data types, is expanded and substantially enhanced. Several individual chapters for topics from the second edition (e.g., data preprocessing, frequent pattern mining, classification, and clustering) are now augmented and each split into two chapters for this new edition. For these topics, one chapter encapsulates the basic concepts and techniques while the other presents advanced concepts and methods.

Chapters from the second edition on mining complex data types (e.g., stream data, sequence data, graph-structured data, social network data, and multirelational data, as well as text, Web, multimedia, and spatiotemporal data) are now reserved for a new book that will be dedicated to advanced topics in data mining. Still, to support readers in learning such advanced topics, we have placed an electronic version of the relevant chapters from the second edition onto the book's web site as companion material for the third edition.

The chapters of the third edition are described briefly as follows, with emphasis on the new material.

Chapter 1 provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of information technology, which has led to the need for data mining, and the importance of its applications. It examines the data types to be mined, including relational, transactional, and data warehouse data, as well as complex data types such as time-series, sequences, data streams, spatiotemporal data, multimedia data, text data, graphs, social networks, and Web data. The chapter presents a general classification of data mining tasks, based on the kinds of knowledge to be mined, the kinds of technologies used, and the kinds of applications that are targeted. Finally, major challenges in the field are discussed.

Chapter 2 introduces the general data features. It first discusses data objects and attribute types and then introduces typical measures for basic statistical data descriptions. It overviews data visualization techniques for various kinds of data. In addition to methods of numeric data visualization, methods for visualizing text, tags, graphs, and multidimensional data are introduced. Chapter 2 also introduces ways to measure similarity and dissimilarity for various kinds of data.
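
As a rough illustration of the kinds of measures Chapter 2 covers, the Python sketch below computes two basic statistical descriptions (mean and population variance) and two common proximity measures (Euclidean distance and cosine similarity). The sample values and function names are assumptions made for this example; they are not taken from the book.

```python
import math
from statistics import mean, pvariance


def euclidean_distance(x, y):
    """Dissimilarity between two numeric objects with the same attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Similarity based on the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical values of one numeric attribute (e.g., customer ages).
ages = [23, 27, 31, 35, 44]
summary = {"mean": mean(ages), "variance": pvariance(ages)}
print(summary)                                # mean 32, variance 52
print(euclidean_distance((0, 0), (3, 4)))     # 5.0
```

Euclidean distance grows with dissimilarity, while cosine similarity grows with similarity, so which one fits depends on whether magnitude or direction of the attribute vector matters for the data at hand.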

Chapter 3 introduces techniques for data preprocessing. It first introduces the concept of data quality and then discusses methods for data cleaning, data integration, data reduction, data transformation, and data discretization.
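
To make three of these steps concrete, here is a minimal Python sketch of one simple option for each: filling missing values with the attribute mean (cleaning), min-max normalization (transformation), and equal-width binning (discretization). These are common textbook choices picked for illustration, with hypothetical function names and data; the book discusses many alternatives.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the observed values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no observed values to compute a mean from")
    m = sum(known) / len(known)
    return [m if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def equal_width_bins(values, k):
    """Data discretization: map each value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical attribute with one missing value.
raw = [20.0, None, 40.0, 60.0]
cleaned = fill_missing_with_mean(raw)
normalized = min_max_normalize(cleaned)
bins = equal_width_bins(cleaned, 2)
print(cleaned, normalized, bins)
```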

Chapter 4 and Chapter 5 provide a solid introduction to data warehouses, OLAP (online analytical processing), and data cube technology. Chapter 4 introduces the basic concepts, modeling, design architectures, and general implementations of data warehouses and OLAP, as well as the relationship between data warehousing and other data generalization methods. Chapter 5 takes an in-depth look at data cube technology, presenting a detailed study of methods of data cube computation, including Star-Cubing and high-dimensional OLAP methods. Further explorations of data cube and OLAP technologies are discussed, such as sampling cubes, ranking cubes, prediction cubes, multifeature cubes for complex analysis queries, and discovery-driven cube exploration.

Chapter 6 and Chapter 7 present methods for mining frequent patterns, associations, and correlations in large data sets. Chapter 6 introduces fundamental concepts, such as market basket analysis, with many techniques for frequent itemset mining presented in an organized way. These range from the basic Apriori algorithm and its variations to more advanced methods that improve efficiency, including the frequent pattern growth approach, frequent pattern mining with vertical data format, and mining closed and max frequent itemsets. The chapter also discusses pattern evaluation methods and introduces measures for mining correlated patterns. Chapter 7 is on advanced pattern mining methods. It discusses methods for pattern mining in multilevel and multidimensional space, mining rare and negative patterns, mining colossal patterns and high-dimensional data, constraint-based pattern mining, and mining compressed or approximate patterns. It also introduces methods for pattern exploration and application, including semantic annotation of frequent patterns.
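
For readers who have not met Apriori before, the following Python sketch shows its level-wise idea on a toy market basket data set: generate candidate k-itemsets from frequent (k-1)-itemsets, prune any candidate with an infrequent subset, then count support. It is a simplified illustration, not the book's pseudocode; the transactions, the `apriori` function name, and the absolute support threshold are assumptions for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count support of the surviving candidates with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support count of 3, this toy data yields four frequent single items and four frequent pairs; the only candidate triple, {bread, milk, diapers}, appears in just two baskets and is discarded.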

Chapter 8 and Chapter 9 describe methods for data classification. Due to the importance and diversity of classification methods, the contents are partitioned into two chapters. Chapter 8 introduces basic concepts and methods for classification, including decision tree induction, Bayes classification, and rule-based classification. It also discusses model evaluation and selection methods and methods for improving classification accuracy, including ensemble methods and how to handle imbalanced data. Chapter 9 discusses advanced methods for classification, including Bayesian belief networks, the neural network technique of backpropagation, support vector machines, classification using frequent patterns, k-nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Additional topics include multiclass classification, semi-supervised classification, active learning, and transfer learning.

Cluster analysis forms the topic of Chapter 10 and Chapter 11. Chapter 10 introduces the basic concepts and methods for data clustering, including an overview of basic cluster analysis methods, partitioning methods, hierarchical methods, density-based methods, and grid-based methods. It also introduces methods for the evaluation of clustering. Chapter 11 discusses advanced methods for clustering, including probabilistic model-based clustering, clustering high-dimensional data, clustering graph and network data, and clustering with constraints.

Chapter 12 is dedicated to outlier detection. It introduces the basic concepts of outliers and outlier analysis and discusses various outlier detection methods from the view of degree of supervision (i.e., supervised, semi-supervised, and unsupervised methods), as well as from the view of approaches (i.e., statistical methods, proximity-based methods, clustering-based methods, and classification-based methods). It also discusses methods for mining contextual and collective outliers, and for outlier detection in high-dimensional data.
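
As one small example of the statistical family mentioned above, the Python sketch below flags values whose z-score exceeds a threshold. It assumes the attribute is roughly normally distributed, and the readings, the function name, and the threshold of 2.5 are all hypothetical choices for illustration rather than recommendations from the book.

```python
from statistics import fmean, pstdev


def z_score_outliers(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
print(z_score_outliers(readings, threshold=2.5))  # [(9, 95)]
```

Because the mean and standard deviation are themselves pulled toward extreme values, a z-score test on a small sample can miss outliers at the usual threshold of 3, which is why a lower threshold is used in this toy run.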

Finally, in Chapter 13, we discuss trends, applications, and research frontiers in data mining. We briefly cover mining complex data types, including mining sequence data (e.g., time series, symbolic sequences, and biological sequences), mining graphs and networks, and mining spatial, multimedia, text, and Web data. In-depth treatment of data mining methods for such data is left to a book on advanced topics in data mining, the writing of which is in progress. The chapter then moves ahead to cover other data mining methodologies, including statistical data mining, foundations of data mining, visual and audio data mining, as well as data mining applications. It discusses data mining for financial data analysis, for industries like retail and telecommunication, for use in science and engineering, and for intrusion detection and prevention. It also discusses the relationship between data mining and recommender systems. Because data mining is present in many aspects of daily life, we discuss issues regarding
