Introductory Textbook on Big Data Mining: Web and Large-Scale Data Analysis

2.86 MB PDF, updated 2024-07-19

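Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, is an authoritative, freely available textbook on data-mining techniques for the big-data era. The book grew out of years of teaching at Stanford University. It was originally designed for the advanced graduate course "Web Mining", but its content gradually became accessible to advanced undergraduates, and the course material was expanded as the author team grew. Its core focus is mining very large data sets, that is, data too large to fit in main memory. The authors folded material on network analysis and large-scale data-mining projects into the book. The courses involved are CS224W (network analysis), CS345A/CS246 (the updated version of Web Mining), and CS341 (a large-scale data-mining project course).

The book covers the basic principles, algorithms, and techniques of data mining, and how to apply them in practice to extract valuable information and knowledge from large data sets such as Internet data and social-network data. The main topics are:

1. **Overview of data mining**: introduces the concept of data mining and its importance in modern information technology, in particular how analyzing massive data reveals patterns, associations, and trends.
2. **Data acquisition and storage**: discusses how to collect data from the Internet and other large data sources, and introduces distributed file systems and the MapReduce framework (for example, as implemented in Hadoop) for storing and processing large-scale data.
3. **Data preprocessing**: explains data cleaning, integration, transformation, and reduction, which improve data quality and make the data suitable for later analysis.
4. **Frequent-pattern mining**: examines methods such as the Apriori algorithm for finding association rules in market-basket analysis and for discovering user-behavior patterns in social networks.
5. **Cluster analysis**: introduces the K-means algorithm and other clustering techniques that organize data points automatically according to their similarity.
6. **Classification and regression**: covers algorithms such as decision trees, naive Bayes, and support-vector machines for prediction and classification tasks, especially text classification and sentiment analysis.
7. **Network analysis**: focuses on applications of graph theory to understanding social networks, recommendation systems, and information diffusion.
8. **Stream processing**: addresses the particular challenges of real-time data streams and introduces window models and real-time computation frameworks.
9. **Case studies**: the book includes many practical cases showing how to make data-driven decisions and set strategy in areas such as e-commerce, social networks, and search-engine optimization.
10. **Large-scale project practice**: through the CS341 course, readers have the opportunity to take part in real data-mining projects and build practical skills.

Mining of Massive Datasets is both theoretically deep and practice-oriented. It is suitable for graduate study and is also a valuable reference for data science and machine learning, helping readers master the key techniques for extracting value from massive data.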

CONTENTS

10.6.3 Exercises for Section 10.6
10.7 Counting Triangles
10.7.1 Why Count Triangles?
10.7.2 An Algorithm for Finding Triangles
10.7.3 Optimality of the Triangle-Finding Algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Exercises for Section 10.7
10.8 Neighborhood Properties of Graphs
10.8.1 Directed Graphs and Neighborhoods
10.8.2 The Diameter of a Graph
10.8.3 Transitive Closure and Reachability
10.8.4 Transitive Closure Via MapReduce
10.8.5 Smart Transitive Closure
10.8.6 Transitive Closure by Graph Reduction
10.8.7 Approximating the Sizes of Neighborhoods
10.8.8 Exercises for Section 10.8
10.9 Summary of Chapter 10
10.10 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors of Symmetric Matrices
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

12 Large-Scale Machine Learning
12.1 The Machine-Learning Model
12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1
12.2 Perceptrons
12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2
12.3 Support-Vector Machines
12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3
12.4 Learning from Nearest Neighbors
12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4
12.5 Comparison of Learning Methods
12.6 Summary of Chapter 12
12.7 References for Chapter 12

CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. ✷

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.
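To make the idea of a directly designed resume detector concrete, here is a minimal sketch in Python. It is only an illustration: the phrase list, the threshold, and the function names `resume_score` and `looks_like_resume` are all assumptions, since the text does not say which words or rules such hand-built algorithms actually used.

```python
import re

# Hypothetical phrase list; the source does not name the actual words used.
RESUME_PHRASES = (
    "work experience",
    "education",
    "skills",
    "references",
    "objective",
    "curriculum vitae",
)


def resume_score(page_text: str) -> int:
    """Count how many distinct resume-like phrases occur in the page text."""
    text = page_text.lower()
    return sum(
        1
        for phrase in RESUME_PHRASES
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )


def looks_like_resume(page_text: str, threshold: int = 3) -> bool:
    """Flag a page as a resume when enough distinct phrases appear.

    The threshold of 3 is an arbitrary illustrative choice.
    """
    return resume_score(page_text) >= threshold


if __name__ == "__main__":
    page = (
        "Objective: data engineer. Education: BSc. "
        "Work Experience: 5 years. Skills: SQL."
    )
    print(resume_score(page), looks_like_resume(page))
```

No training set is involved: the designer's knowledge of what a resume contains is written directly into the phrase list, which is the situation the paragraph describes.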

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
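The query described above can be sketched in a few lines of Python. The text does not say why the computed values might differ from the best-fitting Gaussian; one possible source, assumed here purely for illustration, is the choice of denominator: dividing by n gives the maximum-likelihood standard deviation of a Gaussian, while dividing by n - 1 gives the usual sample standard deviation. The function name `mean_and_std` and the `ddof` parameter are illustrative choices, not from the book.

```python
import math
import random


def mean_and_std(values, ddof=0):
    """Return the average and standard deviation of a list of numbers.

    ddof=0 divides by n (the maximum-likelihood estimate for a Gaussian);
    ddof=1 divides by n - 1 (the usual sample standard deviation).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - ddof)
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for n in (10, 100_000):
        data = [rng.gauss(5.0, 2.0) for _ in range(n)]
        _, std_n = mean_and_std(data, ddof=0)
        _, std_n1 = mean_and_std(data, ddof=1)
        # The gap between the two conventions shrinks as n grows.
        print(n, abs(std_n - std_n1))
```

The two results differ by a factor of sqrt(n / (n - 1)), which approaches 1 for large n, in line with the remark that the values will be very close when the data is large.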

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
