Large-Scale Data Mining: In-Depth Exploration and Applications

PDF · 2.63 MB · updated 2024-07-28

"大数据挖掘" 《大数据挖掘》这本书是Anand Rajaraman和Jeﬀrey D. Ullman共同创作的，他们分别来自Kosmix公司和斯坦福大学。该书的版权于2010年和2011年由作者持有。这本书源于斯坦福大学一门名为"Web Mining"的课程，旨在作为高级研究生课程，但同样吸引了许多优秀的本科生。书的内容主要集中在大规模数据的挖掘上，特别关注那些无法一次性装入内存的海量数据。由于对规模的强调，书中很多实例都与互联网或源自互联网的数据有关。作者从算法的角度出发，将数据挖掘视为一种对数据应用算法的过程，而非仅用于训练机器学习引擎的方法。书中涉及的主要主题包括： 1. 分布式文件系统和MapReduce：这是一种用于创建能处理大量数据的并行算法工具。MapReduce是一种编程模型，它简化了在大规模数据集上执行并行计算的复杂性，通过“映射”（map）和“归约”（reduce）两个阶段，使得处理过程可以分布式进行，非常适合处理大数据。 2. 相似性搜索：这是数据挖掘中的关键领域，包括了诸如余弦相似度、Jaccard相似度等技术，用于找出数据集中相似的元素或对象。在网页链接分析、推荐系统和图像识别等领域有着广泛应用。 3. 图数据模型和图算法：书中可能涵盖了如PageRank这样的算法，它是Google搜索引擎排名的重要组成部分，用于评估网页的重要性。图数据模型能够有效地表示和分析网络结构，比如社交网络和互联网的拓扑结构。 4. 数据聚类：通过无监督学习方法，如K-means、DBSCAN等，将数据点分组成具有相似特性的群体，帮助发现数据的内在结构和模式。 5. 降维技术：如主成分分析(PCA)和奇异值分解(SVD)，这些技术可以减少数据的复杂性，同时保持其关键信息，有助于提高分析效率和可视化效果。 6. 异常检测：寻找数据集中不符合正常模式的异常点，这在欺诈检测、故障诊断等领域非常有用。 7. 机器学习基础：尽管本书更注重算法而非机器学习，但可能会涵盖一些基础的监督和非监督学习算法，如决策树、朴素贝叶斯和神经网络等。《大数据挖掘》是一本深入探讨大数据处理技术的教材，对于想要理解如何在大规模数据集上进行有效分析和挖掘的读者来说，是一份宝贵的资源。书中结合理论与实践，介绍了处理海量数据的核心工具和技术，对于从事大数据分析、数据科学以及相关领域的专业人士来说，具有很高的学习价值。


2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
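A directly designed resume detector of the kind contrasted with machine learning here could look roughly like the sketch below. The term list, the threshold `min_hits`, and the function name `looks_like_resume` are all assumptions chosen for illustration; the text does not say which words or rules were actually used.

```python
# Hypothetical terms that commonly appear in resumes (an assumption).
RESUME_TERMS = ("education", "work experience", "skills", "references", "objective")

def looks_like_resume(page_text, min_hits=3):
    """Flag a page as a resume if it contains at least min_hits of the terms."""
    text = page_text.lower()
    hits = sum(1 for term in RESUME_TERMS if term in text)
    return hits >= min_hits

page = "Objective: data engineer. Education: BSc. Skills: Python, SQL."
print(looks_like_resume(page))                     # three terms match
print(looks_like_resume("Top ten pasta recipes"))  # no terms match
```

The point of the sketch is only that the rule can be written down directly, because people already know what a resume contains; no training data is involved.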

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
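As a minimal sketch of this "model as the answer to a query" view, the snippet below computes the average and standard deviation of a small list of numbers. The values in `data` are made up for illustration (the numbers of Example 1.1 are not reproduced in this excerpt), and the function name `summarize` is an assumption. It divides by n rather than n - 1, which is one of two common conventions.

```python
import math

def summarize(values):
    """Return (mean, population standard deviation) of a non-empty sequence."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mean, std = summarize(data)
print(mean, std)  # prints 5.0 2.0
```

The two returned numbers are the whole "model" in this sense: a compact answer to a query over the data, computed without any training step.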

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of "baskets" of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these "frequent itemsets" are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an on-line store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for "similar" customers and recommend something many of these customers have bought. This process is called "collaborative filtering." If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
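The two kinds of feature extraction listed above can be sketched on toy data. The first function counts pairs of items that appear together in at least `min_support` baskets; the second measures the fraction of elements two sets have in common (the Jaccard similarity). This is a naive in-memory illustration, not one of the large-scale methods the book develops in later chapters, and the function names, baskets, and customer names are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

def jaccard(a, b):
    """Fraction of elements two sets share: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

baskets = [  # hypothetical market baskets
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "mustard"},
    {"hamburger", "buns", "ketchup"},
]
print(frequent_pairs(baskets, min_support=3))

purchases = {  # hypothetical customers represented as sets of items bought
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "mug"},
    "carol": {"tent", "stove"},
}
print(jaccard(purchases["alice"], purchases["bob"]))
```

On this data only the pair (hamburger, ketchup) reaches the support threshold, and the first two customers share two of the four distinct items they bought between them.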

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including "Bonferroni's Principle," a warning against overzealous use of data mining.
