Mining of Massive Datasets: Large-Scale Data Processing and Algorithmic Applications


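Mining of Massive Datasets is a classic text by Anand Rajaraman of Kosmix, Inc. and Jeffrey D. Ullman of Stanford University. The book grew out of CS345A, “Web Mining,” an advanced graduate course the authors taught at Stanford; although the course was designed for graduate students, it gradually drew interest from advanced undergraduates as well. The course centers on data mining of very large amounts of data, especially data that does not fit in main memory. The book focuses on data mining in the big-data era, particularly the analysis of Web data and data derived from the Web, and it emphasizes algorithmic methods rather than using data to train machine-learning models. The book covers the following core topics:

1. Distributed file systems and MapReduce: the key tools for building systems that process massive data in parallel. MapReduce provides a programming model in which a complex data-processing job is broken into many small, scalable tasks that run across multiple machines.
2. Similarity search: a key data-mining technique for finding similar items or patterns in large collections, such as the Web pages most relevant to a user's query in a search engine. The book covers similarity measures in depth, along with the indexes and data structures that make search efficient.
3. Data compression and deduplication: when processing large-scale data, effective compression and deduplication are essential for reducing storage needs and speeding up processing. The authors explain the principles behind these techniques and how they are applied in practice.
4. Inverted indexes and document retrieval: an inverted index is an efficient data structure for quickly locating the documents that contain a given keyword, which is essential for information retrieval and text mining.
5. Bayesian networks and probabilistic graphical models: statistical modeling tools used for prediction, classification, and reasoning under uncertainty, especially in recommendation systems and ad personalization.
6. High-dimensional data and dimensionality reduction: for data sets with many features, methods such as PCA (principal component analysis) and SVD (singular value decomposition) help visualize the data and uncover latent structure.
7. Social-network analysis: by analyzing user behavior and connections, the book explores applications of social-network mining to recommendation systems, community detection, and influence propagation.
8. Real-time stream processing: with the growth of the Internet of Things and real-time data, processing continuous data streams has become a challenge. The book covers stream-computing frameworks and techniques.
9. Generalization and error analysis: how to preserve model performance and generalization in large-scale data mining and avoid overfitting.

Mining of Massive Datasets provides not only the theoretical foundations but also many practical cases and examples that help readers build working skills for large-scale data. It suits students and professionals with a serious interest in data science, machine learning, and information technology. Whether you want to understand the algorithms behind big-data processing or apply data-mining techniques in a real project, this book is an essential reference.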

4 CHAPTER 1. DATA MINING

The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an on-line store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
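Both kinds of feature extraction can be illustrated on toy data. The following is a minimal sketch and not the method the book develops in later chapters: `frequent_pairs` counts item pairs by brute force, and `jaccard` (intersection size divided by union size) is one assumed way to measure "a relatively large fraction of their elements in common." All function names and sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Fraction of elements shared by two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customer, purchases, k=2):
    """Return the k other customers whose purchase sets overlap most."""
    scored = [
        (jaccard(purchases[customer], items), name)
        for name, items in purchases.items()
        if name != customer
    ]
    # Highest similarity first; break ties by name for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


baskets = [
    {"hamburger", "ketchup", "bun"},
    {"hamburger", "ketchup"},
    {"ketchup", "milk"},
    {"hamburger", "bun"},
]
print(frequent_pairs(baskets, min_support=2))

purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "lamp"},
    "cat": {"tent"},
    "dan": {"pen", "tent", "book"},
}
print(most_similar("ann", purchases, k=1))
```

Brute-force pair counting is only workable for small data; it is shown here because it makes the definition of a frequent itemset concrete, not because it scales.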

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events

hidden within massive amounts of data. This section is a discussion of the

problem, including “Bonferroni’s Principle,” a warning against overzealous use

of data mining.


1.2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. This idea naturally caused great concern among privacy advocates, and the project, called TIA, or Total Information Awareness, was eventually killed by Congress, although it is unclear whether the project in fact exists under another name. It is not the purpose of this book to discuss the difficult issue of the privacy-security tradeoff. However, the prospect of TIA or a system like it does raise technical questions about its feasibility and the realism of its assumptions.

The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit? The answer is that it all depends on how narrowly you define the activities that you look for. Statisticians have seen this problem in many guises and have a theory, which we introduce in the next section.

1.2.2 Bonferroni’s Principle

Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are “bogus,” in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren’t. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Without going into the statistical details, we offer an informal version, Bonferroni’s principle, that helps us avoid treating random occurrences as if they were real. Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni’s principle.
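The text deliberately skips the statistical details. For readers who want to see the correction itself, here is a minimal sketch of its standard form: when m hypotheses are tested together at an overall significance level alpha, each individual test is held to the stricter threshold alpha/m. The function name `bonferroni_significant` and the sample p-values are made up for illustration and do not come from the book.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction.

    With m tests, each p-value must be at most alpha / m to count as
    significant, which keeps the chance of any false positive below alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four searches through the same data.
p_values = [0.001, 0.02, 0.04, 0.0004]

# Without correction, all four fall below 0.05 and would look significant.
print([p <= 0.05 for p in p_values])
# With correction, the threshold drops to 0.05 / 4 = 0.0125.
print(bonferroni_significant(p_values))
```

The more events you search for, the smaller the per-test threshold becomes, which is the formal counterpart of the informal principle: look only for events rare enough that random data is unlikely to produce them.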

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni’s principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data. We shall give an extended example in the next section.


1.2.3 An Example of Bonferroni’s Principle

Suppose there are believed to be some “evil-doers” out there, and we want to detect them. Suppose further that we have reason to believe that periodically, evil-doers gather at a hotel to plot their evil. Let us make the following assumptions about the size of the problem:

1. There are one billion people who might be evil-doers.

2. Everyone goes to a hotel one day in 100.

3. A hotel holds 100 people. Hence, there are 100,000 hotels – enough to

hold the 1% of a billion people who visit a hotel on any given day.

4. We shall examine hotel records for 1000 days.

To find evil-doers in this data, we shall look for people who, on two different days, were both at the same hotel. Suppose, however, that there really are no evil-doers. That is, everyone behaves at random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of the $10^5$ hotels at random. Would we find any pairs of people who appear to be evil-doers?

We can do a simple approximate calculation as follows. The probability of any two people both deciding to visit a hotel on any given day is 0.0001. The chance that they will visit the same hotel is this probability divided by $10^5$, the number of hotels. Thus, the chance that they will visit the same hotel on one given day is $10^{-9}$. The chance that they will visit the same hotel on two different given days is the square of this number, $10^{-18}$. Note that the hotels can be different on the two days.

Now, we must consider how many events will indicate evil-doing. An “event” in this sense is a pair of people and a pair of days, such that the two people were at the same hotel on each of the two days. To simplify the arithmetic, note that for large $n$, $\binom{n}{2}$ is about $n^2/2$. We shall use this approximation in what follows. Thus, the number of pairs of people is $\binom{10^9}{2} = 5 \times 10^{17}$. The number of pairs of days is $\binom{1000}{2} = 5 \times 10^5$. The expected number of events that look like evil-doing is the product of the number of pairs of people, the number of pairs of days, and the probability that any one pair of people and pair of days is an instance of the behavior we are looking for. That number is

$$5 \times 10^{17} \times 5 \times 10^5 \times 10^{-18} = 250{,}000$$

That is, there will be a quarter of a million pairs of people who look like evil-doers, even though they are not.

Now, suppose there really are 10 pairs of evil-doers out there. The police will need to investigate a quarter of a million other pairs in order to find the real evil-doers. In addition to the intrusion on the lives of half a million innocent people, the work involved is sufficiently great that this approach to finding evil-doers is probably not feasible.
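The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch and not part of the original text: the function name `expected_bogus_pairs` and its parameter names are assumptions chosen for illustration, with defaults taken from the four assumptions listed above. By default it uses the text's approximation of $n^2/2$ for the number of pairs, and it uses exact rational arithmetic to avoid floating-point rounding.

```python
from fractions import Fraction
from math import comb


def expected_bogus_pairs(people=10**9, days=1000, hotels=10**5,
                         p_visit=Fraction(1, 100), exact=False):
    """Expected number of (pair of people, pair of days) events in random data."""
    # Both people visit some hotel on a given day, and pick the same one.
    p_same_hotel_one_day = p_visit * p_visit / hotels
    # The same coincidence happens on each of two given days.
    p_event = p_same_hotel_one_day ** 2
    if exact:
        pairs_of_people = comb(people, 2)
        pairs_of_days = comb(days, 2)
    else:
        # The text's approximation: C(n, 2) is about n^2 / 2 for large n.
        pairs_of_people = Fraction(people ** 2, 2)
        pairs_of_days = Fraction(days ** 2, 2)
    return pairs_of_people * pairs_of_days * p_event


bogus = expected_bogus_pairs()
print(bogus)  # 250000

# Fraction of suspicious pairs that are real if 10 true pairs exist.
real_pairs = 10
print(float(real_pairs / (bogus + real_pairs)))
```

Changing the keyword arguments shows how sensitive the count is to the size of the data, since the number of days and the number of people each enter the product quadratically.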


1.2.4 Exercises for Section 1.2

Exercise 1.2.1: Using the information from Section 1.2.3, what would be the number of suspected pairs if the following changes were made to the data (and all other numbers remained as they were in that section)?

(a) The number of days of observation was raised to 2000.

(b) The number of people observed was raised to 2 billion (and there were therefore 200,000 hotels).

(c) We only reported a pair as suspect if they were at the same hotel at the same time on three different days.

! Exercise 1.2.2: Suppose we have information about the supermarket purchases of 100 million people. Each person goes to the supermarket 100 times in a year and buys 10 of the 1000 items that the supermarket sells. We believe that a pair of terrorists will buy exactly the same set of 10 items (perhaps the ingredients for a bomb?) at some time during the year. If we search for pairs of people who have bought the same set of items, would we expect that any such people found were truly terrorists?

Footnote to Exercise 1.2.2: That is, assume our hypothesis that terrorists will surely buy a set of 10 items in common at some time during the year. We don’t want to address the matter of whether or not terrorists would necessarily do so.

1.3 Things Useful to Know

In this section, we offer brief introductions to subjects that you may or may not have seen in your study of other courses. Each will be useful in the study of data mining. They include:

1. The TF.IDF measure of word importance.

2. Hash functions and their use.

3. Secondary storage (disk) and its effect on running time of algorithms.

4. The base e of natural logarithms and identities involving that constant.

5. Power laws.

1.3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic. For instance, articles about baseball would tend to have many occurrences of words like “ball,” “bat,” “pitch,” “run,” and so on. Once we have classified documents to determine they are about baseball, it is not hard to notice that words such as these appear unusually frequently. However, until we have made the classification, it is not possible to identify these words as characteristic.

Thus, classification often starts by looking at documents, and finding the significant words in those documents. Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly the opposite of the truth. The most frequent words will most surely be the common words such as “the” or “and,” which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.
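The effect of stop words is easy to see on a single sentence. The sketch below is illustrative only: the six-word `STOP_WORDS` set is a made-up stand-in for the several hundred words a real list would hold, and the function name `most_common_terms` and the sample sentence are likewise assumptions.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; real lists hold several hundred words.
STOP_WORDS = frozenset({"the", "and", "a", "in", "of", "to"})


def most_common_terms(text, k=3, stop_words=frozenset()):
    """Return the k most frequent lowercase tokens, skipping stop_words."""
    tokens = [t.strip('.,;:!?"\'').lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in stop_words)
    return [term for term, _ in counts.most_common(k)]


text = "The pitcher threw the ball and the batter hit the ball to the fence."

# Raw counts put a stop word on top.
print(most_common_terms(text, k=1))  # ['the']
# After removing stop words, a topical word surfaces.
print(most_common_terms(text, k=1, stop_words=STOP_WORDS))  # ['ball']
```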

In fact, the indicators of the topic are relatively rare words. However, not all rare words are equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that appear rarely in a collection of documents, yet do not tell us anything useful. On the other hand, a word like “chukker” is probably equally rare, but tips us off that the document is about the sport of polo. The difference between rare words that tell us something and those that do not has to do with the concentration of the useful words in just a few documents. That is, the presence of a word like “albeit” in a document does not make it terribly more likely that it will appear multiple times. However, if an article mentions “chukker” once, it is likely to tell us what happened in the “first chukker,” then the “second chukker,” and so on. That is, the word is likely to be repeated if it appears at all.

The formal measure of how concentrated into relatively few documents are the occurrences of a given word is called TF.IDF (Term Frequency times Inverse Document Frequency). It is normally computed as follows. Suppose we have a collection of $N$ documents. Define $f_{ij}$ to be the frequency (number of occurrences) of term (word) $i$ in document $j$. Then, define the term frequency $\mathrm{TF}_{ij}$ to be:

$$\mathrm{TF}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

That is, the term frequency of term $i$ in document $j$ is $f_{ij}$ normalized by dividing it by the maximum number of occurrences of any term (perhaps excluding stop words) in the same document. Thus, the most frequent term in document $j$ gets a TF of 1, and other terms get fractions as their term frequency for this document.

The IDF for a term is defined as follows. Suppose term $i$ appears in $n_i$ of the $N$ documents in the collection. Then $\mathrm{IDF}_i = \log_2(N/n_i)$. The TF.IDF score for term $i$ in document $j$ is then defined to be $\mathrm{TF}_{ij} \times \mathrm{IDF}_i$. The terms with the highest TF.IDF score are often the terms that best characterize the topic of the document.
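The two definitions translate directly into code. The following is a minimal sketch under stated assumptions: documents are already split into lists of tokens, stop words are not excluded from the maximum, and the function name `tf_idf` and the four-document corpus are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF.IDF scores for a list of tokenized documents.

    TF is a term's count divided by the largest count of any term in the
    same document; IDF is log2(N / n_i), where n_i is the number of
    documents containing the term. Returns one {term: score} dict per doc.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        if not counts:
            scores.append({})
            continue
        max_count = max(counts.values())
        scores.append({
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["the", "chukker", "polo", "chukker"],
    ["the", "ball"],
    ["the", "bat"],
    ["the", "run"],
]
scores = tf_idf(docs)
print(max(scores[0], key=scores[0].get))  # chukker
print(scores[0]["the"])                   # 0.0
```

A word that appears in every document, such as “the” here, gets an IDF of $\log_2(N/N) = 0$ and therefore a TF.IDF of 0, however often it occurs.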

Example 1.3: Suppose our repository consists of $2^{20} = 1{,}048{,}576$ documents. Suppose word $w$ appears in $2^{10} = 1024$ of these documents. Then $\mathrm{IDF}_w = \log_2(2^{20}/2^{10}) = \log_2(2^{10}) = 10$.
