Big Data Mining: Massive Data Processing and Algorithm Applications

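Mining of Massive Datasets is a textbook based on the Stanford University computer science course CS246 (Mining Massive Datasets) and the advanced graduate course CS345A (Web Mining). It is aimed at undergraduates and assumes no specialized background, while still offering in-depth material for studying data mining. The authors, Anand Rajaraman and Jeffrey D. Ullman, wrote it from course material they developed at Stanford over many years, with particular attention to very large data, meaning data that does not fit in main memory. The book stresses applications of data mining in big-data settings, especially Internet data such as Web pages and data derived from them. Its emphasis is on algorithmic methods rather than on using data to train some kind of machine-learning model. The main topics include:

1. Distributed file systems and MapReduce: a key tool for designing parallel algorithms that can handle large-scale data. MapReduce processes data efficiently in a distributed environment by splitting a job into tasks that run in parallel.

2. Similarity search: a core data-mining technique for quickly finding similar items or patterns in large amounts of data. Common algorithms include locality-sensitive hashing (LSH); these techniques matter for recommender systems, search engine optimization, and social-network analysis.

3. Load balancing and partitioning strategies: how to split and manage massive data effectively so that no single node is overloaded and the system stays stable and efficient.

4. Relational and NoSQL databases: how different kinds of databases store and query large-scale data, and how to choose a database system suited to large-scale data mining.

5. Data preprocessing: cleaning, integration, transformation, and normalization, the steps needed to extract valuable information from raw data.

6. Principal component analysis (PCA), cluster analysis, and association-rule mining: statistical and machine-learning methods that play a key role in discovering the inherent structure and patterns in data.

7. Predictive models: time-series analysis and regression models used to predict future trends or behavior, widely applied in business intelligence and data analysis.

8. Real-time stream processing: with the rise of the Internet of Things and social networks, how to process continuously generated data streams in real time to support real-time decisions and analysis.

9. Scalability and fault tolerance: as data grows, how to keep a data-mining system running through hardware failures or fluctuations in traffic.

Through this book, readers can learn how to use modern techniques and algorithms to meet big-data challenges and to build efficient, scalable solutions for data-intensive applications. It also serves as a rich reference for further exploration of data science.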

CHAPTER 1. DATA MINING

The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
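The similar-items idea can be illustrated with a small sketch. It assumes that "a relatively large fraction of their elements in common" is measured by Jaccard similarity (intersection size divided by union size), which is an assumption made here for illustration; the customer names, purchases, and function names are hypothetical. The comparison is brute force over all customers, so it shows only the idea, not a method that scales to massive data.

```python
from collections import Counter


def jaccard(a, b):
    """Fraction of elements two sets share: |a & b| / |a | b| (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(purchases, customer, k=2):
    """Return the k customers whose purchase sets are most similar to `customer`'s."""
    target = purchases[customer]
    ranked = sorted(
        ((jaccard(target, items), other)
         for other, items in purchases.items() if other != customer),
        reverse=True,
    )
    return [(other, sim) for sim, other in ranked[:k]]


def recommend(purchases, customer, k=2):
    """Suggest items bought by the k most similar customers but not by `customer`."""
    counts = Counter()
    for other, _ in most_similar(purchases, customer, k):
        counts.update(purchases[other] - purchases[customer])
    return [item for item, _ in counts.most_common()]


# Hypothetical purchase histories: each customer is the set of items bought.
purchases = {
    "alice": {"hamburger", "ketchup", "mustard"},
    "bob": {"hamburger", "ketchup", "buns"},
    "carol": {"novel", "bookmark"},
    "dave": {"hamburger", "ketchup", "mustard", "buns"},
}

neighbors = most_similar(purchases, "alice")
suggestions = recommend(purchases, "alice")
print(neighbors, suggestions)
```

Each customer is represented by a short list of similar customers rather than by membership in a single cluster, which matches the point made above about customers with many different interests.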

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.

1.2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. This idea naturally caused great concern among privacy advocates, and the project, called TIA, or Total Information Awareness, was eventually killed by Congress, although it is unclear whether the project in fact exists under another name. It is not the purpose of this book to discuss the difficult issue of the privacy-security tradeoff. However, the prospect of TIA or a system like it does raise technical questions about its feasibility and the realism of its assumptions.

The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities, or even illicit activities that are not terrorism, that will result in visits from the police and maybe worse than just a visit? The answer is that it all depends on how narrowly you define the activities that you look for. Statisticians have seen this problem in many guises and have a theory, which we introduce in the next section.

1.2.2 Bonferroni’s Principle

Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are “bogus,” in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren’t. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data. Without going into the statistical details, we offer an informal version, Bonferroni’s principle, that helps us avoid treating random occurrences as if they were real. Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni’s principle.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni’s principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data. We shall give an extended example in the next section.

1.2.3 An Example of Bonferroni’s Principle

Suppose there are believed to be some “evil-doers” out there, and we want to detect them. Suppose further that we have reason to believe that periodically, evil-doers gather at a hotel to plot their evil. Let us make the following assumptions about the size of the problem:

1. There are one billion people who might be evil-doers.

2. Everyone goes to a hotel one day in 100.

3. A hotel holds 100 people. Hence, there are 100,000 hotels, enough to hold the 1% of a billion people who visit a hotel on any given day.

4. We shall examine hotel records for 1000 days.

To find evil-doers in this data, we shall look for people who, on two different days, were both at the same hotel. Suppose, however, that there really are no evil-doers. That is, everyone behaves at random, deciding with probability 0.01 to visit a hotel on any given day, and if so, choosing one of the $10^5$ hotels at random. Would we find any pairs of people who appear to be evil-doers?

We can do a simple approximate calculation as follows. The probability of any two people both deciding to visit a hotel on any given day is .0001. The chance that they will visit the same hotel is this probability divided by $10^5$, the number of hotels. Thus, the chance that they will visit the same hotel on one given day is $10^{-9}$. The chance that they will visit the same hotel on two different given days is the square of this number, $10^{-18}$. Note that the hotels can be different on the two days.

Now, we must consider how many events will indicate evil-doing. An “event” in this sense is a pair of people and a pair of days, such that the two people were at the same hotel on each of the two days. To simplify the arithmetic, note that for large $n$, $\binom{n}{2}$ is about $n^2/2$. We shall use this approximation in what follows. Thus, the number of pairs of people is $\binom{10^9}{2} = 5 \times 10^{17}$. The number of pairs of days is $\binom{1000}{2} = 5 \times 10^{5}$. The expected number of events that look like evil-doing is the product of the number of pairs of people, the number of pairs of days, and the probability that any one pair of people and pair of days is an instance of the behavior we are looking for. That number is

$$5 \times 10^{17} \times 5 \times 10^{5} \times 10^{-18} = 250{,}000$$

That is, there will be a quarter of a million pairs of people who look like evil-doers, even though they are not.

Now, suppose there really are 10 pairs of evil-doers out there. The police will need to investigate a quarter of a million other pairs in order to find the real evil-doers. In addition to the intrusion on the lives of half a million innocent people, the work involved is sufficiently great that this approach to finding evil-doers is probably not feasible.
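The arithmetic of this example can be checked with a short sketch. The function name and parameters below are hypothetical, chosen only for illustration; the inputs are the four assumptions listed above. The `exact` flag is an addition that uses exact binomial coefficients in place of the $n^2/2$ approximation, to show that the approximation barely changes the result.

```python
from math import comb


def expected_bogus_events(people, days, p_visit, hotels, exact=False):
    """Expected number of (pair of people, pair of days) events under pure randomness."""
    # Both people visit some hotel on a given day (p_visit squared),
    # and they pick the same one of the `hotels` hotels.
    p_same_hotel_one_day = p_visit ** 2 / hotels
    # The same coincidence on two given days (the hotels may differ between days).
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    if exact:
        pairs_of_people = comb(people, 2)
        pairs_of_days = comb(days, 2)
    else:
        # Approximation used in the text: C(n, 2) is about n^2 / 2 for large n.
        pairs_of_people = people ** 2 / 2
        pairs_of_days = days ** 2 / 2
    return pairs_of_people * pairs_of_days * p_same_hotel_two_days


approx = expected_bogus_events(10**9, 1000, 0.01, 10**5)
exact = expected_bogus_events(10**9, 1000, 0.01, 10**5, exact=True)
print(f"approximate: {approx:,.0f}  exact: {exact:,.0f}")
```

Either way the count is about a quarter of a million, which dwarfs the 10 real pairs assumed above. That gap is what Bonferroni's principle warns about.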

1.2.4 Exercises for Section 1.2

Exercise 1.2.1 : Using the information from Section 1.2.3, what would be the

number of suspected pairs if the following changes were made to the data (and

all other numbers remained as they were in that section)?

a) The number of days of observation was raised to 2000.

b) The number of people observed was raised to 2 billion (and there were

therefore 200,000 hotels).

c) We only reported a pair as suspect if they were at the same hotel at the same time on three different days.

! Exercise 1.2.2: Suppose we have information about the supermarket purchases of 100 million people. Each person goes to the supermarket 100 times in a year and buys 10 of the 1000 items that the supermarket sells. We believe that a pair of terrorists will buy exactly the same set of 10 items (perhaps the ingredients for a bomb?) at some time during the year. If we search for pairs of people who have bought the same set of items, would we expect that any such people found were truly terrorists?

1.3 Things Useful to Know

In this section, we offer brief introductions to subjects that you may or may not have seen in your study of other courses. Each will be useful in the study of data mining. They include:

1. The TF.IDF measure of word importance.

2. Hash functions and their use.

3. Secondary storage (disk) and its effect on running time of algorithms.

4. The base e of natural logarithms and identities involving that constant.

5. Power laws.

1.3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic. For instance, articles about baseball would tend to have many occurrences of words like “ball,” “bat,” “pitch,” “run,” and so on. Once we have classified documents to determine they are about baseball, it is not hard to notice that words such as these appear unusually frequently. However, until we have made the classification, it is not possible to identify these words as characteristic.

Footnote to Exercise 1.2.2: That is, assume our hypothesis that terrorists will surely buy a set of 10 items in common at some time during the year. We don’t want to address the matter of whether or not terrorists would necessarily do so.

Thus, classification often starts by looking at documents, and finding the significant words in those documents. Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly opposite of the truth. The most frequent words will most surely be the common words such as “the” or “and,” which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them.

In fact, the indicators of the topic are relatively rare words. However, not all rare words are equally useful as indicators. There are certain words, for example “notwithstanding” or “albeit,” that appear rarely in a collection of documents, yet do not tell us anything useful. On the other hand, a word like “chukker” is probably equally rare, but tips us off that the document is about the sport of polo. The difference between rare words that tell us something and those that do not has to do with the concentration of the useful words in just a few documents. That is, the presence of a word like “albeit” in a document does not make it terribly more likely that it will appear multiple times. However, if an article mentions “chukker” once, it is likely to tell us what happened in the “first chukker,” then the “second chukker,” and so on. That is, the word is likely to be repeated if it appears at all.

The formal measure of how concentrated into relatively few documents are the occurrences of a given word is called TF.IDF (Term Frequency times Inverse Document Frequency). It is normally computed as follows. Suppose we have a collection of $N$ documents. Define $f_{ij}$ to be the frequency (number of occurrences) of term (word) $i$ in document $j$. Then, define the term frequency $TF_{ij}$ to be:

$$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

That is, the term frequency of term $i$ in document $j$ is $f_{ij}$ normalized by dividing it by the maximum number of occurrences of any term (perhaps excluding stop words) in the same document. Thus, the most frequent term in document $j$ gets a TF of 1, and other terms get fractions as their term frequency for this document.

The IDF for a term is defined as follows. Suppose term $i$ appears in $n_i$ of the $N$ documents in the collection. Then $IDF_i = \log_2(N/n_i)$. The TF.IDF score for term $i$ in document $j$ is then defined to be $TF_{ij} \times IDF_i$. The terms with the highest TF.IDF score are often the terms that best characterize the topic of the document.
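The definitions above translate fairly directly into code. The following is a minimal sketch, assuming documents are already split into lists of lowercase words, that the logarithm is base 2, and that no stop words are removed; the function name `tf_idf` and the four tiny sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return, for each document, a dict mapping term -> TF.IDF score.

    `documents` is a list of token lists. TF is the term count divided by the
    count of the most frequent term in the same document; IDF is
    log2(N / n_i), where n_i is the number of documents containing the term.
    """
    n_docs = len(documents)
    # n_i: in how many documents each term appears (counted once per document).
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        scores.append({
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical four-document collection.
docs = [
    "the first chukker and the second chukker".split(),
    "the pitch and the bat".split(),
    "the ball and the run".split(),
    "the market and the basket".split(),
]
scores = tf_idf(docs)
# Terms of the first document, highest TF.IDF first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)
print(top_terms)
```

In this toy collection, “the” and “and” occur in every document, so their IDF is 0 and they score 0 despite being frequent, while “chukker” is both repeated within its document and absent elsewhere, so it scores highest there.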

Example 1.3: Suppose our repository consists of $2^{20}$ = 1,048,576 documents. Suppose word $w$ appears in $2^{10}$ = 1024 of these documents. Then IDF

