"Modeling the Internet and the Web: Probabilistic Methods and Algorithms" 是一本由 Pierre Baldi, Paolo Frasconi 和 Padhraic Smyth 合著的书籍,该书探讨了如何使用概率方法和算法来理解和分析互联网与网页的复杂性。 这本书的核心在于强调在面对互联网和网络数据的不确定性、不完整性时,概率论和统计学的重要性。由于互联网是一个巨大的分布式、去中心化、自我组织且不断演化的系统,因此它产生的数据往往带有噪声和不确定性。作者指出,概率方法不仅有助于处理这些噪声测量问题,而且许多底层现象——如互联网和网页的动态演变——本质上就是概率性的。这与统计力学中的系统相似,大量微小因素的随机互动可能产生规律性模式,而这些模式只能通过概率方法来捕捉。 此外,互联网作为一个高维度系统,其很多变量是无法完全测量的,大部分变量处于隐藏状态,需要通过概率方法进行推断和“因子分解”。在信息检索、机器翻译等应用中,概率方法已经取得了显著成就,并被广泛使用。例如,我们日常使用的搜索引擎就基于这些方法,尽管它们并不完全理解语义,但能有效地处理和利用信息。 书中详细阐述了概率方法如何应用于互联网和网页建模分析的各个领域,包括网络流量分析、图形结构研究、信息检索引擎的设计以及用户行为的建模。通过对大量数据集的统计分析,可以提取出这些领域的规律性,进一步推动相关技术的发展。 这本书为理解互联网和网页的复杂动态提供了概率论和统计学的理论基础,对于研究者和从业人员来说,是一本深入了解这一领域不可或缺的参考资料。
PREFACE
NASA and the Jet Propulsion Laboratory, the Department of Energy and Lawrence
Livermore National Laboratory, IBM Research, Microsoft Research, Sun Microsys-
tems, HNC Software, the University of California MICRO Program, and a Laurel
Wilkening Faculty Innovation Award. Part of the book was written while P.F. was
visiting the School of Information and Computer Science (ICS) at UCI, with partial
funding provided by the University of California. We also would like to acknowledge
the general support we have received from the Institute for Genomics and Bioinfor-
matics (IGB) at UCI and the California Institute for Telecommunications and Infor-
mation Technology (Cal(IT)²). Within IGB, special thanks go to the staff, in particular
Suzanne Knight, Janet Ko, Michele McCrea, and Ann Marie Walker. We would like
to acknowledge feedback and general support from many members of Cal(IT)² and
thank in particular its directors Bill Parker, Larry Smarr, Peter Rentzepis, and Ramesh
Rao, and staff members Catherine Hammond, Ashley Larsen, Doug Ramsey, Stuart
Ross, and Stephanie Sides. We thank a number of colleagues for discussions and
feedback on various aspects of the Web and probabilistic modeling: David Eppstein,
Chen Li, Sharad Mehrotra, and Mike Pazzani at UC Irvine, as well as Albert-László
Barabási, Nicolò Cesa-Bianchi, Monica Bianchini, C. Lee Giles, Marco Gori, David
Heckerman, David Madigan, Marco Maggini, Heikki Mannila, Chris Meek, Amnon
Meyers, Ion Muslea, Franco Scarselli, Giovanni Soda, and Steven Scott. We thank all
of the people who have helped with simulations or provided feedback on the various
versions of this manuscript, especially our students Gianluca Pollastri, Alessandro
Vullo, Igor Cadez, Jianlin Chen, Xianping Ge, Joshua O’Madadhain, Scott White,
Alessio Ceroni, Fabrizio Costa, Michelangelo Diligenti, Sauro Menchetti, and Andrea
Passerini. We also acknowledge Jean-Pierre Nadal, who brought to our attention the
Fermi model of power laws. We thank Xinglian Yie and David O’Hallaron for pro-
viding their data on search engine queries in Chapter 8. We also thank the staff from
John Wiley & Sons, Ltd, in particular Senior Editor Sian Jones and Robert Calver,
and Emma Dain at T&T Productions Ltd. Finally, we acknowledge our families and
friends for their support in the writing of this book.
Pierre Baldi, Paolo Frasconi and Padhraic Smyth
October 2002, Irvine, CA
1 Mathematical Background
In this chapter we review a number of basic concepts in probabilistic modeling and
data analysis that are used throughout the book, including parameter estimation, mix-
ture models, graphical models, classification, clustering, and power-law distributions.
Each of these topics is worthy of an entire chapter (or even a whole book) by itself, so
our treatment is necessarily brief. Nonetheless, the goal of this chapter is to provide
the introductory foundations for models and techniques that will be widely used in
the remainder of the book. Readers who are already familiar with these concepts, or
who want to avoid mathematical details during a first reading, can safely skip to the
next chapter. More specific technical mathematical concepts or mathematical com-
plements that concern only a specific chapter rather than the entire book are given in
Appendix A.
1.1 Probability and Learning from a Bayesian Perspective
Throughout this book we will make frequent use of probabilistic models to charac-
terize various phenomena related to the Web. Both theory and experience have shown
that probability is by far the most useful framework currently available for modeling
uncertainty. Probability allows us to reason in a coherent manner about events and
make inferences about such events given observed data. More specifically, an event e
is a proposition or statement about the world at large. For example, let e be the propo-
sition that ‘the number of Web pages in existence on 1 January 2003 was greater than
five billion’. A well-defined proposition e is either true or false – by some reasonable
definition of what constitutes a Web page, the total number that existed in January
2003 was either greater than five billion or not. There is, however, considerable uncer-
tainty about what this number was back in January 2003 since, as we will discuss later
in Chapters 2 and 3, accurately estimating the size of the Web is a quite challenging
problem. Consequently there is uncertainty about whether the proposition e is true or
not.
A probability, P(e), can be viewed as a number that reflects our uncertainty about
whether e is true or false in the real world, given whatever information we have avail-
able. This is known as the ‘degree of belief’ or Bayesian interpretation of probability
(see, for instance, Berger 1985; Box and Tiao 1992; Cox 1964; Gelman et al. 1995;
Jaynes 2003) and is the one that we will use by default throughout this text. In fact,
to be more precise, we should use a conditional probability P(e | I) in general to
represent degree of belief, where I is the background information on which our belief
is based. For simplicity of notation we will often omit this conditioning on I, but it
may be useful to keep in mind that everywhere you see a P(e) for some proposition
e, there is usually some implicit background information I that is known or assumed
to be true.
The Bayesian interpretation of a probability P(e) is a generalization of the more
classic interpretation of a probability as the relative frequency of successes to total
trials, estimated over an infinite number of hypothetical repeated trials (the so-called
‘frequentist’ interpretation). The Bayesian interpretation is more useful in general,
since it allows us to make statements about propositions such as ‘the number of Web
pages in existence’ where a repeated trials interpretation would not necessarily apply.
It can be shown that, under a small set of reasonable axioms, degrees of belief can
be represented by real numbers and that when rescaled to the [0, 1] interval these
degrees of confidence must obey the rules of probability and, in particular, Bayes’
theorem (Cox 1964; Jaynes 1986, 2003; Savage 1972). This is reassuring, since it
means that the standard rules of probability still apply whether we are using the degree
of belief interpretation or the frequentist interpretation. In other words, the rules for
manipulating probabilities such as conditioning or the law of total probability remain
the same no matter what semantics we attach to the probabilities.
The Bayesian approach also allows us to think about probability as being a dynamic
entity that is updated as more data arrive – as we receive more data we may naturally
change our degree of belief in certain propositions given these new data. Thus, for
example, we will frequently refer to terms such as P(e | D) where D is some data.
In fact, by Bayes’ theorem,
P(e | D) = P(D | e)P(e) / P(D). (1.1)
The interpretation of each of the terms in this equation is worth discussing. P(e) is your
belief in the event e before you see any data at all, referred to as your prior probability
for e or prior degree of belief in e. For example, letting e again be the statement that
‘the number of Web pages in existence on 1 January 2003 was greater than five billion’,
P(e) reflects your degree of belief that this statement is true. Suppose you now receive
some data D which is the number of pages indexed by various search engines as of
1 January 2003. To a reasonable approximation we can view these numbers as lower
bounds on the true number and let’s say for the sake of argument that all the numbers
are considerably less than five billion. P(e | D) now reflects your updated posterior
belief in e given the observed data and it can be calculated by using Bayes' theorem via
Equation (1.1). The right-hand side of Equation (1.1) includes the prior, so naturally
enough the posterior is proportional to the prior.
The right-hand side also includes P(D | e), which is known as the likelihood of the
data, i.e. the probability of the data under the assumption that e is true. To calculate the
likelihood we must have a probabilistic model that connects the proposition e we are
interested in with the observed data D – this is the essence of probabilistic learning.
For our Web page example, this could be a model that puts a probability distribution
on the number of Web pages that each search engine may find if the conditioning event
is true, i.e. if there are in fact more than five billion Web pages in existence. This could
be a complex model of how the search engines actually work, taking into account all
the various reasons that many pages will not be found, or it might be a very simple
approximate model that says that each search engine has some conditional distribution
on the number of pages that will be indexed, as a function of the total number that
exist. Appendix A provides examples of several standard probability models – these
are in essence the ‘building blocks’ for probabilistic modeling and can be used as
components either in likelihood models P(D | e) or as priors P(e).
Continuing with Equation (1.1), the likelihood expression reflects how likely the
observed data are, given e and given some model connecting e and the data. If P(D | e)
is very low, this means that the model is assigning a low probability to the observed
data. This might happen, for example, if the search engines hypothetically all reported
numbers of indexed pages in the range of a few million rather than in the billion range.
Of course we have to factor in the alternative hypothesis, ē, here and we must ensure
that both P(e) + P(ē) = 1 and P(e | D) + P(ē | D) = 1 to satisfy the basic axioms
of probability. The 'normalization' constant in the denominator of Equation (1.1) can
be calculated by noting that P(D) = P(D | e)P(e) + P(D | ē)P(ē). It is easy to see
that P(e | D) depends on both the prior and the likelihood for e in terms of how they 'compete'
with the alternative hypothesis ē – the larger they are relative to the prior for ē and
the likelihood for ē, the larger our posterior belief in e will be.
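To make Equation (1.1) concrete, here is a minimal sketch in Python of the two-hypothesis calculation; the prior and the two likelihood values are purely hypothetical illustrations, not figures taken from the book.

```python
# A minimal sketch of Equation (1.1) for the proposition e and its complement ē.
# All numbers below (the prior and the two likelihood values) are hypothetical.

def posterior(prior_e, lik_e, lik_not_e):
    """Return P(e | D) given P(e), P(D | e) and P(D | ē)."""
    prior_not_e = 1.0 - prior_e                          # P(ē) = 1 - P(e)
    p_data = lik_e * prior_e + lik_not_e * prior_not_e   # P(D) by total probability
    return lik_e * prior_e / p_data                      # Bayes' theorem

# Prior belief that the Web exceeded five billion pages on 1 January 2003.
p_e = 0.5
# Probability of the observed index counts under each hypothesis (these would
# come from a model of how search engines cover the Web).
p_d_given_e = 0.02        # the reported counts are unlikely if e is true
p_d_given_not_e = 0.10    # and somewhat more likely if e is false

print(posterior(p_e, p_d_given_e, p_d_given_not_e))      # ~0.167: belief in e drops
```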
Because probabilities can be very small quantities and addition is often easier to
work with than multiplication, it is common to take logarithms of both sides, so that
log P(e | D) = log P(D | e) + log P(e) − log P(D). (1.2)
To apply Equation (1.1) or (1.2) to any class of models, we only need to specify the
prior P(e) and the data likelihood P(D | e).
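Since likelihood values can be extremely small, Equation (1.2) is typically evaluated entirely on the log scale. The sketch below reuses the hypothetical numbers from the previous example and computes log P(D) with the standard log-sum-exp rescaling to avoid underflow.

```python
import math

def log_posterior(log_prior_e, log_lik_e, log_prior_not_e, log_lik_not_e):
    """Equation (1.2): log P(e | D) = log P(D | e) + log P(e) - log P(D)."""
    a = log_lik_e + log_prior_e            # log[P(D | e) P(e)]
    b = log_lik_not_e + log_prior_not_e    # log[P(D | ē) P(ē)]
    m = max(a, b)                          # log-sum-exp trick for log P(D)
    log_p_data = m + math.log(math.exp(a - m) + math.exp(b - m))
    return a - log_p_data

# The same hypothetical numbers as above, expressed on the log scale.
lp = log_posterior(math.log(0.5), math.log(0.02), math.log(0.5), math.log(0.10))
print(math.exp(lp))                        # ~0.167, matching the direct calculation
```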
Having updated our degree of belief in e, from P(e) to P(e | D), we can continue
this process and incorporate more data as they become available. For example, we
might later obtain more data on the size of the Web from a different study – call this
second data set D2. We can use Bayes' rule to write
P(e | D, D2) = P(D2 | e, D)P(e | D) / P(D2 | D). (1.3)
Comparing Equations (1.3) and (1.1) we see that the old posterior P(e | D) plays the
role of the new prior when data set D2 arrives.
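As an illustration of this sequential updating, the sketch below reuses the hypothetical two-hypothesis setting from above and, purely for simplicity, assumes the two studies are conditionally independent given e, so that P(D2 | e, D) = P(D2 | e); Equation (1.3) itself does not require that assumption.

```python
def update(prior_e, lik_e, lik_not_e):
    """One application of Bayes' theorem; returns the posterior P(e | data)."""
    evidence = lik_e * prior_e + lik_not_e * (1.0 - prior_e)
    return lik_e * prior_e / evidence

belief = 0.5                           # prior P(e)
belief = update(belief, 0.02, 0.10)    # after the first study D:  ~0.167
belief = update(belief, 0.30, 0.05)    # after D2: the old posterior acts as the new prior
print(belief)                          # ~0.545 with these hypothetical likelihoods
```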
The use of priors is a strength of the Bayesian approach, since it allows the incor-
poration of prior knowledge and constraints into the modeling process. In general,
the effects of priors diminish as the number of data points increases. Formally, this is
because the log-likelihood log P(D | e) typically increases linearly with the number
of data points in D, while the prior log P(e) remains constant. Finally, and most impor-
tantly, the effects of different priors, as well as different models and model classes,
can be assessed within the Bayesian framework by comparing the corresponding
probabilities.
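A standard way to see the diminishing effect of the prior numerically is the textbook Beta–Bernoulli setting (chosen here for illustration, not taken from this book): two analysts start with sharply different priors over a coin's heads probability, and their posterior means converge as the number of flips grows.

```python
def beta_posterior_mean(a, b, heads, tails):
    """Posterior mean of a Beta(a, b) prior updated with Bernoulli observations."""
    return (a + heads) / (a + b + heads + tails)

for n in (10, 100, 10_000):
    heads = int(0.7 * n)                                  # pretend 70% of flips are heads
    tails = n - heads
    skeptic = beta_posterior_mean(1, 9, heads, tails)     # prior mean 0.1
    optimist = beta_posterior_mean(9, 1, heads, tails)    # prior mean 0.9
    print(n, round(skeptic, 3), round(optimist, 3))       # the two estimates converge
```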
The computation of the likelihood is of course model dependent and is not addressed
here in its full generality. Later in the chapter we will briefly look at a variety of
graphical model and mixture model techniques that act as ‘components’ in a ‘flexible
toolbox’ for the construction of different types of likelihood functions.
1.2 Parameter Estimation from Data
1.2.1 Basic principles
We can now turn to the main type of inference that will be used throughout this book,
namely estimation of parameters θ under the assumption of a particular functional
form for a model M. For example, if our model M is the Gaussian distribution, then
we have two parameters: the mean and standard deviation of this Gaussian.
In what follows below we will refer to priors, likelihoods, and posteriors relating
to sets of parameters θ. Typically our parameters θ are a set of real-valued numbers.
Thus, both the prior P(θ) and the posterior P(θ | D) define probability density
functions over this set of real-valued numbers. For example, our prior density might
assert that a particular parameter (such as the standard deviation) must be positive, or
that the mean is likely to lie within a certain range on the real line. From a Bayesian
viewpoint our prior and posterior reflect our degree of belief in terms of where θ lies,
given the relevant evidence. For example, for a single parameter θ, P(θ > 0 | D) is
the posterior probability that θ is greater than zero, given the evidence provided by the
data. For simplicity of notation we will not always explicitly include the conditioning
term for the model M, P(θ | M) or P(θ | D, M), but instead write expressions such
as P(θ) which implicitly assume the presence of a model M with some particular
functional form.
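As a small illustration of such posterior statements, the sketch below approximates the posterior density of a single mean parameter θ on a grid, assuming a Gaussian likelihood with known standard deviation and a Gaussian prior on the mean; the data values and prior settings are invented for the example.

```python
import math

# Hypothetical observations and prior settings for a single mean parameter θ.
data = [0.8, -0.3, 1.2, 0.5]
grid = [i / 100.0 for i in range(-500, 501)]              # candidate values of θ in [-5, 5]

def log_gauss(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Unnormalized log posterior on the grid: log P(D | θ) + log P(θ),
# with a Gaussian likelihood (known σ = 1) and a N(0, 2²) prior on θ.
log_post = [sum(log_gauss(x, t, 1.0) for x in data) + log_gauss(t, 0.0, 2.0)
            for t in grid]
m = max(log_post)
weights = [math.exp(lp - m) for lp in log_post]           # rescale before exponentiating
total = sum(weights)
post = [w / total for w in weights]                       # normalized over the grid

# Posterior probability that θ is greater than zero, P(θ > 0 | D).
print(sum(p for t, p in zip(grid, post) if t > 0))        # ~0.86 for these made-up data
```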
The general objective of parameter estimation is to find or approximate the ‘best’
set of parameters for a model – that is, to find the set of parameters θ maximizing the
posterior P(θ | D), or log P(θ | D). This is called maximum a posteriori (MAP)
estimation.
In order to deal with positive quantities, we can minimize −log P(θ | D):
E(θ) = −log P(θ | D) = −log P(D | θ) − log P(θ) + log P(D). (1.4)
From an optimization standpoint, the logarithm of the prior P(θ) plays the role of a
regularizer, that is, of an additional penalty term that can be used to enforce additional
constraints on the parameters.
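The sketch below illustrates MAP estimation in the sense of Equation (1.4) for a simple Bernoulli model with a Beta prior (a standard example chosen for illustration, not one from the book); the −log P(θ) term acts as the penalty just described, and for this model the minimizer also has a well-known closed form.

```python
import math

def neg_log_posterior(theta, heads, tails, a, b):
    """E(θ) up to additive constants: -log P(D | θ) - log P(θ)."""
    log_lik = heads * math.log(theta) + tails * math.log(1.0 - theta)
    log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    return -(log_lik + log_prior)

heads, tails, a, b = 3, 1, 2.0, 2.0       # hypothetical data and a Beta(2, 2) prior
# For this model the minimizer has the closed form (heads + a - 1)/(N + a + b - 2).
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)
# A crude grid search over (0, 1) recovers the same value.
grid_min = min((i / 1000.0 for i in range(1, 1000)),
               key=lambda t: neg_log_posterior(t, heads, tails, a, b))
print(theta_map, round(grid_min, 3))      # both ~0.667
```

Larger values of a and b would pull the estimate more strongly toward the prior mean, which is exactly the regularizing effect of the −log P(θ) term in Equation (1.4).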