"Modeling the Internet and the Web: Probabilistic Methods and Algorithms" 是一本由 Pierre Baldi, Paolo Frasconi 和 Padhraic Smyth 合著的书籍,该书探讨了如何使用概率方法和算法来理解和分析互联网与网页的复杂性。 这本书的核心在于强调在面对互联网和网络数据的不确定性、不完整性时,概率论和统计学的重要性。由于互联网是一个巨大的分布式、去中心化、自我组织且不断演化的系统,因此它产生的数据往往带有噪声和不确定性。作者指出,概率方法不仅有助于处理这些噪声测量问题,而且许多底层现象——如互联网和网页的动态演变——本质上就是概率性的。这与统计力学中的系统相似,大量微小因素的随机互动可能产生规律性模式,而这些模式只能通过概率方法来捕捉。 此外,互联网作为一个高维度系统,其很多变量是无法完全测量的,大部分变量处于隐藏状态,需要通过概率方法进行推断和“因子分解”。在信息检索、机器翻译等应用中,概率方法已经取得了显著成就,并被广泛使用。例如,我们日常使用的搜索引擎就基于这些方法,尽管它们并不完全理解语义,但能有效地处理和利用信息。 书中详细阐述了概率方法如何应用于互联网和网页建模分析的各个领域,包括网络流量分析、图形结构研究、信息检索引擎的设计以及用户行为的建模。通过对大量数据集的统计分析,可以提取出这些领域的规律性,进一步推动相关技术的发展。 这本书为理解互联网和网页的复杂动态提供了概率论和统计学的理论基础,对于研究者和从业人员来说,是一本深入了解这一领域不可或缺的参考资料。
PREFACE
NASA and the Jet Propulsion Laboratory, the Department of Energy and Lawrence
Livermore National Laboratory, IBM Research, Microsoft Research, Sun Microsys-
tems, HNC Software, the University of California MICRO Program, and a Laurel
Wilkening Faculty Innovation Award. Part of the book was written while P.F. was
visiting the School of Information and Computer Science (ICS) at UCI, with partial
funding provided by the University of California. We also would like to acknowledge
the general support we have received from the Institute for Genomics and Bioinfor-
matics (IGB) at UCI and the California Institute for Telecommunications and Infor-
mation Technology (Cal(IT)²). Within IGB, special thanks go to the staff, in particular
Suzanne Knight, Janet Ko, Michele McCrea, and Ann Marie Walker. We would like
to acknowledge feedback and general support from many members of Cal(IT)² and
thank in particular its directors Bill Parker, Larry Smarr, Peter Rentzepis, and Ramesh
Rao, and staff members Catherine Hammond, Ashley Larsen, Doug Ramsey, Stuart
Ross, and Stephanie Sides. We thank a number of colleagues for discussions and
feedback on various aspects of the Web and probabilistic modeling: David Eppstein,
Chen Li, Sharad Mehrotra, and Mike Pazzani at UC Irvine, as well as Albert-László
Barabási, Nicolò Cesa-Bianchi, Monica Bianchini, C. Lee Giles, Marco Gori, David
Heckerman, David Madigan, Marco Maggini, Heikki Mannila, Chris Meek, Amnon
Meyers, Ion Muslea, Franco Scarselli, Giovanni Soda, and Steven Scott. We thank all
of the people who have helped with simulations or provided feedback on the various
versions of this manuscript, especially our students Gianluca Pollastri, Alessandro
Vullo, Igor Cadez, Jianlin Chen, Xianping Ge, Joshua O’Madadhain, Scott White,
Alessio Ceroni, Fabrizio Costa, Michelangelo Diligenti, Sauro Menchetti, and Andrea
Passerini. We also acknowledge Jean-Pierre Nadal, who brought to our attention the
Fermi model of power laws. We thank Xinglian Yie and David O’Hallaron for pro-
viding their data on search engine queries in Chapter 8. We also thank the staff from
John Wiley & Sons, Ltd, in particular Senior Editor Sian Jones and Robert Calver,
and Emma Dain at T&T Productions Ltd. Finally, we acknowledge our families and
friends for their support in the writing of this book.
Pierre Baldi, Paolo Frasconi and Padhraic Smyth
October 2002, Irvine, CA
1 Mathematical Background
In this chapter we review a number of basic concepts in probabilistic modeling and
data analysis that are used throughout the book, including parameter estimation, mix-
ture models, graphical models, classification, clustering, and power-law distributions.
Each of these topics is worthy of an entire chapter (or even a whole book) by itself, so
our treatment is necessarily brief. Nonetheless, the goal of this chapter is to provide
the introductory foundations for models and techniques that will be widely used in
the remainder of the book. Readers who are already familiar with these concepts, or
who want to avoid mathematical details during a first reading, can safely skip to the
next chapter. More specific technical mathematical concepts or mathematical com-
plements that concern only a specific chapter rather than the entire book are given in
Appendix A.
1.1 Probability and Learning from a Bayesian Perspective
Throughout this book we will make frequent use of probabilistic models to charac-
terize various phenomena related to the Web. Both theory and experience have shown
that probability is by far the most useful framework currently available for modeling
uncertainty. Probability allows us to reason in a coherent manner about events and
make inferences about such events given observed data. More specifically, an event e
is a proposition or statement about the world at large. For example, let e be the propo-
sition that ‘the number of Web pages in existence on 1 January 2003 was greater than
five billion’. A well-defined proposition e is either true or false – by some reasonable
definition of what constitutes a Web page, the total number that existed in January
2003 was either greater than five billion or not. There is, however, considerable uncer-
tainty about what this number was back in January 2003 since, as we will discuss later
in Chapters 2 and 3, accurately estimating the size of the Web is a quite challenging
problem. Consequently there is uncertainty about whether the proposition e is true or
not.
A probability, P(e), can be viewed as a number that reflects our uncertainty about
whether e is true or false in the real world, given whatever information we have avail-
able. This is known as the ‘degree of belief’ or Bayesian interpretation of probability
(see, for instance, Berger 1985; Box and Tiao 1992; Cox 1964; Gelman et al. 1995;
Jaynes 2003) and is the one that we will use by default throughout this text. In fact,
to be more precise, we should use a conditional probability P(e | I) in general to
represent degree of belief, where I is the background information on which our belief
is based. For simplicity of notation we will often omit this conditioning on I, but it
may be useful to keep in mind that everywhere you see a P(e) for some proposition
e, there is usually some implicit background information I that is known or assumed
to be true.
The Bayesian interpretation of a probability P(e) is a generalization of the more
classic interpretation of a probability as the relative frequency of successes to total
trials, estimated over an infinite number of hypothetical repeated trials (the so-called
‘frequentist’ interpretation). The Bayesian interpretation is more useful in general,
since it allows us to make statements about propositions such as ‘the number of Web
pages in existence’ where a repeated trials interpretation would not necessarily apply.
It can be shown that, under a small set of reasonable axioms, degrees of belief can
be represented by real numbers and that when rescaled to the [0, 1] interval these
degrees of confidence must obey the rules of probability and, in particular, Bayes’
theorem (Cox 1964; Jaynes 1986, 2003; Savage 1972). This is reassuring, since it
means that the standard rules of probability still apply whether we are using the degree
of belief interpretation or the frequentist interpretation. In other words, the rules for
manipulating probabilities such as conditioning or the law of total probability remain
the same no matter what semantics we attach to the probabilities.
The Bayesian approach also allows us to think about probability as being a dynamic
entity that is updated as more data arrive – as we receive more data we may naturally
change our degree of belief in certain propositions given these new data. Thus, for
example, we will frequently refer to terms such as P(e | D) where D is some data.
In fact, by Bayes’ theorem,
P(e | D) = P(D | e)P(e) / P(D). (1.1)
The interpretation of each of the terms in this equation is worth discussing. P(e) is your
belief in the event e before you see any data at all, referred to as your prior probability
for e or prior degree of belief in e. For example, letting e again be the statement that
‘the number of Web pages in existence on 1 January 2003 was greater than five billion’,
P(e) reflects your degree of belief that this statement is true. Suppose you now receive
some data D which is the number of pages indexed by various search engines as of
1 January 2003. To a reasonable approximation we can view these numbers as lower
bounds on the true number and let’s say for the sake of argument that all the numbers
are considerably less than five billion. P(e | D) now reflects your updated posterior
belief in e given the observed data and it can be calculated by using Bayes' theorem via
Equation (1.1). The right-hand side of Equation (1.1) includes the prior, so naturally
enough the posterior is proportional to the prior.
The right-hand side also includes P(D | e), which is known as the likelihood of the
data, i.e. the probability of the data under the assumption that e is true. To calculate the
likelihood we must have a probabilistic model that connects the proposition e we are
interested in with the observed data D – this is the essence of probabilistic learning.
For our Web page example, this could be a model that puts a probability distribution
on the number of Web pages that each search engine may find if the conditioning event
is true, i.e. if there are in fact more than five billion Web pages in existence. This could
be a complex model of how the search engines actually work, taking into account all
the various reasons that many pages will not be found, or it might be a very simple
approximate model that says that each search engine has some conditional distribution
on the number of pages that will be indexed, as a function of the total number that
exist. Appendix A provides examples of several standard probability models – these
are in essence the ‘building blocks’ for probabilistic modeling and can be used as
components either in likelihood models P(D | e) or as priors P(e).
Continuing with Equation (1.1), the likelihood expression reflects how likely the
observed data are, given e and given some model connecting e and the data. If P(D | e)
is very low, this means that the model is assigning a low probability to the observed
data. This might happen, for example, if the search engines hypothetically all reported
numbers of indexed pages in the range of a few million rather than in the billion range.
Of course we have to factor in the alternative hypothesis, ē, here and we must ensure
that both P(e) + P(ē) = 1 and P(e | D) + P(ē | D) = 1 to satisfy the basic axioms
of probability. The 'normalization' constant in the denominator of Equation (1.1) can
be calculated by noting that P(D) = P(D | e)P(e) + P(D | ē)P(ē). It is easy to see
that P(e | D) depends on both the prior and the likelihood for e in terms of how they 'compete'
with the alternative hypothesis ē – the larger they are relative to the prior for ē and
the likelihood for ē, the larger our posterior belief in e will be.
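To make Equation (1.1) concrete, here is a minimal sketch in Python of the two-hypothesis calculation; the prior and the two likelihood values are purely hypothetical illustrations, not figures taken from the book.

```python
# A minimal sketch of Equation (1.1) for the proposition e and its complement ē.
# All numbers below (the prior and the two likelihood values) are hypothetical.

def posterior(prior_e, lik_e, lik_not_e):
    """Return P(e | D) given P(e), P(D | e) and P(D | ē)."""
    prior_not_e = 1.0 - prior_e                          # P(ē) = 1 - P(e)
    p_data = lik_e * prior_e + lik_not_e * prior_not_e   # P(D) by total probability
    return lik_e * prior_e / p_data                      # Bayes' theorem

# Prior belief that the Web exceeded five billion pages on 1 January 2003.
p_e = 0.5
# Probability of the observed index counts under each hypothesis (these would
# come from a model of how search engines cover the Web).
p_d_given_e = 0.02        # the reported counts are unlikely if e is true
p_d_given_not_e = 0.10    # and somewhat more likely if e is false

print(posterior(p_e, p_d_given_e, p_d_given_not_e))      # ~0.167: belief in e drops
```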
Because probabilities can be very small quantities and addition is often easier to
work with than multiplication, it is common to take logarithms of both sides, so that
log P(e | D) = log P(D | e) + log P(e) − log P(D). (1.2)
To apply Equation (1.1) or (1.2) to any class of models, we only need to specify the
prior P(e) and the data likelihood P(D | e).
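Since likelihood values can be extremely small, Equation (1.2) is typically evaluated entirely on the log scale. The sketch below reuses the hypothetical numbers from the previous example and computes log P(D) with the standard log-sum-exp rescaling to avoid underflow.

```python
import math

def log_posterior(log_prior_e, log_lik_e, log_prior_not_e, log_lik_not_e):
    """Equation (1.2): log P(e | D) = log P(D | e) + log P(e) - log P(D)."""
    a = log_lik_e + log_prior_e            # log[P(D | e) P(e)]
    b = log_lik_not_e + log_prior_not_e    # log[P(D | ē) P(ē)]
    m = max(a, b)                          # log-sum-exp trick for log P(D)
    log_p_data = m + math.log(math.exp(a - m) + math.exp(b - m))
    return a - log_p_data

# The same hypothetical numbers as above, expressed on the log scale.
lp = log_posterior(math.log(0.5), math.log(0.02), math.log(0.5), math.log(0.10))
print(math.exp(lp))                        # ~0.167, matching the direct calculation
```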
Having updated our degree of belief in e, from P(e) to P(e | D), we can continue
this process and incorporate more data as they become available. For example, we
might later obtain more data on the size of the Web from a different study – call this
second data set D2. We can use Bayes' rule to write
P(e | D, D2) = P(D2 | e, D)P(e | D) / P(D2 | D). (1.3)
Comparing Equations (1.3) and (1.1) we see that the old posterior P(e | D) plays the
role of the new prior when data set D2 arrives.
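As an illustration of this sequential updating, the sketch below reuses the hypothetical two-hypothesis setting from above and, purely for simplicity, assumes the two studies are conditionally independent given e, so that P(D2 | e, D) = P(D2 | e); Equation (1.3) itself does not require that assumption.

```python
def update(prior_e, lik_e, lik_not_e):
    """One application of Bayes' theorem; returns the posterior P(e | data)."""
    evidence = lik_e * prior_e + lik_not_e * (1.0 - prior_e)
    return lik_e * prior_e / evidence

belief = 0.5                           # prior P(e)
belief = update(belief, 0.02, 0.10)    # after the first study D:  ~0.167
belief = update(belief, 0.30, 0.05)    # after D2: the old posterior acts as the new prior
print(belief)                          # ~0.545 with these hypothetical likelihoods
```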
The use of priors is a strength of the Bayesian approach, since it allows the incor-
poration of prior knowledge and constraints into the modeling process. In general,
the effects of priors diminish as the number of data points increases. Formally, this is
because the log-likelihood log P(D | e) typically increases linearly with the number
of data points in D, while the prior log P(e) remains constant. Finally, and most impor-
tantly, the effects of different priors, as well as different models and model classes,
can be assessed within the Bayesian framework by comparing the corresponding
probabilities.
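A standard way to see the diminishing effect of the prior numerically is the textbook Beta–Bernoulli setting (chosen here for illustration, not taken from this book): two analysts start with sharply different priors over a coin's heads probability, and their posterior means converge as the number of flips grows.

```python
def beta_posterior_mean(a, b, heads, tails):
    """Posterior mean of a Beta(a, b) prior updated with Bernoulli observations."""
    return (a + heads) / (a + b + heads + tails)

for n in (10, 100, 10_000):
    heads = int(0.7 * n)                                  # pretend 70% of flips are heads
    tails = n - heads
    skeptic = beta_posterior_mean(1, 9, heads, tails)     # prior mean 0.1
    optimist = beta_posterior_mean(9, 1, heads, tails)    # prior mean 0.9
    print(n, round(skeptic, 3), round(optimist, 3))       # the two estimates converge
```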
The computation of the likelihood is of course model dependent and is not addressed
here in its full generality. Later in the chapter we will briefly look at a variety of
graphical model and mixture model techniques that act as ‘components’ in a ‘flexible
toolbox’ for the construction of different types of likelihood functions.
1.2 Parameter Estimation from Data
1.2.1 Basic principles
We can now turn to the main type of inference that will be used throughout this book,
namely estimation of parameters θ under the assumption of a particular functional
form for a model M. For example, if our model M is the Gaussian distribution, then
we have two parameters: the mean and standard deviation of this Gaussian.
In what follows below we will refer to priors, likelihoods, and posteriors relating
to sets of parameters θ. Typically our parameters θ are a set of real-valued numbers.
Thus, both the prior P(θ) and the posterior P(θ | D) define probability density
functions over this set of real-valued numbers. For example, our prior density might
assert that a particular parameter (such as the standard deviation) must be positive, or
that the mean is likely to lie within a certain range on the real line. From a Bayesian
viewpoint our prior and posterior reflect our degree of belief in terms of where θ lies,
given the relevant evidence. For example, for a single parameter θ, P(θ > 0 | D) is
the posterior probability that θ is greater than zero, given the evidence provided by the
data. For simplicity of notation we will not always explicitly include the conditioning
term for the model M, P(θ | M) or P(θ | D, M), but instead write expressions such
as P(θ) which implicitly assume the presence of a model M with some particular
functional form.
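As a small illustration of such posterior statements, the sketch below approximates the posterior density of a single mean parameter θ on a grid, assuming a Gaussian likelihood with known standard deviation and a Gaussian prior on the mean; the data values and prior settings are invented for the example.

```python
import math

# Hypothetical observations and prior settings for a single mean parameter θ.
data = [0.8, -0.3, 1.2, 0.5]
grid = [i / 100.0 for i in range(-500, 501)]              # candidate values of θ in [-5, 5]

def log_gauss(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Unnormalized log posterior on the grid: log P(D | θ) + log P(θ),
# with a Gaussian likelihood (known σ = 1) and a N(0, 2²) prior on θ.
log_post = [sum(log_gauss(x, t, 1.0) for x in data) + log_gauss(t, 0.0, 2.0)
            for t in grid]
m = max(log_post)
weights = [math.exp(lp - m) for lp in log_post]           # rescale before exponentiating
total = sum(weights)
post = [w / total for w in weights]                       # normalized over the grid

# Posterior probability that θ is greater than zero, P(θ > 0 | D).
print(sum(p for t, p in zip(grid, post) if t > 0))        # ~0.86 for these made-up data
```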
The general objective of parameter estimation is to find or approximate the ‘best’
set of parameters for a model – that is, to find the set of parameters θ maximizing the
posterior P(θ | D), or log P(θ | D). This is called maximum a posteriori (MAP)
estimation.
In order to deal with positive quantities, we can minimize −log P(θ | D):
E(θ) = −log P(θ | D) = −log P(D | θ) − log P(θ) + log P(D). (1.4)
From an optimization standpoint, the logarithm of the prior P(θ) plays the role of a
regularizer, that is, of an additional penalty term that can be used to enforce additional
constraints on the parameters.
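The sketch below illustrates MAP estimation in the sense of Equation (1.4) for a simple Bernoulli model with a Beta prior (a standard example chosen for illustration, not one from the book); the −log P(θ) term acts as the penalty just described, and for this model the minimizer also has a well-known closed form.

```python
import math

def neg_log_posterior(theta, heads, tails, a, b):
    """E(θ) up to additive constants: -log P(D | θ) - log P(θ)."""
    log_lik = heads * math.log(theta) + tails * math.log(1.0 - theta)
    log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    return -(log_lik + log_prior)

heads, tails, a, b = 3, 1, 2.0, 2.0       # hypothetical data and a Beta(2, 2) prior
# For this model the minimizer has the closed form (heads + a - 1)/(N + a + b - 2).
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)
# A crude grid search over (0, 1) recovers the same value.
grid_min = min((i / 1000.0 for i in range(1, 1000)),
               key=lambda t: neg_log_posterior(t, heads, tails, a, b))
print(theta_map, round(grid_min, 3))      # both ~0.667
```

Larger values of a and b would pull the estimate more strongly toward the prior mean, which is exactly the regularizing effect of the −log P(θ) term in Equation (1.4).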