Chapter 2
Probability
2.1 Frequentists vs. Bayesians
What is probability? Consider the statement "the probability that a coin will land heads is 0.5". There are two main interpretations of such a statement.
One is called the frequentist interpretation. In this
view, probabilities represent long run frequencies of
events. For example, the above statement means that, if
we flip the coin many times, we expect it to land heads
about half the time.
The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine learning oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.
2.2 A brief review of probability theory
2.2.1 Basic concepts
We denote a random event by defining a random variable X.

Discrete random variable: X can take on any value from a finite or countably infinite set.

Continuous random variable: the value of X is real-valued.
2.2.1.1 CDF

The cumulative distribution function (CDF) of X is defined as

F(x) ≜ P(X ≤ x) =
  ∑_{u ≤ x} p(u)         (discrete)
  ∫_{−∞}^{x} f(u) du     (continuous)
(2.1)
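To make Eq. (2.1) concrete in the discrete case, here is a minimal sketch: the CDF of a discrete random variable is just the running sum of its PMF. The fair six-sided die below is an illustrative assumption, not an example from the text.

```python
# Illustrative assumption: PMF of a fair six-sided die.
pmf = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def cdf(x, pmf):
    """F(x) = P(X <= x) = sum of p(u) over all values u <= x (Eq. 2.1, discrete case)."""
    return sum(p for u, p in pmf.items() if u <= x)

# Sanity checks: P(X <= 3) = 1/2 for a fair die, and F reaches 1 at the top value.
assert abs(cdf(3, pmf) - 0.5) < 1e-12
assert abs(cdf(6, pmf) - 1.0) < 1e-12
```

Note that the CDF is non-decreasing in x by construction, since each step adds a non-negative PMF value.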
2.2.1.2 PMF and PDF
For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function or PMF: a function that gives the probability that a discrete random variable is exactly equal to some value [4]. It satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.
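The two PMF properties can be checked mechanically for any concrete distribution. As a sketch, the Binomial(n = 4, θ = 0.3) PMF below is built from its textbook formula; the binomial choice and parameter values are illustrative assumptions, not taken from the text.

```python
from math import comb

# Illustrative assumption: Binomial(n=4, theta=0.3) PMF,
# p(k) = C(n, k) * theta^k * (1 - theta)^(n - k).
n, theta = 4, 0.3
pmf = {k: comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)}

# Property 1: every probability lies in [0, 1].
assert all(0.0 <= p <= 1.0 for p in pmf.values())
# Property 2: the probabilities sum to 1 (up to floating-point error).
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```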
For a continuous random variable, the function f(x) in the equation F(x) = ∫_{−∞}^{x} f(u) du is called a probability density function or PDF: a function that describes the relative likelihood for the random variable to take on a given value [5]. It satisfies the properties f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1.
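The normalization property ∫ f(x) dx = 1 can be checked numerically. The sketch below uses the standard normal density and a simple trapezoidal rule on [−8, 8] (which carries essentially all of the mass); both the choice of density and the grid are illustrative assumptions.

```python
import math

def f(x):
    """Illustrative assumption: the standard normal PDF."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Trapezoidal rule on [-8, 8]; the tails beyond this interval are negligible.
n, lo, hi = 16000, -8.0, 8.0
h = (hi - lo) / n
xs = [lo + i * h for i in range(n + 1)]
total = h * (sum(f(x) for x in xs) - 0.5 * (f(xs[0]) + f(xs[-1])))

# The density is non-negative everywhere and integrates to (approximately) 1.
assert all(f(x) >= 0.0 for x in xs)
assert abs(total - 1.0) < 1e-9
```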
2.2.2 Multivariate random variables
2.2.2.1 Joint CDF
We denote the joint CDF by F(x,y) ≜ P(X ≤ x ∩ Y ≤ y) = P(X ≤ x, Y ≤ y).

F(x,y) ≜ P(X ≤ x, Y ≤ y) =
  ∑_{u ≤ x, v ≤ y} p(u,v)                  (discrete)
  ∫_{−∞}^{x} ∫_{−∞}^{y} f(u,v) du dv      (continuous)
(2.2)
Product rule:

p(X,Y) = p(X|Y) p(Y) (2.3)
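The product rule can be verified directly on any joint distribution table. The tiny weather example below is an illustrative assumption: we build the marginal p(y) and the conditional p(x|y) from the joint, then check that their product recovers p(x, y).

```python
# Illustrative assumption: a hand-made joint distribution p(x, y)
# over (precipitation, sky), chosen so the entries sum to 1.
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "clear"):  0.05,
    ("dry",  "cloudy"): 0.25,
    ("dry",  "clear"):  0.40,
}

def p_y(y):
    """Marginal p(y) = sum over x of p(x, y)."""
    return sum(p for (x, yy), p in joint.items() if yy == y)

def p_x_given_y(x, y):
    """Conditional p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / p_y(y)

# Product rule (Eq. 2.3): p(x, y) = p(x | y) p(y) for every entry.
for (x, y), p in joint.items():
    assert abs(p_x_given_y(x, y) * p_y(y) - p) < 1e-12
```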
Chain rule:

p(X_{1:D}) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) ··· p(X_D|X_{1:D−1})
[4] http://en.wikipedia.org/wiki/Probability_mass_function
[5] http://en.wikipedia.org/wiki/Probability_density_function