Prologue: A machine learning sampler
Probabilities involve ‘random variables’ that describe outcomes of ‘events’. These events
are often hypothetical and therefore probabilities have to be estimated. For example, con-
sider the statement ‘42% of the UK population approves of the current Prime Minister’.
The only way to know this for certain is to ask everyone in the UK, which is of course
infeasible. Instead, a (hopefully representative) sample is queried, and a more accurate
statement would then be ‘42% of a sample drawn from the UK population approves of the
current Prime Minister’, or ‘the proportion of the UK population approving of the current
Prime Minister is estimated at 42%’. Notice that these statements are formulated in terms
of proportions or ‘relative frequencies’; a corresponding statement expressed in terms of
probabilities would be ‘the probability that a person uniformly drawn from the UK popu-
lation approves of the current Prime Minister is estimated at 0.42’. The event here is ‘this
random person approves of the PM’.
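This estimation process can be sketched in a few lines of Python. The population size, approval rate, and sample size below are assumptions for illustration only, not figures from the text:

```python
import random

# Hypothetical population of 1,000,000 people, 42% of whom approve of the PM.
# We estimate the approval probability from a random sample of 1,000.
random.seed(0)
population = [1] * 420_000 + [0] * 580_000  # 1 = approves, 0 = does not

sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)
print(f"estimated P(approves) = {estimate:.2f}")  # close to 0.42
```

The estimate fluctuates from sample to sample, which is exactly why the hedged phrasing "is estimated at 0.42" is the honest formulation.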
The ‘conditional probability’ P(A|B) is the probability of event A happening given that
event B happened. For instance, the approval rate of the Prime Minister may differ for
men and women. Writing P(PM) for the probability that a random person approves of the
Prime Minister and P(PM|woman) for the probability that a random woman approves of
the Prime Minister, we then have that P(PM|woman) = P(PM, woman)/P(woman), where
P(PM, woman) is the probability of the 'joint event' that a random person both approves
of the PM and is a woman, and P(woman) is the probability that a random person is a
woman (i.e., the proportion of women in the UK population).
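A minimal sketch of this calculation from a contingency table of counts. The counts are hypothetical, chosen only to illustrate the definition:

```python
# Hypothetical counts from a sample of 1,000 people (assumed, not from the text).
counts = {
    ("woman", "approves"): 230,
    ("woman", "disapproves"): 280,
    ("man", "approves"): 190,
    ("man", "disapproves"): 300,
}
total = sum(counts.values())

p_woman = (counts[("woman", "approves")]
           + counts[("woman", "disapproves")]) / total
p_pm_and_woman = counts[("woman", "approves")] / total

# P(PM | woman) = P(PM, woman) / P(woman)
p_pm_given_woman = p_pm_and_woman / p_woman
print(f"P(PM | woman) = {p_pm_given_woman:.3f}")
```

Note that conditioning simply renormalises: we restrict attention to the women in the sample and ask what fraction of them approve.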
Other useful equations include P(A, B) = P(A|B)P(B) = P(B|A)P(A) and P(A|B) =
P(B|A)P(A)/P(B). The latter is known as 'Bayes' rule' and will play an impor-
tant role in this book. Notice that many of these equations can be extended to
more than two random variables, e.g. the 'chain rule of probability': P(A, B, C, D) =
P(A|B, C, D)P(B|C, D)P(C|D)P(D).
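A quick numerical check, with hypothetical counts assumed for illustration, that Bayes' rule agrees with the direct definition of conditional probability. Here A is 'approves of the PM' and B is 'is a woman':

```python
# Assumed counts in a sample of 1,000 people (illustrative only).
n_total, n_a, n_b, n_a_and_b = 1000, 420, 510, 230

p_a = n_a / n_total              # P(A)
p_b = n_b / n_total              # P(B)
p_a_and_b = n_a_and_b / n_total  # P(A, B)

direct = p_a_and_b / p_b                    # P(A|B) = P(A, B)/P(B)
p_b_given_a = p_a_and_b / p_a               # P(B|A) = P(A, B)/P(A)
via_bayes = p_b_given_a * p_a / p_b         # P(A|B) = P(B|A)P(A)/P(B)

assert abs(direct - via_bayes) < 1e-12
```

The two routes coincide by construction: Bayes' rule is just the identity P(A, B) = P(A|B)P(B) = P(B|A)P(A) rearranged.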
Two events A and B are independent if P(A|B) = P(A), i.e., if knowing that B happened
doesn't change the probability of A happening. An equivalent formulation is P(A, B) =
P(A)P(B). In general, multiplying probabilities involves the assumption that the
corresponding events are independent.
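The equivalent formulation gives a direct numerical test for independence: compare the joint probability with the product of the marginals. The probabilities below are assumed for illustration:

```python
# Assumed probabilities (illustrative): P(A), P(B), and the joint P(A, B).
p_a, p_b = 0.42, 0.51
p_a_and_b = 0.23

# A and B are independent exactly when P(A, B) = P(A) * P(B).
independent = abs(p_a_and_b - p_a * p_b) < 1e-9
print(independent)  # False here: 0.23 differs from 0.42 * 0.51 = 0.2142
```

With these numbers the joint probability exceeds the product, so approval and gender would be dependent: knowing B shifts the probability of A.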
The ‘odds’ of an event is the ratio of the probability that the event happens and the proba-
bility that it doesn’t happen. That is, if the probability of a particular event happening is p,
then the corresponding odds are o = p/(1 - p). Conversely, we have that p = o/(o + 1). So,
for example, a probability of 0.8 corresponds to odds of 4:1, the opposite odds of 1:4 give
probability 0.2, and if the event is as likely to occur as not then the probability is 0.5 and
the odds are 1:1. While we will most often use the probability scale, odds are sometimes
more convenient because they are expressed on a multiplicative scale.
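The two conversions can be written as a pair of inverse functions; the names `odds` and `prob` are my own, chosen for this sketch:

```python
def odds(p: float) -> float:
    """Odds o = p/(1 - p) of an event with probability p."""
    return p / (1 - p)

def prob(o: float) -> float:
    """Inverse mapping: p = o/(o + 1)."""
    return o / (o + 1)

print(odds(0.8))  # odds of 4:1 (approximately 4.0 in floating point)
print(prob(4.0))  # 0.8
print(odds(0.5))  # 1.0, i.e. even odds
```

The multiplicative convenience mentioned above shows up here: doubling the odds is a single multiplication, whereas the same update on the probability scale requires converting back and forth.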
Background 2. The basics of probability.