Prologue: A machine learning sampler
Probabilities involve ‘random variables’ that describe outcomes of ‘events’. These events
are often hypothetical and therefore probabilities have to be estimated. For example, con-
sider the statement ‘42% of the UK population approves of the current Prime Minister’.
The only way to know this for certain is to ask everyone in the UK, which is of course
infeasible. Instead, a (hopefully representative) sample is queried, and a more accurate
statement would then be ‘42% of a sample drawn from the UK population approves of the
current Prime Minister’, or ‘the proportion of the UK population approving of the current
Prime Minister is estimated at 42%’. Notice that these statements are formulated in terms
of proportions or ‘relative frequencies’; a corresponding statement expressed in terms of
probabilities would be ‘the probability that a person uniformly drawn from the UK popu-
lation approves of the current Prime Minister is estimated at 0.42’. The event here is ‘this
random person approves of the PM’.
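This estimation process can be sketched in a few lines of Python. The population size, approval rate, and sample size below are assumptions for illustration only, not figures from the text:

```python
import random

# Hypothetical population of 1,000,000 people, 42% of whom approve of the PM.
# We estimate the approval probability from a random sample of 1,000.
random.seed(0)
population = [1] * 420_000 + [0] * 580_000  # 1 = approves, 0 = does not

sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)
print(f"estimated P(approves) = {estimate:.2f}")  # close to 0.42
```

The estimate fluctuates from sample to sample, which is exactly why the hedged phrasing "is estimated at 0.42" is the honest formulation.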
The ‘conditional probability’ P(A|B) is the probability of event A happening given that
event B happened. For instance, the approval rate of the Prime Minister may differ for
men and women. Writing P(PM) for the probability that a random person approves of the
Prime Minister and P(PM|woman) for the probability that a random woman approves of
the Prime Minister, we then have that P(PM|woman) = P(PM, woman)/P(woman), where
P(PM, woman) is the probability of the 'joint event' that a random person both approves
of the PM and is a woman, and P(woman) is the probability that a random person is a
woman (i.e., the proportion of women in the UK population).
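A minimal sketch of this calculation from a contingency table of counts. The counts are hypothetical, chosen only to illustrate the definition:

```python
# Hypothetical counts from a sample of 1,000 people (assumed, not from the text).
counts = {
    ("woman", "approves"): 230,
    ("woman", "disapproves"): 280,
    ("man", "approves"): 190,
    ("man", "disapproves"): 300,
}
total = sum(counts.values())

p_woman = (counts[("woman", "approves")]
           + counts[("woman", "disapproves")]) / total
p_pm_and_woman = counts[("woman", "approves")] / total

# P(PM | woman) = P(PM, woman) / P(woman)
p_pm_given_woman = p_pm_and_woman / p_woman
print(f"P(PM | woman) = {p_pm_given_woman:.3f}")
```

Note that conditioning simply renormalises: we restrict attention to the women in the sample and ask what fraction of them approve.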
Other useful equations include P(A, B) = P(A|B)P(B) = P(B|A)P(A) and P(A|B) =
P(B|A)P(A)/P(B). The latter is known as 'Bayes' rule' and will play an impor-
tant role in this book. Notice that many of these equations can be extended to
more than two random variables, e.g. the 'chain rule of probability': P(A, B, C, D) =
P(A|B, C, D)P(B|C, D)P(C|D)P(D).
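A quick numerical check, with hypothetical counts assumed for illustration, that Bayes' rule agrees with the direct definition of conditional probability. Here A is 'approves of the PM' and B is 'is a woman':

```python
# Assumed counts in a sample of 1,000 people (illustrative only).
n_total, n_a, n_b, n_a_and_b = 1000, 420, 510, 230

p_a = n_a / n_total              # P(A)
p_b = n_b / n_total              # P(B)
p_a_and_b = n_a_and_b / n_total  # P(A, B)

direct = p_a_and_b / p_b                    # P(A|B) = P(A, B)/P(B)
p_b_given_a = p_a_and_b / p_a               # P(B|A) = P(A, B)/P(A)
via_bayes = p_b_given_a * p_a / p_b         # P(A|B) = P(B|A)P(A)/P(B)

assert abs(direct - via_bayes) < 1e-12
```

The two routes coincide by construction: Bayes' rule is just the identity P(A, B) = P(A|B)P(B) = P(B|A)P(A) rearranged.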
Two events A and B are independent if P(A|B) = P(A), i.e., if knowing that B happened
doesn't change the probability of A happening. An equivalent formulation is P(A, B) =
P(A)P(B). In general, multiplying probabilities involves the assumption that the
corresponding events are independent.
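The equivalent formulation gives a direct numerical test for independence: compare the joint probability with the product of the marginals. The probabilities below are assumed for illustration:

```python
# Assumed probabilities (illustrative): P(A), P(B), and the joint P(A, B).
p_a, p_b = 0.42, 0.51
p_a_and_b = 0.23

# A and B are independent exactly when P(A, B) = P(A) * P(B).
independent = abs(p_a_and_b - p_a * p_b) < 1e-9
print(independent)  # False here: 0.23 differs from 0.42 * 0.51 = 0.2142
```

With these numbers the joint probability exceeds the product, so approval and gender would be dependent: knowing B shifts the probability of A.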
The ‘odds’ of an event is the ratio of the probability that the event happens and the proba-
bility that it doesn’t happen. That is, if the probability of a particular event happening is p,
then the corresponding odds are o = p/(1 - p). Conversely, we have that p = o/(o + 1). So,
for example, a probability of 0.8 corresponds to odds of 4:1, the opposite odds of 1:4 give
probability 0.2, and if the event is as likely to occur as not then the probability is 0.5 and
the odds are 1:1. While we will most often use the probability scale, odds are sometimes
more convenient because they are expressed on a multiplicative scale.
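The two conversions can be written as a pair of inverse functions; the names `odds` and `prob` are my own, chosen for this sketch:

```python
def odds(p: float) -> float:
    """Odds o = p/(1 - p) of an event with probability p."""
    return p / (1 - p)

def prob(o: float) -> float:
    """Inverse mapping: p = o/(o + 1)."""
    return o / (o + 1)

print(odds(0.8))  # odds of 4:1 (approximately 4.0 in floating point)
print(prob(4.0))  # 0.8
print(odds(0.5))  # 1.0, i.e. even odds
```

The multiplicative convenience mentioned above shows up here: doubling the odds is a single multiplication, whereas the same update on the probability scale requires converting back and forth.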
Background 2. The basics of probability.