Chapter 2
Probability
2.1 Frequentists vs. Bayesians
What is probability? Consider the statement "the probability that a coin will land heads is 0.5". There are two main interpretations of such a statement.
One is called the frequentist interpretation. In this
view, probabilities represent long run frequencies of
events. For example, the above statement means that, if
we flip the coin many times, we expect it to land heads
about half the time.
The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine learning oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.
2.2 A brief review of probability theory
2.2.1 Basic concepts
We denote a random event by defining a random variable X.

Discrete random variable: X can take on any value from a finite or countably infinite set.

Continuous random variable: the value of X is real-valued.
2.2.1.1 CDF

The cumulative distribution function (CDF) of X is defined as

F(x) ≜ P(X ≤ x) =
  ∑_{u ≤ x} p(u)         (discrete)
  ∫_{−∞}^{x} f(u) du     (continuous)
(2.1)
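To make Eq. (2.1) concrete in the discrete case, here is a minimal sketch: the CDF of a discrete random variable is just the running sum of its PMF. The fair six-sided die below is an illustrative assumption, not an example from the text.

```python
# Illustrative assumption: PMF of a fair six-sided die.
pmf = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def cdf(x, pmf):
    """F(x) = P(X <= x) = sum of p(u) over all values u <= x (Eq. 2.1, discrete case)."""
    return sum(p for u, p in pmf.items() if u <= x)

# Sanity checks: P(X <= 3) = 1/2 for a fair die, and F reaches 1 at the top value.
assert abs(cdf(3, pmf) - 0.5) < 1e-12
assert abs(cdf(6, pmf) - 1.0) < 1e-12
```

Note that the CDF is non-decreasing in x by construction, since each step adds a non-negative PMF value.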
2.2.1.2 PMF and PDF
For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function or PMF: a function that gives the probability that a discrete random variable is exactly equal to some value [4]. It satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.
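The two PMF properties can be checked mechanically for any concrete distribution. As a sketch, the Binomial(n = 4, θ = 0.3) PMF below is built from its textbook formula; the binomial choice and parameter values are illustrative assumptions, not taken from the text.

```python
from math import comb

# Illustrative assumption: Binomial(n=4, theta=0.3) PMF,
# p(k) = C(n, k) * theta^k * (1 - theta)^(n - k).
n, theta = 4, 0.3
pmf = {k: comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)}

# Property 1: every probability lies in [0, 1].
assert all(0.0 <= p <= 1.0 for p in pmf.values())
# Property 2: the probabilities sum to 1 (up to floating-point error).
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```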
For a continuous random variable, the function f(x) in the equation F(x) = ∫_{−∞}^{x} f(u) du is called a probability density function or PDF: a function that describes the relative likelihood for the random variable to take on a given value [5]. It satisfies the properties f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1.
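The normalization property ∫ f(x) dx = 1 can be checked numerically. The sketch below uses the standard normal density and a simple trapezoidal rule on [−8, 8] (which carries essentially all of the mass); both the choice of density and the grid are illustrative assumptions.

```python
import math

def f(x):
    """Illustrative assumption: the standard normal PDF."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Trapezoidal rule on [-8, 8]; the tails beyond this interval are negligible.
n, lo, hi = 16000, -8.0, 8.0
h = (hi - lo) / n
xs = [lo + i * h for i in range(n + 1)]
total = h * (sum(f(x) for x in xs) - 0.5 * (f(xs[0]) + f(xs[-1])))

# The density is non-negative everywhere and integrates to (approximately) 1.
assert all(f(x) >= 0.0 for x in xs)
assert abs(total - 1.0) < 1e-9
```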
2.2.2 Multivariate random variables
2.2.2.1 Joint CDF
We denote the joint CDF by F(x,y) ≜ P(X ≤ x ∩ Y ≤ y) = P(X ≤ x, Y ≤ y).

F(x,y) ≜ P(X ≤ x, Y ≤ y) =
  ∑_{u ≤ x, v ≤ y} p(u,v)                  (discrete)
  ∫_{−∞}^{x} ∫_{−∞}^{y} f(u,v) du dv      (continuous)
(2.2)
Product rule:

p(X,Y) = p(X|Y) p(Y) (2.3)
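The product rule can be verified directly on any joint distribution table. The tiny weather example below is an illustrative assumption: we build the marginal p(y) and the conditional p(x|y) from the joint, then check that their product recovers p(x, y).

```python
# Illustrative assumption: a hand-made joint distribution p(x, y)
# over (precipitation, sky), chosen so the entries sum to 1.
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "clear"):  0.05,
    ("dry",  "cloudy"): 0.25,
    ("dry",  "clear"):  0.40,
}

def p_y(y):
    """Marginal p(y) = sum over x of p(x, y)."""
    return sum(p for (x, yy), p in joint.items() if yy == y)

def p_x_given_y(x, y):
    """Conditional p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / p_y(y)

# Product rule (Eq. 2.3): p(x, y) = p(x | y) p(y) for every entry.
for (x, y), p in joint.items():
    assert abs(p_x_given_y(x, y) * p_y(y) - p) < 1e-12
```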
Chain rule:

p(X_{1:D}) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) ··· p(X_D|X_{1:D−1})
[4] http://en.wikipedia.org/wiki/Probability_mass_function
[5] http://en.wikipedia.org/wiki/Probability_density_function