1 Recursive Generalised Linear Models

Deep learning and the use of deep neural networks [1] are now established as a key tool for practical machine learning. Neural networks have an equivalence with many existing statistical and machine learning approaches and I would like to explore one of these views in this post. In particular, I'll look at the view of deep neural networks as recursive generalised linear models (RGLMs). Generalised linear models form one of the cornerstones of probabilistic modelling and are used in almost every field of experimental science, so this connection is an extremely useful one to have in mind. I'll focus here on what are called feed-forward neural networks and leave a discussion of the statistical connections to recurrent networks to another post.

1.1 generalised linear models

The basic linear regression model is a linear mapping from P-dimensional input features (or covariates) x to a set of targets (or responses) y, using a set of weights (or regression coefficients) β and a bias (offset) β₀. The outputs can also be multivariate, but I'll assume they are scalar here. The full probabilistic model assumes that the outputs are corrupted by Gaussian noise of unknown variance σ².

η = β⊤x + β₀

y = η + ε,  ε ∼ N(0, σ²)
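This probabilistic linear model can be simulated and fitted in a few lines. The sketch below (with illustrative values for β, β₀, and σ) folds the bias into the weight vector by appending a constant-1 feature, then recovers the coefficients by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = beta^T x + beta_0 + eps, with eps ~ N(0, sigma^2).
# All parameter values here are illustrative, not from the post.
P = 3
beta_true = np.array([2.0, -1.0, 0.5])  # regression coefficients
beta0_true, sigma = 4.0, 0.1            # bias and noise std. deviation

X = rng.normal(size=(100, P))
eta = X @ beta_true + beta0_true                 # systematic component
y = eta + rng.normal(scale=sigma, size=100)      # add the random component

# Append a constant-1 column so the bias becomes an ordinary weight,
# then estimate [beta, beta_0] jointly by least squares.
X1 = np.hstack([X, np.ones((100, 1))])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # close to [2.0, -1.0, 0.5, 4.0]
```

The same trick of absorbing the bias into the weights is exactly the compact notation β = [β̂, β₀], x = [x̂, 1] used in the generalised regression problem below.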

In this formulation, η is the systematic component of the model and ε is the random component. Generalised linear models (GLMs) [2] allow us to extend this formulation to problems where the distribution on the targets is not Gaussian but some other distribution (typically a distribution in the exponential family). In this case, we can write the generalised regression problem, combining the coefficients and bias for more compact notation, as:

η = β⊤x,  β = [β̂, β₀],  x = [x̂, 1]

E[y] = µ = g⁻¹(η)

where g(·) is the link function that allows us to move from natural parameters η to mean parameters µ. If the inverse link function used in the definition of µ above were the logistic sigmoid, then the mean parameters correspond to the probabilities of y being a 1 or 0 under the Bernoulli distribution.
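As a minimal sketch of this logistic-sigmoid case (the weights and input below are made up for illustration), the inverse link maps the unbounded natural parameter η to a mean µ in (0, 1), which is then read as P(y = 1 | x):

```python
import numpy as np

def sigmoid(eta):
    # Inverse of the logit link: maps natural parameters to (0, 1).
    return 1.0 / (1.0 + np.exp(-eta))

beta = np.array([1.5, -2.0, 0.5])  # illustrative weights, [beta_hat, beta_0]
x = np.array([0.2, 0.4, 1.0])      # input with the constant 1 appended

eta = beta @ x      # systematic component (natural parameter)
mu = sigmoid(eta)   # mean parameter: the probability that y = 1
print(eta, mu)      # here eta is about 0, so mu is about 0.5
```

Swapping the sigmoid for another inverse link (exponential, identity, etc.) changes the target distribution while keeping the linear systematic component untouched, which is the essential flexibility of GLMs.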

