2. Information Theory of Deep Learning
In supervised learning we are interested in good representations, T(X), of the input patterns x ∈ X,
that enable good predictions of the label y ∈ Y . Moreover, we want to efficiently learn such
representations from an empirical sample of the (unknown) joint distribution P (X, Y ), in a way
that provides good generalization.
DNNs and Deep Learning generate a Markov chain of such representations, the hidden layers, by minimizing the empirical error over the weights of the network, layer by layer. This optimization takes place via stochastic gradient descent (SGD), using a noisy estimate of the gradient of the empirical error at each weight, computed through back-propagation.
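To make this training loop concrete, the following is a minimal numerical sketch, not the architecture or code studied in this work, of mini-batch SGD with back-propagation through a single hidden layer; the data, layer sizes, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 256 binary-labeled samples of 12-dimensional inputs (illustrative only).
X = rng.standard_normal((256, 12))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer with tanh units; sizes and learning rate are arbitrary.
W1 = rng.standard_normal((12, 8)) * 0.1
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)) * 0.1
b2 = np.zeros(1)
lr, batch = 0.1, 32

for step in range(2000):
    idx = rng.choice(len(X), batch, replace=False)   # mini-batch -> noisy gradient estimate
    xb, yb = X[idx], y[idx]

    # Forward pass: hidden representation T and predicted probability.
    T = np.tanh(xb @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(T @ W2 + b2)))

    # Back-propagation of the cross-entropy error through the layers.
    dlogits = (p - yb) / batch
    dW2, db2 = T.T @ dlogits, dlogits.sum(0)
    dT = dlogits @ W2.T * (1.0 - T ** 2)
    dW1, db1 = xb.T @ dT, dT.sum(0)

    # SGD update at each weight.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```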
Our first important insight is to treat the whole layer, T, as a single random variable, characterized by its encoder distribution, P(T|X), and its decoder distribution, P(Y|T). As we are only interested in the information that flows through the network, invertible transformations of the representations that preserve information generate equivalent representations, even if the individual neurons encode entirely different features of the input. For this reason we quantify the representations by two numbers, or order parameters, that are invariant to any invertible re-parameterization of T: the mutual information of T with the input X and with the desired output Y.
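As a concrete sketch of these two order parameters, the snippet below estimates I(T;X) and I(T;Y) from samples by discretizing the activations of a layer into equal-width bins and tabulating empirical joint distributions. The binning scheme, the number of bins, and the helper names are assumptions made only for illustration, not the estimation procedure prescribed here; X and Y are assumed discrete (or already discretized).

```python
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits from paired samples of discrete variables."""
    _, a_idx = np.unique(a, axis=0, return_inverse=True)
    _, b_idx = np.unique(b, axis=0, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx.ravel(), b_idx.ravel()), 1)   # empirical joint counts
    joint /= joint.sum()
    p_a = joint.sum(1, keepdims=True)
    p_b = joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (p_a * p_b)[nz])).sum())

def layer_order_parameters(X, Y, T, n_bins=30):
    """Treat the layer activations T as one random variable; return (I(T;X), I(T;Y))."""
    edges = np.linspace(T.min(), T.max(), n_bins + 1)
    T_binned = np.digitize(T, edges)          # discretize each unit's activation
    return (discrete_mutual_information(T_binned, X),
            discrete_mutual_information(T_binned, Y))
```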
Next, we quantify the quality of the layers by comparing them to the information-theoretically optimal representations, the Information Bottleneck representations, and then describe how SGD in Deep Learning can achieve these optimal representations.
2.1 Mutual Information
Given any two random variables, X and Y , with a joint distribution p(x, y), their Mutual Informa-
tion is defined as:
\begin{align}
I(X;Y) &= D_{KL}\!\left[p(x,y)\,\|\,p(x)p(y)\right] = \sum_{x\in X,\,y\in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}\\
&= \sum_{x\in X,\,y\in Y} p(x,y)\,\log\frac{p(x|y)}{p(x)} = H(X) - H(X|Y)\,, \tag{2}
\end{align}
where $D_{KL}[p\,\|\,q]$ is the Kullback-Leibler divergence of the distributions p and q, and H(X) and H(X|Y) are the entropy of X and the conditional entropy of X given Y, respectively.
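As a small numerical check of Eqs. (1) and (2), the following evaluates both expressions, in bits, on an illustrative joint distribution table (the particular numbers carry no meaning):

```python
import numpy as np

# A small joint distribution p(x, y) as a table (rows: x, columns: y); values are illustrative.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])

p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
mask = p_xy > 0

# Eq. (1): I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
I_kl = (p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum()

# Eq. (2): I(X;Y) = H(X) - H(X|Y)
H_x = -(p_x * np.log2(p_x)).sum()
p_x_given_y = p_xy / p_y                 # conditional p(x|y)
H_x_given_y = -(p_xy[mask] * np.log2(p_x_given_y[mask])).sum()

assert np.isclose(I_kl, H_x - H_x_given_y)
```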
The mutual information quantifies the number of relevant bits that the input variable X contains
about the label Y , on average. The optimal learning problem can be cast as the construction of an
optimal encoder of that relevant information via an efficient representation - a minimal sufficient
statistic of X with respect to Y - if such can be found. A minimal sufficient statistic can enable
the decoding of the relevant information with the smallest number of binary questions (on average);
i.e., an optimal code. The connection between mutual information and minimal sufficient statistics
is discussed in Section 2.3.
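The following toy example, which is only an illustration and not taken from this work, makes the role of a sufficient statistic concrete: when the label depends only on one component of the input, that component preserves all of the relevant information I(X;Y) while discarding most of the bits about X itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# X = (X1, X2): the label depends only on X1, so T(X) = X1 is a sufficient
# statistic of X with respect to Y (illustrative construction only).
x1 = rng.integers(0, 4, size=20000)
x2 = rng.integers(0, 4, size=20000)            # irrelevant noise component
y = (x1 + rng.integers(0, 2, size=20000)) % 4

x = x1 * 4 + x2                                 # encode the pair as one discrete variable
t = x1                                          # the sufficient statistic

def mi_bits(a, b):
    """Plug-in estimate of I(A;B) in bits from paired discrete samples."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    p_a, p_b = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (p_a * p_b)[nz])).sum()

# T keeps all the relevant information about Y while discarding bits about X:
print(mi_bits(x, y), mi_bits(t, y))   # approximately equal (about 1 bit each)
print(mi_bits(x, x), mi_bits(t, x))   # H(X) of about 4 bits vs. I(T;X) of about 2 bits
```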
Two properties of the mutual information are very important in the context of DNNs. The first
is its invariance to invertible transformations:
$$
I(X;Y) = I\left(\psi(X);\phi(Y)\right) \tag{3}
$$
for any invertible functions φ and ψ.
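For discrete variables, an invertible map is simply a one-to-one relabeling of the support, so ψ and φ permute the rows and columns of the joint table without changing any term of Eq. (1). A minimal check, reusing the illustrative joint table from above (the permutations are arbitrary):

```python
import numpy as np

def mi_from_joint(p_xy):
    """I(X;Y) in bits, computed directly from a joint table as in Eq. (1)."""
    p_x = p_xy.sum(1, keepdims=True)
    p_y = p_xy.sum(0, keepdims=True)
    nz = p_xy > 0
    return (p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz])).sum()

# On a discrete support, invertible psi and phi amount to permuting
# the rows (for X) and columns (for Y) of the joint table.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
psi = [2, 0, 1]        # arbitrary permutation of the x-support
phi = [1, 0]           # arbitrary permutation of the y-support
p_transformed = p_xy[psi][:, phi]

# Eq. (3): the mutual information is unchanged.
assert np.isclose(mi_from_joint(p_xy), mi_from_joint(p_transformed))
```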