optimization problem in such a way that a tradeoff is created between two amounts of information: the information that the bottleneck vector contains about the input, and the information that it contains about the output. The chapter then goes on to find an optimal manifold for data representation, using the information bottleneck method.
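In symbols (notation supplied here for illustration: input X, output Y, bottleneck vector T, mutual information I(·;·), and Lagrange multiplier β ≥ 0), this tradeoff is commonly written as the minimization

\[
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),
\]

where a small β favors compressing the input and a large β favors retaining information about the output.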
The final approach to unsupervised learning is described in Chapter 11, using
stochastic methods that are rooted in statistical mechanics; the study of statistical
mechanics is closely related to information theory. The chapter begins by reviewing the fundamental concepts of Helmholtz free energy and entropy (in a statistical-mechanics sense), followed by a description of Markov chains. The stage is then set for describing the Metropolis algorithm for generating a Markov chain whose state distribution converges to a unique and stable (stationary) distribution.
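As a concrete illustration, a minimal sketch of the Metropolis rule is given below (Python, written for this preface; the energy function, proposal width, and names are illustrative assumptions, not the book's code). A candidate move that lowers the energy is always accepted; one that raises it is accepted with probability exp(-ΔE/T), which is what drives the chain toward the stationary distribution.

import math, random

def metropolis(energy, x0, temperature=1.0, steps=10000, step_size=0.5):
    """Generate samples whose stationary distribution is proportional to exp(-E(x)/T)."""
    x = x0
    samples = []
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)  # symmetric proposal
        delta_e = energy(candidate) - energy(x)
        # Accept downhill moves always; uphill moves with probability exp(-dE/T)
        if delta_e <= 0 or random.random() < math.exp(-delta_e / temperature):
            x = candidate
        samples.append(x)
    return samples

# Example: sample from a standard Gaussian, whose "energy" is x**2 / 2
draws = metropolis(lambda x: 0.5 * x * x, x0=0.0)

Simulated annealing, described next, reuses this same loop while gradually lowering the temperature T.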
The discussion of stochastic methods is completed by describing simulated annealing for global optimization, followed by Gibbs sampling, which can be viewed as a special case of the Metropolis algorithm. With all this background on statistical mechanics at hand, the stage is set for describing the Boltzmann machine, which, in a historical context, was the first multilayer learning machine discussed in the literature; its stochastic updates are Gibbs-sampling steps, as sketched after the list below. Unfortunately, the learning process in the Boltzmann machine is very slow, particularly when the number of hidden neurons is large; hence the lack of interest in its practical use. Various methods have been proposed in the literature to overcome the limitations of the Boltzmann machine. The most successful innovation to date is the deep belief net, which distinguishes itself by the clever way in which the following two functions are combined into a powerful machine:
• generative modeling, resulting from bottom-up learning on a layer-by-layer
basis and without supervision;
• inference, resulting from top-down learning.
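A minimal sketch of the Gibbs-sampling step mentioned above is given below (Python, written for this preface; the symmetric weight matrix, temperature, and function names are illustrative assumptions, not the book's code). Each unit is visited in turn and switched on with a probability given by the logistic function of its net input, which is a single Gibbs-sampling step under the Boltzmann machine's energy function.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbs_sweep(states, weights, temperature=1.0):
    """One Gibbs-sampling sweep over the units of a Boltzmann machine.

    states  : list of 0/1 unit states
    weights : symmetric matrix with zero diagonal (weights[i][j] == weights[j][i])
    """
    n = len(states)
    for i in range(n):
        net_input = sum(weights[i][j] * states[j] for j in range(n) if j != i)
        # Unit i fires with probability sigmoid(net / T): a Gibbs update
        states[i] = 1 if random.random() < sigmoid(net_input / temperature) else 0
    return states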
Finally, Chapter 11 describes deterministic annealing to overcome the excessive
computational requirements of simulated annealing; the only problem with
deterministic annealing is that it could get trapped in a local minimum.
5. Up to this point, the focus of attention in the book has been the formulation of al-
gorithms for supervised learning, semisupervised learning, and unsupervised learn-
ing. Chapter 12, constituting the next part of the book all by itself, addresses
reinforcement learning, in which learning takes place in an on-line manner as the
result of an agent (e.g., a robot) interacting with its surrounding environment. In re-
ality, however, dynamic programming lies at the core of reinforcement learning.
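For reference, Bellman's optimality equation, which dynamic programming solves, can be written as follows (standard notation supplied here for illustration: state s, action a, transition probability p(s'|s,a), one-step reward r, discount factor γ, and optimal cost-to-go J*):

\[
J^{*}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s,a,s') + \gamma\, J^{*}(s') \,\bigr].
\]

Dynamic programming solves this equation when the transition probabilities are known; the reinforcement-learning methods discussed next approximate its solution from sampled experience.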
Accordingly, the early part of Chapter 12 is devoted to an introductory treatment
of Bellman's dynamic programming, which is then followed by showing that the two widely used methods of reinforcement learning, temporal-difference (TD) learning and Q-learning, can be derived as special cases of dynamic programming. Both TD-learning and Q-learning are relatively simple, on-line reinforcement-learning algorithms that do not require knowledge of the transition probabilities.
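As an illustration of how simple the Q-learning update is, a minimal tabular sketch follows (Python; the environment interface, learning rate alpha, and discount factor gamma are illustrative assumptions, not the book's code). The update nudges Q(s, a) toward the sampled Bellman target r + γ max_a' Q(s', a'), using observed transitions rather than known transition probabilities.

from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset(), step(a), and actions."""
    q = defaultdict(float)  # maps (state, action) to an estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy choice over the available actions
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Sampled Bellman target: no transition probabilities needed
            best_next = max(q[(next_state, a)] for a in env.actions)
            target = reward if done else reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q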
However, their practical applications are limited to situations in which the dimensionality of the state space is of moderate size. In large-scale dynamic systems, the curse of dimensionality becomes a serious issue, making not only dynamic programming,