Preface to the First Edition
The overall problem of learning from interaction to achieve goals is still far from being
solved, but our understanding of it has improved significantly. We can now place
component ideas, such as temporal-difference learning, dynamic programming, and function
approximation, within a coherent perspective with respect to the overall problem.
Our goal in writing this book was to provide a clear and simple account of the key
ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible
to readers in all of the related disciplines, but we could not cover all of these perspectives
in detail. For the most part, our treatment takes the point of view of artificial intelligence
and engineering. Coverage of connections to other fields we leave to others or to another
time. We also chose not to produce a rigorous formal treatment of reinforcement learning.
We did not reach for the highest possible level of mathematical abstraction and did not
rely on a theorem–proof format. We tried to choose a level of mathematical detail that
points the mathematically inclined in the right directions without distracting from the
simplicity and potential generality of the underlying ideas.
[Three paragraphs elided in favor of updated content in the second edition.]
In some sense we have been working toward this book for thirty years, and we have lots
of people to thank. First, we thank those who have personally helped us develop the overall
view presented in this book: Harry Klopf, for helping us recognize that reinforcement
learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and
Paul Werbos, for helping us see the value of the relationships to dynamic programming;
John Moore and Jim Kehoe, for insights and inspirations from animal learning theory;
Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more
generally, our colleagues and students who have contributed in countless ways: Ron
Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob
Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been
significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup,
Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom
Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra.
We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang
for providing specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We thank
the Air Force Office of Scientific Research, the National Science Foundation, and GTE
Laboratories for their long and farsighted support.
We also wish to thank the many people who have read drafts of this book and
provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle
Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen,
Sridhar Mahadevan, Jette Randlov, Brian Sheppard, Thomas O’Connell, Richard Coggins,
Cristina Versino, John H. Hiett, Andreas Badelt, Jay Ponte, Joe Beck, Justus Piater,
Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman,
Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn
Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our
champions at MIT Press.