Preface to the First Edition
The overall problem of learning from interaction to achieve goals is still far from being
solved, but our understanding of it has improved significantly. We can now place
component ideas, such as temporal-difference learning, dynamic programming, and function
approximation, within a coherent perspective with respect to the overall problem.
Our goal in writing this book was to provide a clear and simple account of the key
ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible
to readers in all of the related disciplines, but we could not cover all of these perspectives
in detail. For the most part, our treatment takes the point of view of artificial intelligence
and engineering. Coverage of connections to other fields we leave to others or to another
time. We also chose not to produce a rigorous formal treatment of reinforcement learning.
We did not reach for the highest possible level of mathematical abstraction and did not
rely on a theorem–proof format. We tried to choose a level of mathematical detail that
points the mathematically inclined in the right directions without distracting from the
simplicity and potential generality of the underlying ideas.
In some sense we have been working toward this book for thirty years, and we have lots
of people to thank. First, we thank those who have personally helped us develop the overall
view presented in this book: Harry Klopf, for helping us recognize that reinforcement
learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and
Paul Werbos, for helping us see the value of the relationships to dynamic programming;
John Moore and Jim Kehoe, for insights and inspirations from animal learning theory;
Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more
generally, our colleagues and students who have contributed in countless ways: Ron
Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob
Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been
significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup,
Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom
Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra.
We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang
for providing specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We thank
the Air Force Office of Scientific Research, the National Science Foundation, and GTE
Laboratories for their long and farsighted support.
We also wish to thank the many people who have read drafts of this book and
provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle
Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen,
Sridhar Mahadevan, Jette Randlov, Brian Sheppard, Thomas O'Connell, Richard Coggins,
Cristina Versino, John H. Hiett, Andreas Badelt, Jay Ponte, Joe Beck, Justus Piater,
Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman,
Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn
Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our
champions at MIT Press.