Preface to the First Edition
The overall problem of learning from interaction to achieve goals is still far from being
solved, but our understanding of it has improved significantly. We can now place
component ideas, such as temporal-difference learning, dynamic programming, and function
approximation, within a coherent perspective with respect to the overall problem.
Our goal in writing this book was to provide a clear and simple account of the key
ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible
to readers in all of the related disciplines, but we could not cover all of these perspectives
in detail. For the most part, our treatment takes the point of view of artificial intelligence
and engineering. Coverage of connections to other fields we leave to others or to another
time. We also chose not to produce a rigorous formal treatment of reinforcement learning.
We did not reach for the highest possible level of mathematical abstraction and did not
rely on a theorem–proof format. We tried to choose a level of mathematical detail that
points the mathematically inclined in the right directions without distracting from the
simplicity and potential generality of the underlying ideas.
...
In some sense we have been working toward this book for thirty years, and we have lots
of people to thank. First, we thank those who have personally helped us develop the overall
view presented in this book: Harry Klopf, for helping us recognize that reinforcement
learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and
Paul Werbos, for helping us see the value of the relationships to dynamic programming;
John Moore and Jim Kehoe, for insights and inspirations from animal learning theory;
Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more
generally, our colleagues and students who have contributed in countless ways: Ron
Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob
Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been
significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup,
Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom
Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra.
We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang
for providing specifics of Sections 4.7, 15.1, 15.4, 15.4, and 15.6 respectively. We thank
the Air Force Office of Scientific Research, the National Science Foundation, and GTE
Laboratories for their long and farsighted support.
We also wish to thank the many people who have read drafts of this book and
provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle
Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen,
Sridhar Mahadevan, Jette Randlov, Brian Sheppard, Thomas O'Connell, Richard Coggins,
Cristina Versino, John H. Hiett, Andreas Badelt, Jay Ponte, Joe Beck, Justus Piater,
Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman,
Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn
Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our
champions at MIT Press.