Deep Reinforcement Learning: Frontiers of Artificial Intelligence

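"Deep Reinforcement Learning: Frontiers of Artificial Intelligence, 2019, Mohit Sewak, Springer Nature Singapore Pte Ltd."

Deep Reinforcement Learning (DRL) is an important branch of artificial intelligence that combines the strengths of deep learning and reinforcement learning. In traditional reinforcement learning, an agent interacts with its environment and learns an optimal policy by trial and error so as to maximize long-term reward. Deep learning gives reinforcement learning a powerful means of feature representation and model learning, which lets the agent handle complex, high-dimensional input data.

The book Deep Reinforcement Learning: Frontiers of Artificial Intelligence was written by Mohit Sewak and published in 2019 by Springer Nature Singapore Pte Ltd. It aims to explore the theory and practice of deep reinforcement learning in depth. The author is based in Pune, Maharashtra, India. The ISBNs are 978-981-13-8284-0 (print) and 978-981-13-8285-7 (eBook).

The book may cover the following core topics:

1. Reinforcement learning basics: the basic concepts of reinforcement learning, including environment, state, action, reward, policy, and value function, possibly along with classic algorithms such as Q-learning and SARSA.
2. Deep learning principles: how neural networks work, including convolutional neural networks (CNN), recurrent neural networks (RNN), and structures used specifically in reinforcement learning such as the Actor-Critic model.
3. DQN (Deep Q-Networks): how DQN applies deep learning to Q-learning to address the curse of dimensionality in traditional reinforcement learning, along with techniques such as Double DQN and prioritized experience replay.
4. Pre-training and transfer learning: how pre-trained models or transfer learning can speed up DRL training, for example the A3C algorithm on Atari games.
5. Continuous action spaces: how to handle continuous action spaces, for example with DDPG (Deep Deterministic Policy Gradient) and TD3 (Twin Delayed Deep Deterministic Policy Gradient).
6. Policy gradient methods: how to optimize the policy parameters directly, for example with REINFORCE, TRPO (Trust Region Policy Optimization), and PPO (Proximal Policy Optimization).
7. Model learning and planning: model-based and model-free approaches, that is, Model-Based Reinforcement Learning (MBRL) and Model-Free RL.
8. Applications of reinforcement learning: uses of DRL in practical problems such as game playing, robot control, autonomous driving, and resource scheduling.
9. Experiments and evaluation: how to design experiments that validate and compare DRL algorithms, and how to evaluate and debug reinforcement learning models.
10. Future challenges and trends: current challenges for DRL, such as sample efficiency, stability, generalization, and real-world deployment, and possible future directions.

The book not only analyzes the theory of deep reinforcement learning in depth but may also include many practical case studies and code examples that help readers understand and master the technique. For AI practitioners who want to deepen their research or applied work in reinforcement learning, it is a valuable reference.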

About the Author

Mohit Sewak is a Ph.D. scholar in CS&IS (Artificial Intelligence and Cyber Security) with BITS Pilani - Goa, India, and is also a lecturer on subjects like Artificial Intelligence, Machine Learning, Deep Learning and NLP for the post-graduate technical degree program. He holds several patents (USPTO & Worldwide) and publications in the field of Artificial Intelligence and Machine Learning.

Besides his academic linkages, Mohit is also actively engaged with the industry and has many accomplishments from leading the research and development initiatives of many international AI products. Mohit has been leading the Reinforcement Learning practice at QiO Technologies, the youngest player in Gartner’s magic quadrant for Industry 4.0.

In his previous roles, Mohit led the innovation initiative of IBM Watson Commerce in India in the cognitive line of features as Sr. Cognitive Data Scientist. Mohit was also the Principal Data Scientist for IBM’s global IoT products like IBM Predictive Maintenance & Quality, and the advanced analytics architect for IBM’s SPSS suite in India.

Mohit has over 14 years of rich experience in researching, architecting and solutioning with technologies like TensorFlow, Torch, Caffe, Theano, Keras, OpenAI, OpenCV, SpaCy, Gensim, Spark, Kafka, ES, Kubernetes, and Tinkerpop.


Chapter 1

Introduction to Reinforcement Learning

The Intelligence Behind the AI Agent

Abstract In this chapter, we will discuss what Reinforcement Learning is and its relationship with Artificial Intelligence. We will then go deeper to understand the basic building blocks of Reinforcement Learning, such as the state, actor, environment, and reward, and will try to understand the challenges in each of these aspects using multiple examples, so that the intuition is well established and we build a solid foundation before going ahead into some advanced topics. We will also discuss how the agent learns to take the best action and the policy for learning the same. We will also learn the difference between On-Policy and Off-Policy methods.

1.1 What Is Artificial Intelligence and How Does Reinforcement Learning Relate to It?

Artificial Intelligence, from the marketing perspective of different organizations, may mean a lot of things, encompassing systems ranging from conventional analytics to more contemporary deep learning and chatbots. But technically, the use of the term Artificial Intelligence (AI) is restricted to the study and design of “Rational” agents that could act “Humanly”. Across the many definitions of Artificial Intelligence given by different researchers and authors, the criteria for calling an agent an AI agent are that it should possess the ability to demonstrate “thought-process and reasoning”, “intelligent-behavior”, “success in terms of human performance”, and “rationality”. These criteria should be our guide for telling real Artificial Intelligence systems and applications apart from marketing jargon and hype.

Among the different Artificial Intelligence agents, Reinforcement Learning agents are considered to be among the most advanced and very capable of demonstrating a high level of intelligence and rational behavior. A reinforcement learning agent interacts with its environment. The environment itself could demonstrate multiple states. The agent acts upon the environment to change the environment’s state, thereby also receiving a reward or penalty as determined by the achieved state and the objective of the agent. This definition may look naïve, but the concepts empowering it led to the development of many advanced AI agents that perform very complex tasks, sometimes even challenging human performance at specific tasks.

M. Sewak, Deep Reinforcement Learning,

https://doi.org/10.1007/978-981-13-8285-7_1

1.2 Understanding the Basic Design of Reinforcement Learning

The diagram in Fig. 1.1 represents a very basic design of a Reinforcement Learning system with its “learning” and “action” loops. Here an agent, as described in the introductory definition above, interacts with its environment to learn to take the best possible action ($a_t$ in the figure) under the given state ($S_t$) that the environment is in at step $t$. The action of the agent in turn changes the state of the environment from $S_t$ to $S_{t+1}$ (as shown in the figure) and generates a reward $r_t$ for the agent. Then the agent takes the best possible action for this new state ($S_{t+1}$), thereby invoking a reward $r_{t+1}$, and so on. Over a period of iterations (which are referred to as experiments during the training process of the agent), the agent tries to improve its decision about which “best action” could be taken in a given state of the environment, using the rewards that it receives during the training process.
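The action loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: `ToyEnvironment`, its five-state corridor, the reward values, and `run_episode` are all hypothetical names and numbers that do not come from the book, and the learning step that would update the policy from the collected rewards is deliberately left out.

```python
import random


class ToyEnvironment:
    """Hypothetical corridor of n_states cells; reaching the last cell ends the run."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the state stays inside the corridor.
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        # Assumed rewards: +1.0 on reaching the goal, a small penalty otherwise.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def run_episode(env, policy, max_steps=50):
    """Observe S_t, choose a_t, then receive r_t and S_{t+1}, until done."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    # An untrained agent: picks left or right with equal probability.
    return random.choice([-1, 1])


trajectory = run_episode(ToyEnvironment(), random_policy)
total_reward = sum(reward for _, _, reward, _ in trajectory)
```

A learning loop would wrap `run_episode`, using the recorded `(state, action, reward, next_state)` tuples to adjust the policy between iterations.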

The role of the environment here is thus to present the agent with the different possible/probable states that could exist in the problem and that the agent may need to react to, or a representative subset of the same. To assist the learning process of the agent, the environment also gives the reward or penalty (a negative reward) corresponding to the action decisions taken by the agent in a given state. Thus, the reward is a function of both the action and the state, and not of the action alone, which means that the same action could (and ideally should) receive a different reward under different states.
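To make the point about state-dependent rewards concrete, here is a minimal sketch of a reward lookup keyed by (state, action) pairs. The state names, action names, and reward values are hypothetical and chosen only for illustration.

```python
# Hypothetical reward table: each key is a (state, action) pair.
REWARDS = {
    ("exam_tomorrow", "study"): 5.0,
    ("exam_tomorrow", "play"): -5.0,
    ("holiday", "study"): 0.5,
    ("holiday", "play"): 2.0,
}


def reward(state, action):
    """Return the reward for taking `action` while the environment is in `state`."""
    return REWARDS[(state, action)]


# The same action ("play") earns different rewards in different states.
play_before_exam = reward("exam_tomorrow", "play")
play_on_holiday = reward("holiday", "play")
```

Keying the table on the pair rather than on the action alone is what lets the environment reward or penalize the same action differently depending on the state.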

Fig. 1.1 Design for a Reinforcement Learning system


technique later in this book, but for now our objective is to highlight the challenge of achieving parity across rewards of varying magnitude that arrive at different time/step spans in the future and that could be attributed to the action taken at the present time/step.
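One common way to put rewards from different future steps on a comparable footing is to discount them; whether this is the technique the text alludes to is an assumption here. The sketch below uses a hypothetical discount factor `gamma` and made-up reward sequences to compare a small immediate reward with a larger delayed one.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


# Hypothetical reward sequences, one entry per future step.
play_now = discounted_return([1.0, 0.0, 0.0, 0.0])     # small reward immediately
study_now = discounted_return([0.0, 0.0, 0.0, 10.0])   # larger reward three steps later
```

With `gamma` closer to 0 the delayed reward counts for less, and with `gamma` closer to 1 it counts for almost its full value, so the choice of `gamma` decides how the two options compare.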

1.3.2 Probabilistic/Uncertain Rewards

Another complexity in reinforcement learning is the probabilistic nature of the rewards, or uncertainty in the rewards. Let us take the same example of studying now for a reward (good marks) later. Suppose we have 10 chapters in the course and we know that the questions are going to come from only six of these chapters, but we do not know from which specific six chapters the questions are going to come, or how much weightage each of the chosen six chapters will have in the exams. Let us also assume that in a 3 h slot that our protagonist could spend playing an outdoor game, she could instead study only one of the 10 chapters.

So even if we assume that the perceived value of the future reward is worth studying for instead of playing, we cannot be sure that we are not spending the present time studying a chapter from which there will be no questions at all. Even if the chosen chapter turns out to be an important one, we do not know the specific weightage of the marks of the questions that would come from this chapter. So the rewards from studying now would not only be realized in the future but could also be probabilistic or uncertain.
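The uncertainty in this example can be quantified with a short calculation. The sketch below assumes, hypothetically, that every set of six examined chapters is equally likely and that the six examined chapters carry equal weightage; neither assumption is stated in the text.

```python
from fractions import Fraction
from itertools import combinations

CHAPTERS = range(10)   # 10 chapters in the course
EXAMINED = 6           # questions come from only six of them


def prob_chapter_examined(chapter=0):
    """Probability that a given studied chapter is among the six examined ones."""
    subsets = list(combinations(CHAPTERS, EXAMINED))
    hits = sum(1 for subset in subsets if chapter in subset)
    return Fraction(hits, len(subsets))


p = prob_chapter_examined()

# Expected percentage of marks covered by one studied chapter, under the
# assumed equal weightage of 100/6 percent per examined chapter.
expected_share = p * Fraction(100, EXAMINED)
```

Under these assumptions a single 3 h study slot pays off only with some probability, and its expected payoff is smaller than the payoff in the lucky case where the chapter is examined.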

1.3.3 Attribution of Rewards to Different Actions Taken in the Past

Another important consideration is the attribution of the rewards to the specific action/s taken in the past. Continuing with the above example, suppose that out of the 10 chapters, or the corresponding 10 slots-to-play, we decide that the protagonist/agent would randomly pick six instances where she/it would study any one of the 10 chapters (assume chosen randomly) and would play in the remaining four slots. We make specific choices with the objective of maximizing the sum total of all the rewards, comprising the small but immediate rewards and the larger but future and probabilistic rewards.

Assume that with the choices made, the protagonist/agent finally ends up scoring 50% in its exam. Further assume that of the six chapters we chose to study, the questions came from only four of them, with weightages of marks of 5%, 8%, 12%, and 15%, respectively, and the remaining 50% of the questions came from the four chapters that we could not prepare because we decided to go and play instead (yes, we were sort of unlucky). Assume that of the questions that the agent answered (worth


