Therefore, the authors in [29] develop a framework for prior-
itizing experiences, so as to replay important transitions more
frequently, and thus learn more efficiently. Ideally, we
want to sample more frequently those transitions from which
there is much to learn. In general, DQL with Prioritized
Experience Replay (PER) samples transitions with a probability
related to their last encountered absolute temporal-difference (TD) error [29]. New
transitions are inserted into the replay buffer with maximum
priority, providing a bias towards recent transitions. Note that
stochastic transitions may also be favoured, even when there
is little left to learn about them. Through experiments on
Atari games, the authors demonstrate that DQL with
PER outperforms DQL with uniform replay on 41 out of 49
games. However, this solution is only appropriate when the
important experiences in the replay memory D can be
identified and quantified.
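As a rough illustration of the proportional variant of PER, the following sketch stores new transitions with maximum priority, samples them with probability shaped by priority, and updates priorities from the absolute TD errors. The class name, the list-based storage, and the omission of importance-sampling weights and of the sum-tree structure used in [29] are simplifications for illustration only.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal sketch of proportional PER (a sum-tree is used in practice for efficiency)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                  # how strongly priorities shape the sampling distribution
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions receive the current maximum priority,
        # which biases replay towards recent transitions.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability is proportional to priority^alpha.
        prios = self.priorities[:len(self.data)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Priority is the last encountered absolute TD error (plus a small constant).
        self.priorities[idx] = np.abs(td_errors) + eps
```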
3) Dueling Deep Q-Learning: The Q-values, i.e., Q(s, a),
used in the Q-learning algorithm, i.e., Algorithm 1, express
how good it is to take a certain action in a given state. The
value of an action a at a given state s can actually be
decomposed into two fundamental values. The first is the
state-value function, i.e., V(s), which estimates the importance
of being in a particular state s. The second is the advantage
function, i.e., A(a), which estimates the importance of
selecting an action a compared with the other actions. As a result,
the Q-value function can be expressed through these two
fundamental value functions as follows: Q(s, a) = V(s) + A(a).
This decomposition stems from the fact that in many MDPs
it is unnecessary to estimate both values of the Q-function
Q(s, a), i.e., the action and state values, at the same time. For
example, in many racing games, moving left or right matters
only when the agent meets obstacles or enemies. Inspired by
this observation, the authors in [30] propose using two streams,
i.e., two sequences, of fully connected layers instead of a
single sequence of fully connected layers in the DQN. The
two streams are constructed such that they provide separate
estimates of the state-value and advantage functions,
i.e., V(s) and A(a). Finally, the two streams are combined
to generate a single output Q(s, a) as follows:
$$Q(s, a; \alpha, \beta) = V(s; \beta) + \left( A(s, a; \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \alpha) \right), \qquad (9)$$
where β and α are the parameters of the two streams V(s; β)
and A(s, a; α), respectively. Here, |A| is the total number
of actions in the action space A. Then, the loss function is
derived in a similar way to (7). Through simulations,
the authors show that the proposed dueling DQN
outperforms the DDQN [26] on 50 out of 57 learned Atari games.
However, the proposed dueling architecture clearly benefits
only MDPs with large action spaces. For small action spaces,
dueling DQL can even perform worse than double DQL, as
shown by the simulation results in [30].
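The combination of the two streams in (9) can be sketched as follows. The shared feature vector, the single linear layer per stream, and all dimensions are illustrative assumptions rather than the exact architecture of [30].

```python
import numpy as np

def dueling_q_values(features, W_v, b_v, W_a, b_a):
    """Combine a state-value stream and an advantage stream as in (9).

    features: (batch, d) output of the shared layers.
    W_v, b_v: parameters beta of the V-stream; W_a, b_a: parameters alpha of the A-stream.
    """
    V = features @ W_v + b_v                       # shape (batch, 1)
    A = features @ W_a + b_a                       # shape (batch, |A|)
    # Subtracting the mean advantage keeps V and A identifiable.
    return V + (A - A.mean(axis=1, keepdims=True))

# Toy usage with random parameters (dimensions are illustrative).
rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 8))
Q = dueling_q_values(feat,
                     rng.normal(size=(8, 1)), np.zeros(1),
                     rng.normal(size=(8, 4)), np.zeros(4))
print(Q.shape)  # (2, 4): one Q-value per action
```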
4) Asynchronous Multi-Step Deep Q-Learning: Most
Q-learning methods such as DQL and dueling DQL rely on
experience replay. However, this kind of method
has several drawbacks. For example, it uses more memory
and computation resources per real interaction, and it requires
off-policy learning algorithms that can update from data
generated by an older policy. This limits the applications of
DQL. Therefore, the authors in [31] introduce a method that uses
multiple agents to train the DNN in parallel. In particular,
the authors propose a training procedure which utilizes
asynchronous gradient descent updates from multiple agents at once.
Instead of training one single agent that interacts with its
environment, multiple agents interact with their own copies
of the environment simultaneously. After a certain number of
time steps, the gradient updates accumulated by an agent are
applied to a global model, i.e., the DNN. These updates are
asynchronous and lock-free. In addition, to trade off between
bias and variance in the policy gradient, the authors adopt
the n-step update method [1] for the return. In
particular, the truncated n-step return is defined as
$r_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1}$. Thus, the alternative loss for
each agent is derived as:
$$\left( r_j^{(n)} + \gamma_j^{(n)} \max_{a'} \hat{Q}\left(s_{j+n}, a'; \theta^{-}\right) - Q\left(s_j, a_j; \theta\right) \right)^2. \qquad (10)$$
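As a minimal illustration of the truncated n-step return and of the loss in (10), the sketch below assumes tabular Q and target-Q arrays; the function names and the tabular representation are hypothetical simplifications, not the network-based setup of [31].

```python
import numpy as np

def n_step_return(rewards, gamma):
    """Truncated n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_loss(q, q_target, s_j, a_j, rewards, s_j_plus_n, gamma):
    """Squared n-step TD error of (10), with q and q_target as tabular arrays."""
    n = len(rewards)
    target = n_step_return(rewards, gamma) + (gamma ** n) * np.max(q_target[s_j_plus_n])
    return (target - q[s_j, a_j]) ** 2

# Toy usage: 3 states, 2 actions, a 2-step rollout.
q = np.zeros((3, 2)); q_target = np.zeros((3, 2))
print(n_step_loss(q, q_target, s_j=0, a_j=1, rewards=[1.0, 0.5], s_j_plus_n=2, gamma=0.99))
```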
The effects on training speed and quality of the proposed
asynchronous DQL with multi-step learning are analyzed
for various reinforcement learning methods, e.g., 1-step
Q-learning, 1-step SARSA, and n-step Q-learning. The authors show
that asynchronous updates have a stabilizing effect on policy
and value updates. Also, the proposed method outperforms the
state-of-the-art algorithms on the Atari games while
training in half the time on a single multi-core CPU
instead of a GPU. As a result, some recent applications of
asynchronous DQL have been developed for handover control
problems in wireless systems [32].
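The asynchronous, lock-free application of accumulated gradients to a shared global model can be sketched as follows. The placeholder gradient, the learning rate, and the worker and step counts are illustrative assumptions, not the actual training setup of [31].

```python
import threading
import numpy as np

global_theta = np.zeros(4)          # shared global model parameters (the "global DNN")
T_MAX, N_WORKERS = 5, 4             # illustrative values

def fake_gradient(theta, rng):
    # Placeholder for the gradient of the n-step loss in (10) w.r.t. theta.
    return rng.normal(size=theta.shape) * 0.01

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(100):
        grad_acc = np.zeros_like(global_theta)
        for _ in range(T_MAX):                      # accumulate gradients for T_MAX steps
            grad_acc += fake_gradient(global_theta, rng)
        # Asynchronous, lock-free update of the shared parameters.
        global_theta[:] = global_theta - 0.1 * grad_acc

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(global_theta)
```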
5) Distributional Deep Q-Learning: All aforementioned
methods use the Bellman equation to approximate the expected
value of future rewards. However, if the environment is
stochastic in nature and the future rewards follow a multimodal
distribution, choosing actions based on the expected value may not
lead to the optimal outcome. For example, suppose we know that the
expected transmission time of a packet in a wireless network
is 20 minutes. This information may not be very meaningful,
because the expected value can overestimate the actual transmission
time most of the time. In particular, the expected transmission time
is calculated over both normal transmissions (without collisions)
and interfered transmissions (with collisions). Although
interfered transmissions happen rarely, they take a very long time,
so the expectation overestimates the typical transmission time.
This makes such estimates not very useful for the DQL algorithms.
Thus, the authors in [33] introduce a solution using dis-
tributional reinforcement learning to update the Q-value function
based on its distribution rather than its expectation. In par-
ticular, let Z(s, a) be the return obtained by starting from
state s, executing action a, and following the current policy,
then Q(s, a)=E[Z(s, a)]. Here, Z represents the distribu-
tion of future rewards, which is no longer a scalar quantity like
the Q-values.
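One common way to make Z(s, a) concrete, used by the categorical instantiation associated with [33], is to represent it as a discrete distribution over a fixed set of atoms. The sketch below, with an illustrative atom count and value range, shows how Q(s, a) = E[Z(s, a)] is recovered from such a distribution.

```python
import numpy as np

# Fixed support of N atoms between V_MIN and V_MAX (values are illustrative).
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)

def expected_q(probs):
    """Q(s, a) = E[Z(s, a)] when Z(s, a) is a categorical distribution over the atoms."""
    return np.dot(probs, atoms)

# Toy usage: a uniform return distribution for one state-action pair.
p = np.full(N_ATOMS, 1.0 / N_ATOMS)
print(expected_q(p))   # ~0.0 for a symmetric uniform distribution
```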
Then we obtain the distributional version of the Bellman
equation as follows: Z(s, a) = r + γZ(s′, a′). Although the
proposed distributional deep Q-learning is demonstrated to