Therefore, the authors in [29] develop a framework for prior-
itizing experiences, so as to replay important transitions more
frequently, and thus learn more efficiently. Ideally, we
want to sample more frequently those transitions from which
there is much to learn. In general, DQL with Prioritized
Experience Replay (PER) samples transitions with a probability
related to their last encountered absolute temporal-difference (TD) error [29]. New
transitions are inserted into the replay buffer with maximum
priority, providing a bias towards recent transitions. Note that
stochastic transitions may also be favoured, even when there
is little left to learn about them. Through experiments on
Atari games, the authors demonstrate that DQL with
PER outperforms DQL with uniform replay on 41 out of 49
games. However, this solution is only appropriate when the
important experiences in the replay memory D can be
identified and quantified.
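As a rough illustration of the proportional variant of PER, the following sketch stores new transitions with maximum priority, samples them with probability shaped by priority, and updates priorities from the absolute TD errors. The class name, the list-based storage, and the omission of importance-sampling weights and of the sum-tree structure used in [29] are simplifications for illustration only.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal sketch of proportional PER (a sum-tree is used in practice for efficiency)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                  # how strongly priorities shape the sampling distribution
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions receive the current maximum priority,
        # which biases replay towards recent transitions.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability is proportional to priority^alpha.
        prios = self.priorities[:len(self.data)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Priority is the last encountered absolute TD error (plus a small constant).
        self.priorities[idx] = np.abs(td_errors) + eps
```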
3) Dueling Deep Q-Learning: The Q-values, i.e., Q(s, a),
used in the Q-learning algorithm, i.e., Algorithm 1, express
how good it is to take a certain action in a given state. The
value of an action a at a given state s can actually be
decomposed into two fundamental values. The first is the
state-value function, i.e., V(s), which estimates the importance
of being in a particular state s. The second is the advantage
function, i.e., A(a), which estimates the importance of
selecting an action a compared with the other actions. As a result,
the Q-value function can be expressed through these two
fundamental value functions as follows: Q(s, a) = V(s) + A(a).
This decomposition stems from the fact that in many MDPs
it is unnecessary to estimate both values of the Q-function
Q(s, a), i.e., the action and state values, at the same time. For
example, in many racing games, moving left or right matters
only when the agent meets obstacles or enemies. Inspired by
this observation, the authors in [30] propose using two streams,
i.e., two sequences, of fully connected layers instead of a
single sequence of fully connected layers in the DQN. The
two streams are constructed such that they provide separate
estimates of the state-value and advantage functions,
i.e., V(s) and A(a). Finally, the two streams are combined
to generate a single output Q(s, a) as follows:
$$Q(s, a; \alpha, \beta) = V(s; \beta) + \left( A(s, a; \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \alpha) \right), \qquad (9)$$
where β and α are the parameters of the two streams V(s; β)
and A(s, a; α), respectively. Here, |A| is the total number
of actions in the action space A. Then, the loss function is
derived in a similar way to (7). Through simulations,
the authors show that the proposed dueling DQN
outperforms the DDQN [26] on 50 out of 57 learned Atari games.
However, the proposed dueling architecture clearly benefits
only MDPs with large action spaces. For small action spaces,
dueling DQL can even perform worse than double DQL, as
shown by the simulation results in [30].
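The combination of the two streams in (9) can be sketched as follows. The shared feature vector, the single linear layer per stream, and all dimensions are illustrative assumptions rather than the exact architecture of [30].

```python
import numpy as np

def dueling_q_values(features, W_v, b_v, W_a, b_a):
    """Combine a state-value stream and an advantage stream as in (9).

    features: (batch, d) output of the shared layers.
    W_v, b_v: parameters beta of the V-stream; W_a, b_a: parameters alpha of the A-stream.
    """
    V = features @ W_v + b_v                       # shape (batch, 1)
    A = features @ W_a + b_a                       # shape (batch, |A|)
    # Subtracting the mean advantage keeps V and A identifiable.
    return V + (A - A.mean(axis=1, keepdims=True))

# Toy usage with random parameters (dimensions are illustrative).
rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 8))
Q = dueling_q_values(feat,
                     rng.normal(size=(8, 1)), np.zeros(1),
                     rng.normal(size=(8, 4)), np.zeros(4))
print(Q.shape)  # (2, 4): one Q-value per action
```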
4) Asynchronous Multi-Step Deep Q-Learning: Most
Q-learning methods such as DQL and dueling DQL rely on
experience replay. However, this kind of method
has several drawbacks. For example, it uses more memory
and computation resources per real interaction, and it requires
off-policy learning algorithms that can update from data
generated by an older policy. This limits the applications of
DQL. Therefore, the authors in [31] introduce a method that uses
multiple agents to train the DNN in parallel. In particular,
the authors propose a training procedure which utilizes
asynchronous gradient descent updates from multiple agents at once.
Instead of training one single agent that interacts with its
environment, multiple agents interact with their own copies
of the environment simultaneously. After a certain number of
time steps, the gradient updates accumulated by an agent are
applied to a global model, i.e., the DNN. These updates are
asynchronous and lock-free. In addition, to trade off between
bias and variance in the policy gradient, the authors adopt
the n-step update method [1] for the return. In
particular, the truncated n-step return is defined as
$r_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1}$. Thus, the alternative loss for
each agent is derived as:
$$\left( r_j^{(n)} + \gamma_j^{(n)} \max_{a'} \hat{Q}\left(s_{j+n}, a'; \theta^{-}\right) - Q\left(s_j, a_j; \theta\right) \right)^2. \qquad (10)$$
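As a minimal illustration of the truncated n-step return and of the loss in (10), the sketch below assumes tabular Q and target-Q arrays; the function names and the tabular representation are hypothetical simplifications, not the network-based setup of [31].

```python
import numpy as np

def n_step_return(rewards, gamma):
    """Truncated n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_loss(q, q_target, s_j, a_j, rewards, s_j_plus_n, gamma):
    """Squared n-step TD error of (10), with q and q_target as tabular arrays."""
    n = len(rewards)
    target = n_step_return(rewards, gamma) + (gamma ** n) * np.max(q_target[s_j_plus_n])
    return (target - q[s_j, a_j]) ** 2

# Toy usage: 3 states, 2 actions, a 2-step rollout.
q = np.zeros((3, 2)); q_target = np.zeros((3, 2))
print(n_step_loss(q, q_target, s_j=0, a_j=1, rewards=[1.0, 0.5], s_j_plus_n=2, gamma=0.99))
```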
The effects on training speed and quality of the proposed
asynchronous DQL with multi-step learning are analyzed
for various reinforcement learning methods, e.g., 1-step
Q-learning, 1-step SARSA, and n-step Q-learning. The authors show
that asynchronous updates have a stabilizing effect on policy
and value updates. Also, the proposed method outperforms the
state-of-the-art algorithms on the Atari games while
training in half the time on a single multi-core CPU
instead of a GPU. As a result, some recent applications of
asynchronous DQL have been developed for handover control
problems in wireless systems [32].
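The asynchronous, lock-free application of accumulated gradients to a shared global model can be sketched as follows. The placeholder gradient, the learning rate, and the worker and step counts are illustrative assumptions, not the actual training setup of [31].

```python
import threading
import numpy as np

global_theta = np.zeros(4)          # shared global model parameters (the "global DNN")
T_MAX, N_WORKERS = 5, 4             # illustrative values

def fake_gradient(theta, rng):
    # Placeholder for the gradient of the n-step loss in (10) w.r.t. theta.
    return rng.normal(size=theta.shape) * 0.01

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(100):
        grad_acc = np.zeros_like(global_theta)
        for _ in range(T_MAX):                      # accumulate gradients for T_MAX steps
            grad_acc += fake_gradient(global_theta, rng)
        # Asynchronous, lock-free update of the shared parameters.
        global_theta[:] = global_theta - 0.1 * grad_acc

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(global_theta)
```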
5) Distributional Deep Q-Learning: All aforementioned
methods use the Bellman equation to approximate the expected
value of future rewards. However, if the environment is
stochastic in nature and the future rewards follow a multimodal
distribution, choosing actions based on the expected value may not
lead to the optimal outcome. For example, suppose we know that the
expected transmission time of a packet in a wireless network
is 20 minutes. This information may not be very meaningful,
because the expected value can overestimate the actual transmission
time most of the time. In particular, the expected transmission time
is calculated over both normal transmissions (without collisions)
and interfered transmissions (with collisions). Although
interfered transmissions happen rarely, they take a very long time,
so the expectation overestimates the typical transmission time.
This makes such estimates not very useful for the DQL algorithms.
Thus, the authors in [33] introduce a solution using dis-
tributional reinforcement learning to update the Q-value function
based on its distribution rather than its expectation. In par-
ticular, let Z(s, a) be the return obtained by starting from
state s, executing action a, and following the current policy,
then Q(s, a)=E[Z(s, a)]. Here, Z represents the distribu-
tion of future rewards, which is no longer a scalar quantity like
the Q-values.
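One common way to make Z(s, a) concrete, used by the categorical instantiation associated with [33], is to represent it as a discrete distribution over a fixed set of atoms. The sketch below, with an illustrative atom count and value range, shows how Q(s, a) = E[Z(s, a)] is recovered from such a distribution.

```python
import numpy as np

# Fixed support of N atoms between V_MIN and V_MAX (values are illustrative).
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)

def expected_q(probs):
    """Q(s, a) = E[Z(s, a)] when Z(s, a) is a categorical distribution over the atoms."""
    return np.dot(probs, atoms)

# Toy usage: a uniform return distribution for one state-action pair.
p = np.full(N_ATOMS, 1.0 / N_ATOMS)
print(expected_q(p))   # ~0.0 for a symmetric uniform distribution
```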
Then we obtain the distributional version of the Bellman
equation as follows: Z(s, a) = r + γZ(s′, a′). Although the
proposed distributional deep Q-learning is demonstrated to