Figure 2: High-resolution screenshots of the Labyrinth environments. (a) Forage and Avoid, showing the apples (positive rewards) and lemons (negative rewards). (b) Double T-maze, showing cues at the turning points. (c) Top view of a Double T-maze configuration; the cues indicate the reward is located at the top left.
state was discarded. The $k$-nearest-neighbour lookups used $k = 50$. The discount rate was set to $\gamma = 0.99$. Exploration was achieved by using an $\epsilon$-greedy policy with $\epsilon = 0.005$. As a baseline, we used A3C [22]. Labyrinth levels have deterministic transitions and rewards, but the initial location and facing direction are randomised, and the environment is much richer, being 3-dimensional. For this reason, unlike Atari, experiments on Labyrinth encounter very few exact matches in the buffers of $Q^{EC}$-values; less than 0.1% in all three levels.
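As a concrete illustration of these settings, the sketch below shows how an $\epsilon$-greedy action choice can be backed by per-action episodic buffers, returning an exact-match value when the state has been seen before and the average of the $k = 50$ nearest stored values otherwise. The function and variable names (`estimate_qec`, `select_action`, `memories`) are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

K = 50           # number of nearest neighbours used in the lookup
EPSILON = 0.005  # exploration rate of the epsilon-greedy policy

def estimate_qec(memory, state_embedding, k=K):
    """Estimate Q^EC for one action: use an exact match if the state was
    seen before, otherwise average the values of the k nearest stored states."""
    states, values = memory  # states: (N, d) array, values: (N,) array
    dists = np.linalg.norm(states - state_embedding, axis=1)
    exact = np.where(dists == 0.0)[0]
    if exact.size > 0:       # rare on Labyrinth (<0.1% of lookups)
        return values[exact[0]]
    nearest = np.argsort(dists)[:k]
    return values[nearest].mean()

def select_action(memories, state_embedding, epsilon=EPSILON):
    """Pick a random action with probability epsilon, otherwise the action
    whose episodic buffer gives the highest Q^EC estimate."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(memories))
    q_values = [estimate_qec(m, state_embedding) for m in memories]
    return int(np.argmax(q_values))
```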
Each level is progressively more difficult. The first level, called Forage, requires the agent to collect
apples as quickly as possible by walking through them. Each apple provides a reward of 1. A simple
policy of turning until an apple is seen and then moving towards it suffices here. Figure 1 shows that
the episodic controller found an apple-seeking policy very quickly. Eventually A3C caught up and finally outperformed the episodic controller with a more efficient strategy for picking up apples.
The second level, called Forage and Avoid, involves collecting apples, which provide a reward of 1, while avoiding lemons, which incur a reward of $-1$. The level is depicted in Figure 2(a). This level requires only a slightly more complicated policy than Forage (the same policy plus avoiding lemons), yet A3C took over 40 million steps to reach the same performance that episodic control attained in fewer than 3 million frames.
The third level, called Double-T-Maze, requires the agent to walk in a maze with four ends (a map is shown in Figure 2(c)); one of the ends contains an apple, while the other three contain lemons. At each intersection the agent is presented with a colour cue that indicates the direction in which the apple is located (see Figure 2(b)): left if red, or right if green. If the agent walks through a lemon, it incurs a reward of $-1$. However, if it walks through the apple, it receives a reward of 1, is teleported back to the starting position, and the location of the apple is resampled. The duration of an episode is limited to 1 minute, in which the agent can reach the apple multiple times if it solves the task fast enough. Double-T-Maze is a difficult RL problem: rewards are sparse. In fact, A3C never achieved an expected reward above zero. Due to the sparse-reward nature of the Double-T-Maze level, A3C did not update the policy strongly enough in the few instances in which a reward was encountered through random diffusion in the state space. In contrast, the episodic controller exhibited behaviour akin to one-shot learning on these instances, and was able to learn from the very few episodes that contained any rewards different from zero. As a result, the episodic controller learnt a policy with positive expected reward after observing between 20 and 30 million frames, whereas the parametric policies never achieved an expected reward higher than zero. In this case, episodic control thrived in the sparse-reward environment as it rapidly latched onto an effective strategy.
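The one-shot flavour of this learning comes from how the episodic memory is written at the end of an episode: every state-action pair on a rewarding trajectory immediately stores the discounted return it achieved, so a single successful episode is enough to change the greedy behaviour. The sketch below illustrates a tabular max-return write of this kind; the names (`write_episode`, `memory`) and the dictionary-based buffer are our own simplified assumptions rather than the paper's exact data structure.

```python
def write_episode(memory, trajectory, gamma=0.99):
    """End-of-episode update: propagate discounted returns backwards in time
    and keep the best return seen so far for each (state, action) pair.

    memory: dict mapping (state_key, action) -> best discounted return
    trajectory: list of (state_key, action, reward) tuples in time order
    """
    g = 0.0
    for state_key, action, reward in reversed(trajectory):
        g = reward + gamma * g  # discounted return from this step onwards
        key = (state_key, action)
        # A single rewarding episode immediately raises the stored value,
        # which the greedy lookup can exploit on the very next episode.
        memory[key] = max(memory.get(key, float("-inf")), g)
```

Because the write takes a maximum, a later unlucky episode cannot overwrite the high return recorded from a successful one, which is consistent with the controller rapidly latching onto an effective strategy after its first rewarding episodes.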
4.3 Effect of number of nearest neighbours on final score
Finally, we compared the effect of varying $k$ (the number of nearest neighbours) on both Labyrinth
and Atari tasks using VAE features. In our experiments above, we noticed that on Atari re-visiting
the same state was common, and that random projections typically performed the same or better
than VAE features. A further interesting observation is that the learnt VAEs on Atari games did not yield a higher score as the number of neighbours increased, except on one game, Q*bert, where VAEs performed reasonably well (see Figure 3a). On Labyrinth levels, we observed that the VAEs
outperformed random projections and the agent rarely encountered the same state more than once.
Interestingly for this case, Figure 3b shows that increasing the number of nearest neighbours has a