Published as a conference paper at ICLR 2020
progress in deep RL through surfacing dozens of Atari 2600 games as learning environments
(Bellemare et al., 2013). Similar projects have been crucial to progress in continuous control
(Duan et al., 2016; Tassa et al., 2018), model-based RL (Wang et al., 2019) and even rich
3D games (Beattie et al., 2016). Performing well in these complex environments requires
the integration of many core agent capabilities. We might think of these benchmarks as
natural successors to ‘CartPole’ or ‘MountainCar’.
The Behaviour Suite for Reinforcement Learning offers a complementary approach to exist-
ing benchmarks in RL, with several novel components:
1. bsuite experiments enforce a specific methodology for agent evaluation beyond just the
environment definition. This is crucial for scientific comparisons, and its absence has
become a major problem for many benchmark suites (Machado et al., 2017) (Section 2).
2. bsuite aims to isolate core capabilities with targeted ‘unit tests’, rather than to integrate
general learning ability. Whereas other benchmarks evolve by increasing complexity, bsuite
aims to remove all confounds from the core agent capabilities of interest (Section 3).
3. bsuite experiments are designed with an emphasis on scalability rather than final per-
formance. Previous ‘unit tests’ (such as ‘Taxi’ or ‘RiverSwim’) are of fixed size; bsuite
experiments are specifically designed to vary their complexity smoothly (Section 2).
4. github.com/deepmind/bsuite places a strong emphasis on ease of use and on
compatibility with RL agents not specifically designed for bsuite. Evaluating an agent
on bsuite is practical even for agents designed for a different benchmark (Section 4).
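The idea of ‘varying complexity smoothly’ (point 3) can be illustrated with a toy
parameterized environment family. The sketch below is purely illustrative and is not the
bsuite implementation: `MemoryChain` is a hypothetical stand-in for a memory task whose
difficulty is indexed by a single length parameter, so an experiment can sweep over
increasing sizes rather than test one fixed-size instance.

```python
# A minimal sketch of a scalable experiment family: one toy environment
# indexed by a size parameter. `MemoryChain` is an illustrative stand-in,
# not the bsuite implementation of the 'memory length' experiment.
import random


class MemoryChain:
    """Remember a bit shown at step 0 and report it after `length` steps."""

    def __init__(self, length, seed=0):
        self.length = length
        self._rng = random.Random(seed)

    def reset(self):
        self._context = self._rng.randint(0, 1)
        self._t = 0
        return self._context  # observation is only informative at t = 0

    def step(self, action):
        self._t += 1
        done = self._t >= self.length
        # Reward only at the final step, for recalling the initial context.
        reward = float(action == self._context) if done else 0.0
        return 0, reward, done  # (observation, reward, episode-end flag)


# An experiment sweeps this family over smoothly increasing difficulty.
sweep = [MemoryChain(length) for length in (1, 2, 4, 8, 16, 32)]
```

An agent with perfect memory scores 1.0 at every length, while a memoryless agent
degrades as the length grows; the sweep exposes where each agent's capability breaks down.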
2 Experiments
This section outlines the experiments included in the Behaviour Suite for Reinforcement
Learning 2019 release. In the context of bsuite, an experiment consists of three parts:
1. Environments: a fixed set of environments determined by some parameters.
2. Interaction: a fixed regime of agent/environment interaction (e.g. 100 episodes).
3. Analysis: a fixed procedure that maps agent behaviour to results and plots.
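The three-part structure above can be sketched in code. The example below is a minimal
illustration under assumed names (`make_environments`, `interact`, `analyse`, and a
two-armed Bernoulli bandit are all hypothetical, not the bsuite API): a fixed set of
environments determined by parameters, a fixed interaction regime of 100 episodes, and a
fixed analysis mapping behaviour to results.

```python
# Sketch of a bsuite-style experiment's three parts. All names are
# illustrative stand-ins, not the actual bsuite interfaces.
import random


def make_environments(num_arms=2, num_seeds=3):
    """Part 1 - Environments: a fixed set determined by some parameters.
    Each 'environment' here is a tuple of Bernoulli arm probabilities."""
    rngs = [random.Random(seed) for seed in range(num_seeds)]
    return [tuple(rng.random() for _ in range(num_arms)) for rng in rngs]


def interact(arms, agent, num_episodes=100, seed=0):
    """Part 2 - Interaction: a fixed regime of exactly `num_episodes`
    one-step episodes; returns the observed episodic rewards."""
    rng = random.Random(seed)
    return [float(rng.random() < arms[agent(rng)]) for _ in range(num_episodes)]


def analyse(rewards):
    """Part 3 - Analysis: a fixed procedure mapping behaviour to a result
    (here, simply the mean episodic reward)."""
    return sum(rewards) / len(rewards)


random_agent = lambda rng: rng.randint(0, 1)  # a trivial baseline agent
results = [analyse(interact(arms, random_agent)) for arms in make_environments()]
```

Because the environments, interaction regime, and analysis are all fixed by the experiment
rather than chosen by the experimenter, results are directly comparable across agents.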
One crucial part of each bsuite analysis defines a ‘score’ that maps agent performance on
the task to [0, 1]. This score allows for agent comparison ‘at a glance’; the accompanying
Jupyter notebook includes further detailed analysis for each experiment. All experiments in bsuite only
measure behavioural aspects of RL agents. This means that they only measure properties
that can be observed in the environment, and are not internal to the agent. It is this choice
that allows bsuite to easily generate and compare results across different algorithms and
codebases. Researchers may still find it useful to investigate internal aspects of their agents
on bsuite environments, but it is not part of the standard analysis.
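One simple way such a behavioural score might be computed, shown here as a hedged sketch
rather than the exact bsuite formula, is to rescale an agent's mean return between a random
baseline and the optimal return, clipping the result into [0, 1]:

```python
def behaviour_score(mean_return, baseline, optimal):
    """Hypothetical score normalisation (not the exact bsuite formula):
    rescale mean return to [0, 1], clipping at both ends so that
    below-baseline behaviour scores 0 and optimal behaviour scores 1."""
    raw = (mean_return - baseline) / (optimal - baseline)
    return min(max(raw, 0.0), 1.0)
```

Because the score depends only on observed returns, it can be computed identically for any
agent and codebase, which is exactly what the behavioural restriction above enables.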
Every current and future bsuite experiment should target some key issue in RL. We aim
for simple behavioural experiments, where agents that implement some concept well score
better than those that don’t. For an experiment to be included in bsuite it should embody
five key qualities:
• Targeted: performance in this task corresponds to a key issue in RL.
• Simple: strips away confounding/confusing factors in research.
• Challenging: pushes agents beyond the normal range.
• Scalable: provides insight on scalability, not performance on one environment.
• Fast: iteration from launch to results in under 30min on standard CPU.
Where our current experiments fall short, we see this as an opportunity to improve the
Behaviour Suite for Reinforcement Learning in future iterations. We can do this both
through replacing experiments with improved variants, and through broadening the scope
of issues that we consider.
We maintain the full description of each of our experiments through the code and accom-
panying documentation at github.com/deepmind/bsuite. In the following subsections, we
review two bsuite experiments in detail: ‘memory length’ and ‘deep sea’. By presenting
these experiments as examples, we can emphasize what we think makes bsuite a valuable
tool for investigating core RL issues. We provide a high-level summary of all other current
experiments in Appendix A.