Contrastive Learning Contrastive learning methods are widely used in the field of graph representation learning (Bordes et al., 2013; Perozzi et al., 2014; Grover & Leskovec, 2016; Schlichtkrull et al., 2018; Veličković et al., 2018), and for learning word representations
(Mnih & Teh, 2012; Mikolov et al., 2013). The main idea is to construct pairs of related data examples (positive examples, e.g., nodes connected by an edge in a graph or co-occurring words in a sentence) and pairs of unrelated or corrupted data examples (negative examples), and to use a loss function that scores positive and negative pairs differently. Most energy-based losses (LeCun et al., 2006) are suitable for this task. Recent works (Oord et al., 2018; Hjelm et al., 2018; Hénaff et al., 2019;
Sun et al., 2019a; Anand et al., 2019) connect objectives of this kind to the principle of learning
representations by maximizing mutual information between data and learned representations, and
successfully apply these methods to image, speech, and video data.
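As a concrete illustration of this idea, the sketch below implements a generic margin-based energy loss in PyTorch; the encoder, the squared-distance energy, and the in-batch shuffling used to draw negatives are illustrative assumptions rather than the specific objective of any of the cited works.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(encoder, x, x_pos, margin=1.0):
    """Generic margin-based contrastive (energy-based) loss.

    x and x_pos hold batches of related examples (positive pairs),
    e.g. two nodes connected by an edge or two co-occurring words.
    Negatives are drawn by shuffling the positives within the batch,
    one of many sampling schemes used in the literature.
    """
    z = encoder(x)          # (batch, dim) anchor representations
    z_pos = encoder(x_pos)  # (batch, dim) positive representations
    z_neg = z_pos[torch.randperm(z_pos.size(0))]  # in-batch negatives

    # Energy = squared Euclidean distance: pull positive pairs together,
    # push negative pairs apart up to the margin (hinge loss).
    pos_energy = ((z - z_pos) ** 2).sum(dim=1)
    neg_energy = ((z - z_neg) ** 2).sum(dim=1)
    return (pos_energy + F.relu(margin - neg_energy)).mean()
```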
State Representation Learning State representation learning in environments similar to ours is
often approached by models based on autoencoders (Corneil et al., 2018; Watter et al., 2015; Ha &
Schmidhuber, 2018; Hafner et al., 2019; Laversanne-Finot et al., 2018) or via adversarial learning
(Kurutach et al., 2018; Wang et al., 2019). Some recent methods learn state representations without
requiring a decoder back into pixel space. Examples include the selectivity objective in Thomas et al.
(2018), the contrastive objective in François-Lavet et al. (2018), the mutual information objective in Anand et al. (2019), the distribution matching objective in Gelada et al. (2019), or causality-based losses and physical priors in latent space (Jonschkowski & Brock, 2015; Ehrhardt et al., 2018).
Most notably, Ehrhardt et al. (2018) propose a method to learn an object detection module and a
physics module jointly from raw video data without pixel-based losses. This approach, however,
can only track a single object at a time and requires careful balancing of multiple loss functions.
4 EXPERIMENTS
The goal of this experimental section is to verify whether C-SWMs can 1) learn to discover object representations from environment interactions without supervision, 2) learn an accurate transition model in latent space, and 3) generalize to novel, unseen scenes. Our implementation is available at https://github.com/tkipf/c-swm.
4.1 ENVIRONMENTS
We evaluate C-SWMs on two novel grid world environments (2D shapes and 3D blocks) involving
multiple interacting objects that can be manipulated independently by an agent, two Atari 2600
games (Atari Pong and Space Invaders), and a multi-object physics simulation (3-body physics).
See Figure 2 for example observations.
For all environments, we use a random policy to collect experience for both training and evaluation.
Observations are provided as 50 × 50 × 3 color images for the grid world environments and as 50 × 50 × 6 tensors (two concatenated consecutive frames) for the Atari and 3-body physics environments.
Additional details on environments and dataset creation can be found in Appendix B.
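The sketch below illustrates this data collection setup for a gym-style environment under the classic step API; the frame-stacking scheme and the padding of the first observation with a repeated frame are our assumptions for illustration.

```python
import numpy as np

def collect_random_transitions(env, num_steps, stack_frames=False):
    """Collect (obs, action, next_obs) triples under a random policy.

    Assumes a gym-style environment with the classic step API whose
    observations are 50x50x3 images. If stack_frames is True, two
    consecutive frames are concatenated along the channel axis into a
    50x50x6 tensor, as for the Atari and 3-body physics environments.
    """
    transitions = []
    frame = env.reset()
    prev_frame = frame  # pad the first observation with a repeated frame
    for _ in range(num_steps):
        action = env.action_space.sample()  # random policy
        next_frame, _, done, _ = env.step(action)
        if stack_frames:
            obs = np.concatenate([prev_frame, frame], axis=-1)
            next_obs = np.concatenate([frame, next_frame], axis=-1)
        else:
            obs, next_obs = frame, next_frame
        transitions.append((obs, action, next_obs))
        prev_frame, frame = frame, next_frame
        if done:
            frame = env.reset()
            prev_frame = frame
    return transitions
```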
[Figure 2 panels: (a) 2D Shapes, (b) 3D Blocks, (c) Atari Pong, (d) Space Invaders, (e) 3-Body Physics]
Figure 2: Example observations from block pushing environments (a–b), Atari 2600 games (c–d) and a 3-body gravitational physics simulation (e). In the grid worlds (a–b), each block can be independently moved in the four cardinal directions unless the target position is occupied by another block or outside of the scene. Best viewed in color.
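A minimal sketch of the move rule described in this caption, with block positions stored as grid coordinates (the 5 × 5 grid size is an illustrative assumption, not a detail taken from the paper):

```python
def try_move(positions, block_id, direction, grid_size=5):
    """Attempt to move one block a single cell in a cardinal direction.

    positions maps block id -> (row, col). The move is rejected if the
    target cell is occupied by another block or lies outside the grid.
    The 5x5 grid size is an illustrative assumption.
    """
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    row, col = positions[block_id]
    d_row, d_col = deltas[direction]
    target = (row + d_row, col + d_col)
    in_bounds = 0 <= target[0] < grid_size and 0 <= target[1] < grid_size
    if in_bounds and target not in positions.values():
        positions[block_id] = target  # move succeeds; otherwise no-op
    return positions
```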