TABLE 2
Statistics of typical simulators involving VLN. † indicates that the simulator has a built-in dataset.

Simulator               Dataset                        Observation      2D/3D
AI2-THOR†               -                              photo-realistic  3D
VizDoom†                -                              Virtual          2.5D
Gibson†                 2D-3D-S, Matterport3D          photo-realistic  3D
LANI†                   -                              Virtual          3D
House3D                 SUNCG                          photo-realistic  3D
Matterport3D Simulator  Matterport3D                   photo-realistic  3D
Habitat                 Gibson, Matterport3D, Replica  photo-realistic  3D
observes a first-person view of the environment as an RGB
image and receives a scalar reward. The simulator provides
a socket API for controlling the agent and the environment.
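The observation/reward interface described above can be illustrated with a toy stand-in (the message fields, shapes, and action names here are illustrative assumptions, not the simulator's actual socket protocol):

```python
import json

def step(action):
    """Toy stand-in for one simulator step: returns a first-person
    RGB observation and a scalar reward."""
    # tiny H x W x 3 placeholder "image" in place of a rendered frame
    rgb = [[[0, 0, 0] for _ in range(4)] for _ in range(3)]
    reward = 1.0 if action == "FORWARD" else 0.0
    return {"rgb": rgb, "reward": reward}

# A socket API would serialize such exchanges, e.g. as JSON messages:
request = json.dumps({"action": "FORWARD"})
response = step(json.loads(request)["action"])
print(response["reward"])  # → 1.0
```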
3.3.2 AI2-THOR
AI2-THOR [12] consists of near photo-realistic 3D indoor
scenes, in which AI agents can navigate and interact with
objects to perform tasks. AI2-THOR enables research in
many different domains, including but not limited to deep
reinforcement learning, imitation learning, learning by
interaction, planning, visual question answering, unsupervised
representation learning, object detection and segmentation,
and learning models of cognition.
3.3.3 VizDoom
ViZDoom [16] is based on the classic first-person shooter
video game Doom. It allows developing bots that play the
game using the screen buffer. ViZDoom is lightweight, fast,
and highly customizable via a convenient mechanism of
user scenarios.
ViZDoom provides features that can be exploited in different
kinds of AI experiments. The main features include different
control modes, custom scenarios, access to the depth buffer,
and off-screen rendering, which eliminates the need for a
graphical interface.
3.3.4 Gibson
Gibson [68] is based upon virtualizing real spaces. The
main characteristics of Gibson are: 1) it comes from the
real world and reflects its semantic complexity; 2) it has an
internal synthesis mechanism, “Goggles”, which enables
deploying trained models in the real world without needing
domain adaptation; 3) it embodies agents and makes them
subject to the constraints of physics and space.
Gibson’s underlying database of spaces includes 572 full
buildings composed of 1,447 floors covering a total area of
211k m². Each space has a set of RGB panoramas with global
camera poses and reconstructed 3D meshes. The base format
of the data is similar to the 2D-3D-Semantics dataset [66],
but it is more diverse and includes two orders of magnitude
more spaces. This dataset is released as asset files within
Gibson.
The simulator has also integrated the 2D-3D-Semantics
dataset [66] and Matterport3D [65] for optional use.
5. http://ai2thor.allenai.org
6. http://vizdoom.cs.put.edu.pl
7. http://gibson.vision/
3.4 Simulators
3.4.1 House3D
House3D [18] is an environment which closely resembles
the real world and is rich in content and structure. The
3D scenes in House3D are sourced from the SUNCG dataset
[64]. House3D offers additional annotations, e.g., top-down
2D occupancy maps, connectivity analysis, and shortest
paths between two points.
To build a realistic 3D environment, House3D offers a
renderer for the SUNCG scenes. The renderer is based on
OpenGL, runs on both Linux and macOS, and provides RGB
images, semantic segmentation masks, instance segmentation
masks, and depth maps.
In House3D, an agent can occupy any location within a
3D scene, as long as it does not collide with object instances
(including walls) within a small range, i.e., the robot’s
radius. Doors, gates, and arches are considered passageways,
meaning that an agent can walk through those structures
freely. These default design choices add negligible run-time
overhead.
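The free-space rule above can be sketched as a radius-based collision check (an illustrative sketch, not House3D's actual API; the obstacle representation and category names are assumptions):

```python
import math

def is_free(position, obstacles, robot_radius, passable=("door", "gate", "arch")):
    """Return True if an agent at `position` collides with no obstacle.

    `obstacles` is a list of (x, y, category) points sampled from object
    instances (walls included); categories in `passable` are ignored,
    mirroring House3D's treatment of doors, gates, and arches."""
    px, py = position
    for ox, oy, category in obstacles:
        if category in passable:
            continue  # agents may walk through passageways freely
        if math.hypot(px - ox, py - oy) < robot_radius:
            return False  # obstacle inside the robot's radius
    return True

obstacles = [(1.0, 0.0, "wall"), (0.2, 0.1, "door")]
print(is_free((0.0, 0.0), obstacles, 0.3))  # → True (wall is far, door is passable)
print(is_free((0.9, 0.0), obstacles, 0.3))  # → False (wall within radius)
```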
3.4.2 Matterport3D Simulator
Matterport3D Simulator [1] is a large-scale visual
reinforcement learning (RL) simulation environment for the
research and development of intelligent agents, based on the
Matterport3D dataset [65]. It allows an embodied agent to
virtually ‘move’ throughout a scene by adopting poses
coinciding with panoramic viewpoints. For each scene the
simulator includes a weighted, undirected graph over
panoramic viewpoints, G = < V, E >, such that the presence
of an edge signifies a robot-navigable transition between
two viewpoints; movement follows the edges of this
navigation graph from one viewpoint to another.
The simulator does not define or place restrictions on the
agent’s goal, reward function, or any additional context.
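The navigation graph G = < V, E > can be illustrated with a minimal sketch (the viewpoint IDs and weights are hypothetical; this is not the simulator's actual data structure):

```python
import heapq

# Weighted, undirected graph over panoramic viewpoints; an edge means a
# robot-navigable transition between two viewpoints.
edges = {
    ("vp_a", "vp_b"): 2.0,  # edge weights, e.g., metres between viewpoints
    ("vp_b", "vp_c"): 1.5,
    ("vp_a", "vp_d"): 3.0,
    ("vp_d", "vp_c"): 1.0,
}

graph = {}
for (u, v), w in edges.items():
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))  # undirected: add both directions

def shortest_path(start, goal):
    """Dijkstra over the navigation graph: the agent may only move along edges."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(shortest_path("vp_a", "vp_c"))  # → (3.5, ['vp_a', 'vp_b', 'vp_c'])
```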
3.4.3 Habitat
Habitat [46] is a platform for research in embodied
artificial intelligence (AI). Habitat enables training embodied
agents (virtual robots) in highly efficient photo-realistic 3D
simulation. Specifically, Habitat consists of:
8. http://github.com/facebookresearch/House3D
9. https://github.com/peteanderson80/Matterport3DSimulator
10. https://aihabitat.org