$\mathrm{output}_i \leftarrow \mathrm{Linear}(\mathrm{concat}(h_i^{\mathrm{in}}, \mathrm{result}_i, (x_i, y_i, z_i), s_{\mathrm{robot}}))$. In practice, we use multiple query heads per block, so that the size of each $\mathrm{result}_i$ will be proportional to the number of query heads.
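To make the role of the query heads concrete, here is a NumPy sketch of one such attention block; the weight names, scaling, and dimensions are our own illustrative assumptions rather than the model's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighborhood_attention_block(h, pos, s_robot, W_q, W_k, W_v, W_out):
    """One neighborhood-attention block over B blocks (illustrative).

    h       : (B, D)   per-block input embeddings h_i^in
    pos     : (B, 3)   (x_i, y_i, z_i) position of each block
    s_robot : (R,)     robot state
    W_q, W_k, W_v : (H, D, Dk)  one projection per query head
    W_out   : (D + H*Dk + 3 + R, D_out)  final linear projection
    """
    B, D = h.shape
    H, _, Dk = W_q.shape
    results = []
    for head in range(H):
        q = h @ W_q[head]                            # (B, Dk): query per block
        k = h @ W_k[head]                            # (B, Dk)
        v = h @ W_v[head]                            # (B, Dk)
        w = softmax(q @ k.T / np.sqrt(Dk), axis=-1)  # attend over all blocks
        results.append(w @ v)                        # (B, Dk): result_i per head
    result = np.concatenate(results, axis=-1)        # grows with number of heads
    robot = np.tile(s_robot, (B, 1))                 # broadcast robot state
    # output_i = Linear(concat(h_i^in, result_i, (x_i, y_i, z_i), s_robot))
    return np.concatenate([h, result, pos, robot], axis=-1) @ W_out
```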
4.2 Context network
The context network is the crux of our model. It processes both the current state and the embedding
produced by the demonstration network, and outputs a context embedding, whose dimension does
not depend on the length of the demonstration, or the number of blocks in the environment. Hence, it
is forced to capture only the relevant information, which will be used by the manipulation network.
Attention over demonstration: The context network starts by computing a query vector as a function
of the current state, which is then used to attend over the different time steps in the demonstration
embedding. The attention weights over different blocks within the same time step are summed
together, to produce a single weight per time step. The result of this temporal attention is a vector
whose size is proportional to the number of blocks in the environment. We then apply neighborhood
attention to propagate the information across the embeddings of each block. This process is repeated
multiple times, where the state is advanced using an LSTM cell with untied weights.
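A minimal sketch of one plausible reading of this temporal attention step follows; the shapes, and the exact point at which per-block weights are summed, are our own assumptions:

```python
import numpy as np

def temporal_attention(query, demo_emb):
    """Illustrative temporal attention over the demonstration embedding.

    query    : (D,)       computed from the current state
    demo_emb : (T, B, D)  T time steps, B blocks, D features per block
    Returns a vector whose size is proportional to the number of blocks.
    """
    T, B, D = demo_emb.shape
    scores = demo_emb @ query                 # (T, B): one score per block per step
    flat = scores.reshape(-1)
    w = np.exp(flat - flat.max())
    w = (w / w.sum()).reshape(T, B)           # attention weights over all entries
    w_step = w.sum(axis=1)                    # (T,): sum block weights within a step
    attended = np.einsum('t,tbd->bd', w_step, demo_emb)  # weighted sum over time
    return attended.reshape(-1)               # (B * D,): one vector per block
```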
Attention over current state: The previous operations produce an embedding whose size is inde-
pendent of the length of the demonstration, but still dependent on the number of blocks. We then
apply standard soft attention over the current state to produce fixed-dimensional vectors, where the
memory content consists only of the positions of each block. Together with the robot's state, these
form the context embedding, which is then passed to the manipulation network.
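A rough sketch of this soft attention read, again with illustrative shapes; the use of per-block embeddings as keys is our assumption, while the memory content (block positions) follows the description above:

```python
import numpy as np

def state_attention(query, block_keys, block_pos, s_robot):
    """Illustrative soft attention over the current state.

    query      : (Dk,)    derived from the previous attention stages
    block_keys : (B, Dk)  per-block embeddings used as attention keys
    block_pos  : (B, 3)   memory content: the position of each block
    s_robot    : (R,)     the robot's own state

    Returns a context vector whose size does not depend on B.
    """
    scores = block_keys @ query
    w = np.exp(scores - scores.max())
    w = w / w.sum()                     # soft attention weights over blocks
    attended_pos = w @ block_pos        # (3,): weighted combination of positions
    return np.concatenate([attended_pos, s_robot])  # fixed-size context embedding
```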
Intuitively, although the number of objects in the environment may vary, at each stage of the
manipulation operation, the number of relevant objects is small and usually fixed. For the block
stacking environment specifically, the robot should only need to pay attention to the position of the
block it is trying to pick up (the source block), as well as the position of the block it is trying to place
on top of (the target block). Therefore, a properly trained network can learn to match the current
state with the corresponding stage in the demonstration, and infer the identities of the source and
target blocks expressed as soft attention weights over different blocks, which are then used to extract
the corresponding positions to be passed to the manipulation network. Although we do not enforce
this interpretation during training, our experimental analysis supports this account of how the learned
policy works internally.
4.3 Manipulation network
The manipulation network is the simplest component. After extracting the information of the source
and target blocks, it computes the action needed to complete the current stage of stacking one block
on top of another one, using a simple MLP network.¹ This division of labor opens up the possibility
of modular training: the manipulation network may be trained to complete this simple procedure
without any knowledge of demonstrations, or of environments containing more than two blocks. We
leave this possibility for future work.
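A minimal sketch of what such an MLP might look like; the layer count and parameter names are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def manipulation_network(context, params):
    """Illustrative manipulation network: a plain MLP mapping the fixed-size
    context embedding (source/target block positions plus robot state) to an
    action for the current stage. Two hidden layers are an assumption; the
    text only specifies a simple MLP."""
    h1 = relu(context @ params["W1"] + params["b1"])
    h2 = relu(h1 @ params["W2"] + params["b2"])
    return h2 @ params["W3"] + params["b3"]
```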
5 Experiments
We conduct experiments with the block stacking tasks described in Section 3.2.² These experiments
are designed to answer the following questions:
• How does training with behavioral cloning compare with DAGGER?
• How does conditioning on the entire demonstration compare to conditioning on the final state, even when it already has enough information to fully specify the task?
• How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory, which is a small subset of frames that are most informative?
¹ In principle, one can replace this module with an RNN module. But we did not find this necessary for the tasks we consider.
² Additional experimental results are available in the Appendix, including a simple illustrative example of particle reaching tasks and further analysis of block stacking.