Interactively Picking Real-World Objects with
Unconstrained Spoken Language Instructions
Jun Hatori∗, Yuta Kikuchi∗, Sosuke Kobayashi∗, Kuniyuki Takahashi∗,
Yuta Tsuboi∗, Yuya Unno∗, Wilson Ko, Jethro Tan†
Abstract— Comprehension of spoken natural language is an essential skill for robots to communicate with humans effectively. However, handling unconstrained spoken instructions is challenging due to (1) the complex structures and wide variety of expressions used in spoken language, and (2) the inherent ambiguity of human instructions. In this paper, we propose the first comprehensive system for controlling robots with unconstrained spoken language, which is able to effectively resolve ambiguity in spoken instructions. Specifically, we integrate deep learning-based object detection with natural language processing technologies to handle unconstrained spoken instructions, and propose a method for robots to resolve instruction ambiguity through dialogue. Through experiments in both a simulated environment and on a physical industrial robot arm, we demonstrate that our system can understand natural instructions from human operators effectively, and show how higher success rates on the object picking task can be achieved through an interactive clarification process.1
I. INTRODUCTION
As robots become more ubiquitous, there is an increasing need for humans to interact with them in a convenient and intuitive way. For many real-world tasks, spoken language instructions are more intuitive than programming, and more versatile than alternative communication methods such as touch panel user interfaces [1] or gestures [2], since they allow reference to abstract concepts and the use of high-level instructions. Hence, natural language is a desirable means of interaction between humans and robots.
However, there are two major challenges in realizing robots that interpret language and act accordingly. First, spoken language instructions as used in our daily lives have neither a predefined structure nor a limited vocabulary, and often include uncommon and informal expressions, e.g., “Hey man, grab that brown fluffy thing” (see Figure 1). Second, there is inherent ambiguity in interpreting spoken language, since humans do not always put effort into making their instructions clear. For example, there might be multiple “fluffy” objects present in the environment, as in Figure 1, in which case the robot needs to ask for clarification, e.g., “Which one?”. Although proper handling of such diverse and ambiguous expressions is a critical factor in building domestic or service robots, little effort has been made to date to address these challenges, especially in the context of human–robot interaction.
∗ The starred authors contributed equally and are ordered alphabetically.
† All authors are affiliated with Preferred Networks, Inc. {hatori, kikuchi, sosk, takahashi, tsuboi, unno, wko, jettan}@preferred.jp
1 Accompanying videos are available at the following links: https://youtu.be/_Uyv1XIUqhk (the system submitted to ICRA-2018) and http://youtu.be/DGJazkyw0Ws (with improvements after the ICRA-2018 submission).

Fig. 1: An illustration of object picking via human–robot interaction. Our robot asks for clarification if the given instruction is ambiguous.
In this paper, we tackle these two challenges in spoken human–robot communication and develop a robotic system that a human operator can communicate with using unconstrained spoken language instructions. To handle complex structures and cope with the diversity of unconstrained language, we combine and modify existing state-of-the-art models for object detection [3], [4] and object-referring expressions [5], [6] into an integrated system that can handle a wide variety of spoken expressions and map them to miscellaneous objects in a real-world environment. This modification makes it possible to train the network without explicit object class information, and to realize zero-shot recognition of unseen objects. To handle the inherent ambiguity of spoken instructions, our system also focuses on the process of interactive clarification, in which ambiguity in a given instruction is resolved through dialogue. Moreover, our system combines verbal and visual feedback, as shown in Figure 1, so that the human operator can provide additional explanations to narrow down the object of interest, much as humans do when communicating with each other. We show that spoken language instructions are indeed effective in improving the end-to-end accuracy of real-world object picking.
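To make the pipeline concrete, below is a minimal sketch of how class-agnostic detection, referring-expression grounding, and interactive clarification can fit together. This is an illustration under our own assumptions, not the system's actual implementation: the scoring function, the confidence margin, and all function names are hypothetical placeholders.

```python
# A minimal, self-contained sketch (not the paper's implementation) of a
# detect -> ground -> clarify loop: candidate boxes are scored against the
# spoken instruction, and the robot asks for more detail whenever the top
# two grounding scores are too close to call. score, ask_user, and margin
# are hypothetical placeholders.
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) of a detected candidate

def pick_with_clarification(
    boxes: List[Box],
    score: Callable[[Box, str], float],  # grounding score of a box given text
    instruction: str,
    ask_user: Callable[[str], str],      # asks the operator, returns a reply
    margin: float = 0.1,
) -> Box:
    """Return the box to pick, clarifying while the instruction is ambiguous."""
    if len(boxes) == 1:
        return boxes[0]
    scores = [score(b, instruction) for b in boxes]
    while True:
        ranked = sorted(zip(scores, boxes), reverse=True)
        (s1, best), (s2, _) = ranked[0], ranked[1]
        if s1 - s2 >= margin:  # confident: instruction singles out one object
            return best
        # Ambiguous: ask the operator to narrow down the target, then fold
        # the additional utterance into the scores and re-rank.
        extra = ask_user("Which one? Could you describe it in more detail?")
        scores = [s + score(b, extra) for s, b in zip(scores, boxes)]

# Toy usage: two "fluffy" objects that a word-overlap scorer cannot tell
# apart until the operator mentions the color.
objects = {(0, 0, 10, 10): "brown fluffy bear",
           (20, 0, 10, 10): "white fluffy towel"}
word_score = lambda b, t: sum(w in objects[b].split()
                              for w in t.lower().split()) / 5.0
print(pick_with_clarification(list(objects), word_score,
                              "grab that fluffy thing",
                              ask_user=lambda q: "the brown one"))
```

The one design choice this sketch tries to mirror is that clarification replies are treated as additional referring expressions and folded into the grounding scores, rather than restarting the interaction from scratch.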
Although the use of natural language instructions has
received attention in the field of robotics [7]–[10], our work
is the first to propose a comprehensive system integrating the
process of interactive clarification while supporting uncon-
strained spoken instructions through human–robot dialogue.
To evaluate our system in a complex, realistic environment,