Multi-person Speech Telescience Interaction with Speech Separation
Taotao Fu 1,2
1 Key Laboratory of Space Utilization, Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
Beijing, China
E-mail: 472567528@qq.com
Ge Yu 1, Lili Guo 1, Ji Liang 1
1 Key Laboratory of Space Utilization, Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
E-mail: yuge@csu.ac.cn; guolili@csu.ac.cn; liangji@csu.ac.cn
Abstract—In this paper, we propose a speech interaction method that incorporates speech separation to support telescience interaction. First, a speech separation method based on Deep Clustering with local optimization is proposed to achieve better local separation and to reduce speech distortion. Then, a telescience interaction system is constructed by combining speech recognition, semantic understanding, and speech synthesis. The results show that the proposed method makes multi-person speech telescience interaction possible through speech separation.
Keywords-Telescience; Semantic understanding; Speech
separation; Speech recognition; Speech synthesis
I. INTRODUCTION
Tele-science implies the ability to conduct remote
operations (in space) by making rapid adjustments to
instrumental parameters and experiment procedures in
order to optimize performance and obtain the best possible
data [1]. A number of experiments in exploratory environments that involve frequent control, such as those in space physics and materials science, are preferably operated by scientists on the ground because of the complexity of the experiments and the high workload of the crew in space. There are two essential processes in space tele-science. One is to display experiment information by receiving telemetry data, and the other is to control the onboard facilities by uploading checked tele-commands from the ground.
Nowadays, most scientists still rely on numbers or pictures on a screen to evaluate the progress of their ongoing experiments, and set the target parameters to be uploaded using a keyboard or a mouse, click by click [2], which is sluggish and inefficient.
Therefore, it is necessary to explore an intuitive human-computer interface that incorporates speech interaction, allowing experiments to become more engaging and efficient through the direct involvement of scientists. A more natural, hands-free interaction with devices in the virtual experiment environment can also provide users with a home-institute-like space for planning, scheduling, operations, and correlative analysis.
Language has been the most natural and convenient means of human communication since ancient times. Correspondingly, speech interaction is one of the most direct and natural modes of human-computer interaction. In the coming era of artificial intelligence (AI), speech interaction will free our hands entirely. Speech interaction mainly comprises speech recognition, natural language processing, and speech synthesis, fields that are maturing rapidly with the development of AI. Speech interaction has many advantages, such as immediate reaction to spoken commands, ease of operation, extensive usage scenarios, and the ability to infer discourse meaning. Bruno [3] adopted speech interaction in an adaptive training simulation for navy officers. Ishihara [4] proposed a method that enables manufactured objects, such as anime figures, to exhibit highly realistic behavioral expressions, improving speech interaction between a user and an object. Robert [5] combined speech and deictic gestures to instruct a car about desired interventions, including spatial references to the current environment.
Speech interaction could be integrated into remote scientific experiments to improve the naturalness and flexibility of interaction and to enhance the quality and efficiency of experiments. However, existing speech interaction systems suffer from noise and the "cocktail party" problem [6], which cause low recognition rates and low accuracy. It is therefore necessary to separate each speaker's speech from the mixed signal, thereby enhancing the efficiency of remote scientific interaction.
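The clustering idea behind deep-clustering-style separation can be illustrated with a toy sketch: each time-frequency bin of the mixed spectrogram carries an embedding (in deep clustering these come from a trained network; here they are supplied by the caller), the embeddings are grouped into two clusters with plain k-means, and binary masks select each speaker's bins. This is a simplified illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def separate_two_speakers(mag_spec, embed, n_iter=20):
    """Cluster per-bin embeddings into two groups and mask the mixture.

    mag_spec : (T, F) magnitude spectrogram of the mixed speech
    embed    : (T, F, D) embedding for each time-frequency bin
    Returns two masked magnitude spectrograms, one per speaker.
    """
    T, F, D = embed.shape
    points = embed.reshape(-1, D).astype(float)

    # Deterministic k-means init: the first point and the point farthest from it.
    far = np.argmax(np.linalg.norm(points - points[0], axis=1))
    centers = np.stack([points[0], points[far]])
    for _ in range(n_iter):
        # Assign each bin to its nearest center, then recompute the centers.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)

    # Binary masks: each bin belongs entirely to one of the two speakers.
    mask = (labels == 0).reshape(T, F)
    return mag_spec * mask, mag_spec * ~mask

# Tiny synthetic check: two groups of bins with orthogonal embeddings.
T, F, D = 4, 8, 3
embed = np.zeros((T, F, D))
embed[:, :4, 0] = 1.0   # bins dominated by speaker A
embed[:, 4:, 1] = 1.0   # bins dominated by speaker B
mix = np.ones((T, F))
spec_a, spec_b = separate_two_speakers(mix, embed)
```

In the full deep clustering method the embeddings are learned so that bins of the same speaker lie close together; the toy above only shows how masks follow from the clustering step.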
In this paper, a speech interaction method that incorporates speech separation is proposed to address telescience interaction. The rest of this article is organized as follows. First, a method based on deep clustering (DC) with local optimization is proposed to solve the problem of recovering each voice from a mixed signal. Secondly, we construct a speech telescience interaction system by combining speech recognition, semantic understanding, and speech synthesis. Finally, several experiments are performed to validate the proposed method.
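As a concrete illustration of the rule-matching stage in such a pipeline, the sketch below maps recognized text to telecommands. All patterns and command names here are hypothetical examples, not commands from the actual system.

```python
import re

# Hypothetical command rules mapping recognized text to telecommands;
# the patterns and command names are illustrative only.
RULES = [
    (re.compile(r"set (?:the )?temperature to (\d+)"),
     lambda m: ("SET_TEMP", int(m.group(1)))),
    (re.compile(r"start (?:the )?experiment"),
     lambda m: ("START_EXP", None)),
    (re.compile(r"stop (?:the )?experiment"),
     lambda m: ("STOP_EXP", None)),
]

def understand(text):
    """Rule-matching semantic understanding: return (command, arg) or None."""
    for pattern, build in RULES:
        m = pattern.search(text.lower())
        if m:
            return build(m)
    return None
```

A real system would add slot validation and confirmation via speech synthesis before any checked telecommand is uploaded.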
II. METHODS
In this paper, the speech recognition module based on IFLYTEK is used to collect sound continuously and obtain the corresponding text. The recognition results are then processed by rule matching, semantic