Figure 4: XiaoIce system architecture.
• User experience layer:
This layer connects XiaoIce to popular chat platforms (e.g., WeChat,
QQ), and communicates with users in two modes: full-duplex and taking turns. The full-
duplex mode handles voice-stream-based conversations where a user and XiaoIce can talk to
each other simultaneously. The other mode deals with message-based conversations where a
user and XiaoIce take turns to talk. This layer also includes a set of components used to
process user inputs and XiaoIce responses, e.g., speech recognition and synthesis, image
understanding, and text normalization.
• Conversation engine layer:
This is composed of a dialogue manager, an empathetic
computing module, Core Chat and dialogue skills. The dialogue manager keeps track of
the dialogue state and selects either a dialogue skill or Core Chat¹ using the dialogue policy to
generate responses. The empathetic computing module is designed to understand not only
the content of the user input (e.g., topic) but also the empathetic aspects of the dialogue
and the user (e.g., emotion, intent, opinion on topic, and the user’s background and general
interests). It reflects XiaoIce’s EQ and demonstrates XiaoIce’s social skills to ensure the
generation of interpersonal responses that fit XiaoIce’s personality. XiaoIce’s IQ is shown
by a collection of specific skills and Core Chat.
• Data layer:
This consists of a set of databases that store collected human conversational
data (in text pairs or text-image pairs), non-conversational data and knowledge graphs used
for Core Chat and skills, and the profiles of XiaoIce and all the registered users.
4 Implementation of Conversation Engine
This section describes four major components in the conversation engine layer: dialogue manager,
empathetic computing, Core Chat, and skills.
4.1 Dialogue Manager
Dialogue Manager is the central controller of the dialogue system. It consists of a Global State Tracker,
which is responsible for keeping track of the current dialogue state s, and a Dialogue Policy π, which selects
an action based on the dialogue state as a = π(s). The action can be either a skill or Core Chat
activated by the top-level policy to respond to the user’s specific request, or a response suggested by
a skill-specific low-level policy.
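The relationship a = π(s) and the two policy levels described above can be illustrated with a minimal sketch. This is not XiaoIce's implementation: the class and function names (DialogueState, GlobalStateTracker, top_level_policy, core_chat_policy, weather_policy), the keyword-based trigger, and the canned responses are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DialogueState:
    """Dialogue state s (hypothetical fields, chosen for illustration)."""
    turns: List[str] = field(default_factory=list)
    active_skill: str = "core_chat"


class GlobalStateTracker:
    """Keeps track of the current dialogue state s."""

    def __init__(self) -> None:
        self.state = DialogueState()

    def update(self, user_input: str) -> DialogueState:
        self.state.turns.append(user_input)
        return self.state


# Hypothetical low-level policies: each maps the state to a response.
def core_chat_policy(state: DialogueState) -> str:
    return f"(chat) Tell me more about: {state.turns[-1]}"


def weather_policy(state: DialogueState) -> str:
    return "(weather) Let me look up the forecast."


LowLevelPolicy = Callable[[DialogueState], str]


def top_level_policy(state: DialogueState, skills: Dict[str, LowLevelPolicy]) -> str:
    """Top-level policy: choose a skill or fall back to Core Chat.

    Assumption: a skill is triggered when its name occurs in the last
    user input. A real system would use a learned or rule-based trigger.
    """
    last = state.turns[-1].lower()
    for name in skills:
        if name != "core_chat" and name in last:
            return name
    return "core_chat"


class DialogueManager:
    """Central controller: tracks s, then computes the action a = pi(s)."""

    def __init__(self, skills: Dict[str, LowLevelPolicy]) -> None:
        self.tracker = GlobalStateTracker()
        self.skills = skills

    def respond(self, user_input: str) -> str:
        state = self.tracker.update(user_input)
        # Top-level action: which skill (or Core Chat) to activate.
        state.active_skill = top_level_policy(state, self.skills)
        # Low-level action: the response suggested by that skill's policy.
        return self.skills[state.active_skill](state)


if __name__ == "__main__":
    dm = DialogueManager({"core_chat": core_chat_policy, "weather": weather_policy})
    print(dm.respond("What is the weather today?"))
    print(dm.respond("I had a long day"))
```

In this sketch the top-level choice and the low-level response are kept as separate functions so that each skill can carry its own policy, mirroring the hierarchy described in the text.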
¹ Although Core Chat is by definition a dialogue skill, we single it out by referring to it as Core Chat directly
due to its importance and sophisticated design, and refer to other dialogue skills as skills.