3. Hanabi: A Challenge for Communication and Theory of Mind
Due to Hanabi’s limited communication, players cannot rely solely on
grounded hints to perform well: they, including artificial agents, will need to
convey extra information through their actions. We argue that this makes
Hanabi a novel challenge for artificial intelligence, one that requires algorithmic
advances capable of learning to communicate and of exploiting the theory-of-mind
reasoning commonly employed by humans. This challenge connects
research from several communities, including reinforcement learning, game
theory, and emergent communication. Section 6 has additional discussion of
related work from these communities.
One avenue to approach this problem is to establish conventions prior to play.
Abiding by such conventions enables agents to communicate information about
their observations (i.e., their perspective) beyond the grounded information
contained in hints. To illustrate, one convention (possibly a poor one) could
be that players never hint about a card unless they observe that it is playable.
It then follows that if an agent receives a hint about one of their cards, they
know that card is playable. Note that conventions can communicate information
either with or without a direct signal: if players expect some behaviour in certain
situations, then the absence of that behaviour indicates they are not in one of
these situations. For example, if players expect to be warned when they are at
risk of discarding a valuable card, like a 5, a lack of warning would indicate they
are not at risk. Conventions can also be used to relay more detailed information
through hints by encoding it into a player’s choice of colour or rank hint and
the target of their hint. In fact, this idea forms the basis of Cox and colleagues’
Hanabi agents [15], which we discuss in Section 5.2.
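To make this concrete, the sketch below is our own illustration, not code from
any existing Hanabi agent; the Card type, the fireworks representation, and the
simplification that a hint targets a single card slot are assumptions. It shows
a hinter following the playable-card convention above, and a receiver inverting
that convention, including the information carried by the absence of a hint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    colour: str  # e.g. "red"
    rank: int    # 1..5

def is_playable(card: Card, fireworks: dict[str, int]) -> bool:
    # fireworks maps each colour to the highest rank played so far (0 if none)
    return card.rank == fireworks[card.colour] + 1

def hinter_choose_slot(teammate_hand: list[Card],
                       fireworks: dict[str, int]) -> int | None:
    """Convention: only hint a card the hinter observes to be playable."""
    for slot, card in enumerate(teammate_hand):
        if is_playable(card, fireworks):
            return slot
    return None  # silence: no card in the teammate's hand is playable

def receiver_infer(hinted_slot: int | None, num_slots: int) -> dict[int, str]:
    """Invert the convention: a hint marks its target as playable, while
    the absence of any hint marks every card as currently unplayable."""
    if hinted_slot is not None:
        return {hinted_slot: "playable"}
    return {slot: "not playable" for slot in range(num_slots)}
```

Even this crude convention shows the general pattern: the receiver recovers
information that the grounded content of the hint alone does not carry.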
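The encoding idea can likewise be sketched in miniature. The toy scheme below
is a simplified illustration in the spirit of hat-guessing protocols, not Cox
and colleagues’ actual construction [15]; the function names and the assumption
that hint choices can be enumerated are ours. With H distinguishable hint
actions, a hinter can fold one recommendation per teammate into a single hint
using arithmetic modulo H, because each teammate can reconstruct every
recommendation except their own.

```python
def encode(recommendations: list[int], num_hint_choices: int) -> int:
    """Hinter: fold one recommendation per teammate (each an integer in
    [0, num_hint_choices)) into a single hint index."""
    return sum(recommendations) % num_hint_choices

def decode(hint_index: int, others_recommendations: list[int],
           num_hint_choices: int) -> int:
    """Receiver: recover their own recommendation by subtracting the
    recommendations they can compute for every other teammate."""
    return (hint_index - sum(others_recommendations)) % num_hint_choices

# Example: three teammates and eight distinguishable hint choices.
recs = [2, 5, 7]                     # hinter's recommendation for each teammate
hint = encode(recs, 8)               # (2 + 5 + 7) % 8 == 6
assert decode(hint, [5, 7], 8) == 2  # teammate 0 recovers their own value
```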
Conventions are obviously useful in the self-play setting, where all players
know others will act exactly as they would. Although experienced human players
develop and exploit conventions [14], humans are clearly able to play effectively
outside the self-play setting: human teams consist of distinct individuals, each
with their own capabilities, and are often formed on the fly on websites [4] or
at social events. Notably, humans collaborate successfully without relying on
prior coordination or knowing the policies of their teammates, a scenario
commonly referred to as the ad hoc team setting [16, 17]. Consequently, to
achieve truly superhuman performance in collaborative Hanabi, artificial agents
will need to master both self-play and ad hoc team play.
3.1. A Challenge for Self-Play Learning
One way to learn conventions is to optimise a joint policy to maximise reward
in self-play. This “search over conventions” will naturally develop conventions
to overcome Hanabi’s information scarcity. However, such an optimisation is
difficult in its own right⁸, as Hanabi’s policy space is rife with locally optimal
policies.

⁸Hanabi is an instance of a decentralized Markov decision process (DEC-MDP) since the
players jointly observe the full state of the game. Bernstein and colleagues [18] showed that
solving finite-horizon DEC-MDPs is NEXP-complete.
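As a rough illustration of this self-play optimisation, the skeleton below is an
assumption-laden sketch rather than the training setup of any agent discussed in
this paper; the Env and Policy interfaces are hypothetical stand-ins for a real
Hanabi environment and learner. One shared policy controls every seat, and its
parameters are updated towards higher team return.

```python
from typing import Any, Protocol

class Env(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: int) -> tuple[Any, float, bool]: ...

class Policy(Protocol):
    def act(self, obs: Any) -> int: ...
    def update(self, trajectory: list, episode_return: float) -> None: ...

def self_play_episode(env: Env, policy: Policy) -> float:
    """Play one full game with the same policy acting for every seat,
    then take one learning step on the resulting trajectory."""
    obs, done = env.reset(), False
    episode_return, trajectory = 0.0, []
    while not done:
        action = policy.act(obs)               # shared parameters for all seats
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        episode_return += reward
    policy.update(trajectory, episode_return)  # e.g. a policy-gradient step
    return episode_return
```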