3. Hanabi: A Challenge for Communication and Theory of Mind
Due to Hanabi’s limited communication, players cannot rely solely on
grounded hints to perform well: they, including artificial agents, will need to
convey extra information through their actions. We argue that this makes
Hanabi a novel challenge for artificial intelligence, one that requires algorithmic
advances capable of learning to communicate and of exploiting the theory-of-mind
reasoning commonly employed by humans. This challenge connects
research from several communities, including reinforcement learning, game
theory, and emergent communication. Section 6 has additional discussion of
related work from these communities.
One avenue to approach this problem is to establish conventions prior to play.
Abiding by such conventions enables agents to communicate information about
their observations (i.e., their perspective) beyond the grounded information
contained in hints. To illustrate, one convention (possibly a poor one) could
be that players never hint about a card unless they observe that it is playable.
It then follows that if an agent receives a hint about one of their cards, they
know that card is playable. Note that conventions can communicate information
either with or without a direct signal: if players expect some behaviour in certain
situations, then the absence of that behaviour indicates they are not in one of
these situations. For example, if players expect to be warned when they are at
risk of discarding a valuable card, like a 5, a lack of warning would indicate they
are not at risk. Conventions can also be used to relay more detailed information
through hints by encoding it into a player’s choice of colour or rank hint and
the target of their hint. In fact, this idea forms the basis of Cox and colleagues’
Hanabi agents [15], which we discuss in Section 5.2.
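To make this concrete, the sketch below is our own illustration, not code from
any existing Hanabi agent; the Card type, the fireworks representation, and the
simplification that a hint targets a single card slot are assumptions. It shows
a hinter following the playable-card convention above, and a receiver inverting
that convention, including the information carried by the absence of a hint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    colour: str  # e.g. "red"
    rank: int    # 1..5

def is_playable(card: Card, fireworks: dict[str, int]) -> bool:
    # fireworks maps each colour to the highest rank played so far (0 if none)
    return card.rank == fireworks[card.colour] + 1

def hinter_choose_slot(teammate_hand: list[Card],
                       fireworks: dict[str, int]) -> int | None:
    """Convention: only hint a card the hinter observes to be playable."""
    for slot, card in enumerate(teammate_hand):
        if is_playable(card, fireworks):
            return slot
    return None  # silence: no card in the teammate's hand is playable

def receiver_infer(hinted_slot: int | None, num_slots: int) -> dict[int, str]:
    """Invert the convention: a hint marks its target as playable, while
    the absence of any hint marks every card as currently unplayable."""
    if hinted_slot is not None:
        return {hinted_slot: "playable"}
    return {slot: "not playable" for slot in range(num_slots)}
```

Even this crude convention shows the general pattern: the receiver recovers
information that the grounded content of the hint alone does not carry.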
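The encoding idea can likewise be sketched in miniature. The toy scheme below
is a simplified illustration in the spirit of hat-guessing protocols, not Cox
and colleagues’ actual construction [15]; the function names and the assumption
that hint choices can be enumerated are ours. With H distinguishable hint
actions, a hinter can fold one recommendation per teammate into a single hint
using arithmetic modulo H, because each teammate can reconstruct every
recommendation except their own.

```python
def encode(recommendations: list[int], num_hint_choices: int) -> int:
    """Hinter: fold one recommendation per teammate (each an integer in
    [0, num_hint_choices)) into a single hint index."""
    return sum(recommendations) % num_hint_choices

def decode(hint_index: int, others_recommendations: list[int],
           num_hint_choices: int) -> int:
    """Receiver: recover their own recommendation by subtracting the
    recommendations they can compute for every other teammate."""
    return (hint_index - sum(others_recommendations)) % num_hint_choices

# Example: three teammates and eight distinguishable hint choices.
recs = [2, 5, 7]                     # hinter's recommendation for each teammate
hint = encode(recs, 8)               # (2 + 5 + 7) % 8 == 6
assert decode(hint, [5, 7], 8) == 2  # teammate 0 recovers their own value
```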
Conventions are obviously useful in the self-play setting, where all players
know others will act exactly as they would. Although experienced human players
develop and exploit conventions [14], humans are clearly able to play effectively
outside the self-play setting: human teams consist of distinct individuals, each
with their own capabilities, and are often formed on the fly on websites [4] or
at social events. Notably, humans collaborate successfully without relying on
prior coordination or knowing the policies of their teammates, a scenario
commonly referred to as the ad hoc team setting [16, 17]. Consequently, to
achieve truly superhuman performance in collaborative Hanabi, artificial agents
will need to master both self-play and ad hoc team play.
3.1. A Challenge for Self-Play Learning
One way to learn conventions is to optimise a joint policy to maximise reward
in self-play. This “search over conventions” will naturally develop conventions
to overcome Hanabi’s information scarcity. However, such an optimisation is
difficult in its own right⁸, as Hanabi’s policy space is rife with locally optimal
policies.

⁸Hanabi is an instance of a decentralized Markov decision process (DEC-MDP) since the
players jointly observe the full state of the game. Bernstein and colleagues [18] showed that
solving finite-horizon DEC-MDPs is NEXP-complete.
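As a rough illustration of this self-play optimisation, the skeleton below is an
assumption-laden sketch rather than the training setup of any agent discussed in
this paper; the Env and Policy interfaces are hypothetical stand-ins for a real
Hanabi environment and learner. One shared policy controls every seat, and its
parameters are updated towards higher team return.

```python
from typing import Any, Protocol

class Env(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: int) -> tuple[Any, float, bool]: ...

class Policy(Protocol):
    def act(self, obs: Any) -> int: ...
    def update(self, trajectory: list, episode_return: float) -> None: ...

def self_play_episode(env: Env, policy: Policy) -> float:
    """Play one full game with the same policy acting for every seat,
    then take one learning step on the resulting trajectory."""
    obs, done = env.reset(), False
    episode_return, trajectory = 0.0, []
    while not done:
        action = policy.act(obs)               # shared parameters for all seats
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        episode_return += reward
    policy.update(trajectory, episode_return)  # e.g. a policy-gradient step
    return episode_return
```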