The Problem with AlphaZero for Imperfect Information Games

The ReBeL[1] paper presents an example of the limitations of AlphaZero[2]-like algorithms in imperfect information games, borrowed from "Depth-Limited Solving for Imperfect-Information Games"[3]:

[Figure: the ReBeL paper's example of how AlphaZero fails]

In short [1]:

"This [example] illustrates a critical challenge of imperfect-information games: unlike perfect-information games and single-agent settings, the value of an action may depend on the probability it is chosen."

This is a valuable example, but there is actually an even simpler example of how AlphaZero breaks down in the imperfect information context, even without an adversarial opponent.

Let us consider the following environment:

Step 1. The agent is randomly assigned to the Heads or Tails world (50% probability), but this information is hidden from the agent.

Step 2. The agent is given a chance to take a gamble.  If they choose not to, the episode ends immediately with a reward of 0.

Step 3. If the agent chose to gamble, they now have a choice to guess Heads or Tails.  If they guess correctly, they get a reward of 1.  If they guess incorrectly, they get a reward of -2.
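
To make the setup concrete, here is a minimal sketch of this environment in Python (the class and action names are my own, not taken from any particular framework):

    import random

    class CoinGambleEnv:
        """Toy environment: a hidden coin flip, then an optional gamble."""

        def reset(self):
            # Step 1: the hidden world is chosen but never revealed in the observation.
            self.world = random.choice(("Heads", "Tails"))
            return "start"

        def step(self, action):
            # Returns (observation, reward, done).
            if action == "Pass":
                return None, 0, True          # Step 2: refuse the gamble, reward 0
            if action == "Gamble":
                return "guessing", 0, False   # Step 2: accept the gamble
            # Step 3: the action is a guess of "Heads" or "Tails".
            reward = 1 if action == self.world else -2
            return None, reward, True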

Clearly, the agent should always refuse the gamble: the expected value of gambling is 0.5 * (+1) + 0.5 * (-2) = -0.5, while passing is worth 0.  However, the AlphaZero algorithm naively applied to this environment will arrive at the wrong conclusion, even though we have a perfect simulator and there is no adversarial opponent (i.e. the value of an action does not depend on the probability it is chosen).

We can see how this happens by looking at the full game tree.  Note that after the first step, the environment is entirely deterministic.  Therefore, if we perform the search to full depth, AlphaZero's estimated value for each node will quickly converge to the true value. 


[Figure: the full game tree for the toy environment. The environment's coin flip leads to the Heads or Tails world; in each world the agent may Pass (0.0) or Gamble and then guess Heads or Tails (+1.0 for a correct guess, -2.0 for an incorrect one). AlphaZero's planned path takes the Gamble and the matching guess in both worlds, so every node along its path is valued at 1.0.]

We can see that AlphaZero will conclude that in the Heads world, taking the gamble and choosing Heads will always result in a payoff of 1, and in the Tails world, taking the gamble and choosing Tails will result in a payoff of 1 as well.  Thus, based on the search it performed, it appears that the best action is always to take the gamble. The search algorithm is effectively "cheating", because it knows the state of the hidden information.
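
To spell this out, here is a small sketch (my own illustration of the failure mode, not AlphaZero's actual search code) of a full-depth value backup over game states.  Because the hidden world is part of the game state, the backup gets to condition on it:

    def guess_value(world, guess):
        # Terminal payoff of a guess, given the (hidden) world.
        return 1 if guess == world else -2

    def gamble_value(world):
        # The best guess always pays +1, because the search "knows" the world.
        return max(guess_value(world, g) for g in ("Heads", "Tails"))

    def root_value(world):
        # Pass is worth 0; Gamble appears to be worth +1, so the search prefers Gamble.
        return max(0, gamble_value(world))

    for world in ("Heads", "Tails"):
        print(world, root_value(world))   # prints 1 in both worlds

Averaged over the two worlds, this backup values taking the gamble at +1, even though from the agent's point of view it is worth -0.5.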

The root of this problem is that we are performing our search over game states instead of over information states.  Performing the search over information states would transform our problem from a deterministic, imperfect information environment into a stochastic, perfect information environment, which would allow AlphaZero to work correctly.  In a simple game like this example it is possible to work out an exact information-state simulator by hand, but in general doing so is computationally intractable.

A few notes:

1. In our example environment, it's actually trivial to convert the deterministic imperfect information game into an isomorphic stochastic perfect information game, by delaying the "coin flip" until the agent actually makes its guess (see the first sketch after these notes).  However, in general this might not be possible, as the mechanics of the hidden information might be arbitrarily complicated.

2. If the state space of the environment is relatively small, it is possible to mechanistically convert any deterministic simulator over game states into an information-state simulator, by sampling enough games to get good coverage of all information states.  The sampled transitions that correspond to a given information state then form a sample drawn from the true transition distribution between information states (see the second sketch after these notes).

3. MuZero's[4] world model is essentially a learned information-state simulator.  Note that this allows it to function in an imperfect information environment, but it does not allow it to work in an adversarial environment.  That is, MuZero will solve our example problem above, but will still fail on the ReBeL paper's example.

4. This sort of environment structure is present in the game Hearthstone Battlegrounds. The optimal action usually depends on the hidden information of the opponent's board.  Because you gain partial information about your opponent's board throughout the game, it is not possible to transform this hidden information into an equivalent perfect-information "stochastic" environment. (Note that this is a problem even for deterministic opponents which do not react to your actions.  Considering strategically adversarial opponents complicates things further).

5. Despite these problems, [5] reports reasonable performance applying AlphaZero to regular (non-battlegrounds) Hearthstone, which has imperfect information.

6. In a game with a larger state space but relatively simple rules, it might be possible to use an SMT sampler (e.g. [6]) to automatically create an information-state simulator from the rules of the game.  I have not seen anyone try this approach yet!
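
The first sketch below illustrates note 1: the toy game rewritten so that the coin flip happens only once the agent commits to a guess, which turns it into a stochastic, perfect-information environment (again with names of my own choosing):

    import random

    class DelayedCoinGambleEnv:
        """Isomorphic version of the toy game with the coin flip moved to the end."""

        def reset(self):
            return "start"

        def step(self, action):
            if action == "Pass":
                return None, 0, True
            if action == "Gamble":
                return "guessing", 0, False
            # The hidden information is resolved only now, as an ordinary chance node.
            world = random.choice(("Heads", "Tails"))
            reward = 1 if action == world else -2
            return None, reward, True

The second sketch illustrates note 2.  It is a rough construction, assuming the behavior policy visits every reachable information state and action often enough, that turns a game-state simulator into an empirical information-state simulator by grouping sampled transitions by what the agent actually observed:

    import random
    from collections import defaultdict

    def build_info_state_simulator(env, policy, n_games=100_000):
        # Map (information state, action) -> observed (next information state, reward, done).
        # Here the environment's observation plays the role of the information state.
        transitions = defaultdict(list)
        for _ in range(n_games):
            obs, done = env.reset(), False
            while not done:
                action = policy(obs)
                next_obs, reward, done = env.step(action)
                transitions[(obs, action)].append((next_obs, reward, done))
                obs = next_obs

        def simulate(info_state, action):
            # Each stored outcome is a draw from the empirical transition distribution
            # between information states, so we simply resample from it.
            return random.choice(transitions[(info_state, action)])

        return simulate

With CoinGambleEnv above and a uniformly random behavior policy, simulate("guessing", "Heads") returns a reward of +1 about half the time and -2 the other half, reproducing the chance node of the hand-written DelayedCoinGambleEnv.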


[1] https://arxiv.org/abs/2007.13544

[2] https://arxiv.org/abs/1712.01815

[3] https://arxiv.org/abs/1805.08195

[4] https://arxiv.org/abs/1911.08265

[5] "Self-play Reinforcement Learning for Hearthstone AI" Wayne Yang, Blizzard Entertainment, https://www.microsoft.com/en-us/research/video/ai-and-gaming-research-summit-2021-ai-agents-day-2-track-1-1/

[6] https://people.eecs.berkeley.edu/~ksen/papers/smtsampler.pdf

