Stone Ground Hearth Battles: The First Year
Summary
Stone Ground Hearth Battles (https://github.com/JDBumgardner/stone_ground_hearth_battles) is a project to create a simulator for Hearthstone Battlegrounds and apply reinforcement learning to create a superhuman agent. We currently have a nearly complete simulator, and can train a beginner-level agent.
Origins
Stone Ground Hearth Battles started around the beginning of 2020, suggested by a friend, jazzcs, who was an avid player of the game with a peak ELO of ~10k (we also considered TFT, but we thought it would be much harder to simulate). I also had a couple of other friends who were learning to code or learning machine learning. The project was initially conceived as an attempt to apply the same techniques as the AlphaStar/Dota 2 papers to Hearthstone Battlegrounds, creating an AI agent that can crush all human opponents. Hopefully, along the way, we would learn something regardless of whether we succeeded.
Simulator
Our first step was to begin developing a simulator. Since Hearthstone Battlegrounds does not have an API, we instead wrote a simulator of the whole Hearthstone Battlegrounds game from scratch. At the time, we did not see any other open source simulators available online for the full game, although there were a few battle simulators.
We wrote our simulator in Python, mainly because of how easy the language is to learn. Some of the main contributors were relatively new to programming and were using this project as a vehicle to improve their skills. Python also made integration with deep learning libraries (which are usually Python-first) simpler.
One of the nice things about writing our own simulator was that we could design its interface to be amenable to learning. For the most part, we followed the original design of the game, but we made a few changes to the action space to simplify things.
Learning RL
Implementing RL
Our initial plan was to plug the simulator into an existing, verified RL implementation. This is what everyone suggests, and it is the best way to get a working implementation to build on, without wasting your time debugging mistakes that everyone else has already made. However, it was too much fun to implement it on our own.
We knew based on the AlphaStar paper that we wanted to match the structure of our neural net to the structure of the game. We knew feeding our observations into a fully connected MLP was not going to cut it, since it would lose all understanding of which actions corresponded to which observations, and also lose the ability to generalize between cards in different positions of your hand, or in your hand vs on the board. Thus, Transformers!
We looked into OpenSpiel as a framework for our bot, but it did not seem to support this sort of representation out of the box. (We later revisited OpenSpiel, but found that its model of randomness is not nicely compatible with ours. We will probably look at whether PettingZoo is more compatible.)
Our PPO implementation was improved incrementally, but the two changes that seemed to bring it to life were implementing GAE, and replacing the fully connected layers with a Transformer network (though this may also be because it coincided with a round of extensive debugging). John Schulman's talk Nuts and Bolts of Deep RL Experimentation was quite helpful in giving us ideas for which metrics to monitor. I also want to give a shout-out to Implementation Matters in Deep RL: A Case Study on PPO and TRPO and What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study for giving better details on how to implement PPO, along with useful hyperparameter information.
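For reference, GAE itself is only a few lines. Here is a minimal single-trajectory sketch in PyTorch (the discount and lambda values are typical defaults, not necessarily the ones we use):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards, dones: shape (T,); values: shape (T + 1,), including a
    bootstrap value for the state after the final step.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value function
    return advantages, returns
```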
I also implemented a Tensorboard plugin to visualize the gameplay of the bot. It is actually fairly generic, and other people may find it reusable, since it just embeds Vega charts into Tensorboard using vega-embed. The main logic of the visualization is encoded in a Vega-Lite plot, built with Altair on the Python side.
This plugin lets us interactively visualize nearly everything about a game: all the actions of the bot, the observations at each step, the policy and value function, and other quantities that are helpful for debugging, such as the advantage function. The visualization has grown with the game and with our PPO implementation, and it has been crucial for understanding what the bot is learning and how our learning algorithm is working.
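As a flavor of how little code such a chart needs, here is a hypothetical Altair snippet in the same spirit (the dataframe and field names are invented for illustration; the plugin's real charts are more involved):

```python
import altair as alt
import pandas as pd

# Hypothetical per-step policy probabilities logged from one game.
df = pd.DataFrame({
    "step": [0, 0, 1, 1, 2, 2],
    "action": ["buy", "roll", "buy", "roll", "buy", "roll"],
    "probability": [0.7, 0.3, 0.4, 0.6, 0.8, 0.2],
})

chart = (
    alt.Chart(df)
    .mark_line(point=True)
    .encode(
        x="step:Q",
        y="probability:Q",
        color="action:N",
        tooltip=["step", "action", "probability"],
    )
    .interactive()  # pan/zoom is handled client-side by vega-embed
)
spec = chart.to_json()  # the Vega-Lite JSON spec handed to vega-embed
```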
My takeaway from debugging this system is that it is *not* black magic, if you understand conceptually what every part is supposed to be doing. It is only impossible to debug if you're not sure what each part is supposed to do, and you don't have metrics telling you what it is doing. Visualizing everything, especially all the metrics mentioned in https://andyljones.com/posts/rl-debugging.html is crucial, and you should have an expectation of what each of those metrics should look like when properly functioning. To be clear though, I am not claiming that our implementation is bug-free. The next steps for the project are planned around debugging and validating our implementation, not adding functionality.
Our current setup basically ignores the multi-agent, rock-paper-scissors nature of the game. We just run vanilla PPO, using a pool of past agents as opponents. (We usually set the pool size to 7, so that the rewards are not made noisy by variation in opponent difficulty. Something more sophisticated will be necessary if we want to train against a larger variety of strategies.) Whenever our agent's TrueSkill score moves outside the confidence interval of the next-best agent, we replace an agent in the pool with a copy of the current learning bot.
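As a sketch, the replacement rule might look like the following, using the `trueskill` Python package (the three-sigma interval width is an assumption for illustration, not necessarily our exact rule):

```python
from trueskill import Rating  # pip install trueskill

def should_snapshot(learner: Rating, next_best: Rating, k: float = 3.0) -> bool:
    # Hypothetical helper: replace a pool member once the learner's k-sigma
    # lower bound clears the next-best agent's k-sigma upper bound, i.e.
    # their confidence intervals no longer overlap.
    return learner.mu - k * learner.sigma > next_best.mu + k * next_best.sigma
```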
Network Architecture
I should make a network architecture diagram. If someone bugs me about it I will. But this text description will hopefully be an approximate replacement:
We encode each card with a fixed set of features (attack, toughness, type, etc.), and the player (health, coins, tavern tier, etc.) with another set of features. Since we want to use a Transformer network (in a lazy way), we need to map the player features and the card features into a common vector space. In our case, we use a single linear layer to map each to a common embedding of dimension D (with a single indicator dimension that is 1 only for the player embedding). Then we apply N Post-Norm Transformer layers (inspired by GTrXL, but we were too lazy to implement their gating mechanism as well).
At the output, we have 24 card representations of dimension D (7 in the store, 10 in hand, 7 on the board) and 1 player representation of dimension D. We apply a linear layer to each card representation to get the logits for that card's actions, and a linear layer to the player representation to get the logits for the global actions (i.e., any actions not associated with a specific card). These logits are then fed into a final softmax layer to calculate the action probabilities. For the value network, we use the same architecture, except with a single linear layer on only the player's representation. We do support sharing the policy and value networks, but have mostly kept them separate.
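To make the shapes concrete, here is a rough PyTorch sketch of the network as described. All dimensions and head sizes here are made-up placeholders, and the value head is shown sharing the trunk for brevity, even though we mostly keep the networks separate:

```python
import torch
import torch.nn as nn

class BattlegroundsNet(nn.Module):
    # Hypothetical dimensions; the indicator is the last embedding dim,
    # 0 for cards and 1 for the player token.
    def __init__(self, card_feats=32, player_feats=16, d=128, n_layers=3,
                 card_actions=8, global_actions=4):
        super().__init__()
        self.card_proj = nn.Linear(card_feats, d - 1)
        self.player_proj = nn.Linear(player_feats, d - 1)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=4, batch_first=True, norm_first=False)  # post-norm
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.card_head = nn.Linear(d, card_actions)      # per-card action logits
        self.global_head = nn.Linear(d, global_actions)  # player-level logits
        self.value_head = nn.Linear(d, 1)

    def forward(self, cards, player):
        # cards: (B, 24, card_feats) for store+hand+board; player: (B, player_feats)
        B, n_cards, _ = cards.shape
        card_emb = torch.cat(
            [self.card_proj(cards), cards.new_zeros(B, n_cards, 1)], dim=-1)
        player_emb = torch.cat(
            [self.player_proj(player), player.new_ones(B, 1)], dim=-1)
        tokens = torch.cat([card_emb, player_emb.unsqueeze(1)], dim=1)
        h = self.encoder(tokens)                       # (B, 25, d)
        card_h, player_h = h[:, :n_cards], h[:, n_cards]
        logits = torch.cat(
            [self.card_head(card_h).flatten(1), self.global_head(player_h)],
            dim=1)
        # A softmax over `logits` (with invalid actions masked) gives the policy.
        return logits, self.value_head(player_h).squeeze(-1)
```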
We also optionally concatenate the player representation to each card representation, which makes the transformer attention mechanism less important for propagating information about the player to the action choices for a specific card. We can even run a "zero-layer transformer", a simplified version of our network that eliminates the transformer layers entirely while still keeping the card-level actions aware of the player's state.
Note that while our simulator supports actions which target multiple cards (e.g. playing a card with a battlecry), the encoding for these actions is not yet supported for our bot, so our bot is restricted to targeting the first minion on the board with battlecries. This is a significant restriction and eliminates many strategies from viability.
For the rearrange actions, we use a Plackett-Luce distribution. This assigns a score to each card (computed by a linear layer on top of the card representations output by the transformer), and a rearrangement is sampled by iteratively choosing the next card with probability proportional to the exponential of its score. We have implemented the Plackett-Luce distribution in a way that supports batched inference over permutations of different sizes. This is important because the size of the permutation depends on the number of cards on the board, so batched inference on our bot has to handle variable sizes. I have offered our implementation as a contribution to PyTorch: https://github.com/pytorch/pytorch/pull/50362.
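For illustration, here is one way to implement batched Plackett-Luce sampling and log-probabilities, using the standard Gumbel-sort equivalence (sorting Gumbel-perturbed scores in descending order yields exactly a Plackett-Luce sample). The masking scheme for variable permutation sizes is a sketch, not necessarily how our implementation works:

```python
import torch

NEG = -1e9  # stands in for -inf so padded slots never produce NaNs

def pl_sample(scores, mask):
    # scores, mask: (B, N); mask is True for cards actually on the board.
    # Adding Gumbel noise and argsorting descending samples a permutation
    # with Plackett-Luce probabilities; padded slots sort last.
    u = torch.rand_like(scores).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    return (scores + gumbel).masked_fill(~mask, NEG).argsort(-1, descending=True)

def pl_log_prob(scores, perm, mask):
    # log p(perm) = sum over steps of (chosen score - logsumexp of remaining).
    s = scores.masked_fill(~mask, NEG).gather(-1, perm)
    # Reverse logcumsumexp gives, at each step, the normalizer over the
    # cards that have not been chosen yet.
    denom = torch.logcumsumexp(s.flip([-1]), dim=-1).flip([-1])
    return ((s - denom) * mask.gather(-1, perm).float()).sum(-1)
```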
Results
[{MythraxTheUnraveler 35/66 (t5)}, {SaltyLooter 12/12 (t3)}, {KalecgosArcaneAspect 8/16 (t6)}, {SneedsOldShredder 12/15 (t5), [deathrattle-0], [golden]}, {TheTideRazor 9/4 (t6), [deathrattle-0]}, {Amalgadon 10/10 (t6), [battlecry], [windfury]}, {Murozond 6/6 (t5), [battlecry]}]
Our Cloud Provider
The Future
We have a huge number of possible areas to improve on, and areas to investigate. The nice thing about a project with no external pressure is we can work on whatever seems fun at the time. However, there are a few things that I have in the back of my mind that we will hopefully do at some point.
- Distributed training. It would be cool to pool our machines together to train in a distributed way. Yes, yes, we could throw money at a cloud provider, but only if our compact of humble machines united in brotherhood to defeat hearthstone proves insufficient.
- Integrating with other environments/algorithms. We definitely should compare our implementation against another well-tested, well-respected one, and iron out all the bugs that way. It seems likely to be PettingZoo (and not because of all the advocacy of the author) (okay, maybe a little bit because of that). We should also try applying our PPO implementation to other environments, which will be another test of whether anything is broken. I think we may have something to contribute as well: the generic factored representation of the observation/action space that we built for our transformer net.
- Allow multi-card actions. Our bot should be able to use battlecries properly!
- Tool for investigating our models. We can use our Tensorboard plugin to see how the policy and value change over time. But really, we should be able to easily experiment with how our policy and value functions would change in response to specific changes in the observations, to really probe the depths of the bot's knowledge/stupidity.
- Allow our bot to remember things. For higher levels of play, it is important to remember your opponents' boards, because it can help you prepare to fight them next time. We should likely bake this capability into our bot interface. There is also more intangible information, such as which cards have appeared when rerolling, that I am not sure is important to remember. Using an LSTM might be necessary.
- Agent League. Like AlphaStar, once we have a strong baseline agent, we will need to train it against a variety of strategies to make it stronger. We may use an exploiter league like AlphaStar's, and possibly incentivize different bots to specialize in specific strategies. Maybe domain randomization?
- More visualizations. I think it should be possible to go even further in debugging the learning process than what we already have. Looking at the changes in action probabilities in specific states during learning, the magnitude of parameter changes, the effects of different observations on the value function, etc.
- A GUI for the game. It would be cool to have a website where you could log in and play against the bot in a web based interface.
- Rewrite in Rust (if our simulator becomes the bottleneck, throw it away and rewrite it in Rust). It's what all the cool kids are doing.
- Search. We are using methods designed for environments that cannot be simulated. But our environment can be simulated! We can bring our environment back to the same state and explore other possible actions, and we should be able to exploit this. Our environment is sometimes deterministic (like chess) and sometimes stochastic (like Dota). I feel like there is also some math here to be discovered: some principled way to decide the tradeoff between simulating the same states repeatedly vs simulating new states. I can't seem to find a paper discussing this, but there should be one! If it doesn't exist, we could write it!
Acknowledgements
https://github.com/JDBumgardner