Stone Ground Hearth Battles: The First Year

Summary

Stone Ground Hearth Battles (https://github.com/JDBumgardner/stone_ground_hearth_battles) is a project to create a simulator for Hearthstone Battlegrounds and apply reinforcement learning to create a superhuman agent.  We currently have a nearly complete simulator, and can train a beginner-level agent.

Origins

Stone Ground Hearth Battles started around the beginning of 2020, at the suggestion of a friend, jazzcs, an avid player of the game with a peak ELO of ~10k (we also considered TFT, but thought it would be much harder to simulate).  I also had a couple of other friends who were learning to code, or learning machine learning.  The project was initially conceived as an attempt to apply the same techniques as the AlphaStar/Dota2 papers to Hearthstone Battlegrounds, creating an AI agent that can crush all human opponents.  Hopefully, along the way we would learn something, regardless of whether we succeeded.

Simulator

Our first step was to begin developing a simulator.  Since Hearthstone Battlegrounds does not have an API, we instead wrote a simulator of the whole Hearthstone Battlegrounds game from scratch.  At the time, we did not see any other open source simulators available online for the full game, although there were a few battle simulators.

We wrote our simulator in Python, mainly because of how easy the language is to learn.  Some of the main contributors were relatively new to programming, and were using this project as a vehicle to improve their skills.  It also made integration with deep learning libraries (which are usually in Python) simpler.

One of the nice things about writing our own simulator was that we could design the interface to the simulation to be amenable to learning.  For the most part, we followed the original design of the game, but we made a few changes to the action space to simplify things:

1. The Hearthstone interface allows you to rearrange cards arbitrarily at any point during your turn.  However, this has no effect except at the end of your turn, or when you play a card with battlecry/magnetic.  We instead only allow the user to rearrange their cards once, at the end of their turn.  To handle battlecries/magnetic, we have the agent specify which cards they want to target, instead of which position to play the card in.  This action space is isomorphic to the real game, but is simpler and contains less redundancy (see the sketch after this list).

2. The Hearthstone interface has a single button for freezing/unfreezing.  However, like rearranging your cards, freezing has no effect until the end of your turn (i.e. freezing and then unfreezing does nothing).  We therefore only allow the agent to make the freeze decision once, at the end of their turn.  (There is a new card which complicates this slightly, but we still restrict the action to the end of the turn.)
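
To make this concrete, here is a minimal sketch of what such an action representation might look like.  The class and field names are illustrative, not the simulator's real API:

    from dataclasses import dataclass
    from typing import List

    # Illustrative action types (not the simulator's real classes).  A card is played by
    # naming its battlecry/magnetic targets rather than a board position, the board is
    # rearranged only once per turn, and the freeze decision is made once at end of turn.

    @dataclass
    class SummonAction:
        hand_index: int
        targets: List[int]   # indices of friendly minions targeted by a battlecry/magnetic card

    @dataclass
    class RearrangeCardsAction:
        permutation: List[int]   # final board order, chosen once at the end of the turn

    @dataclass
    class EndTurnAction:
        freeze: bool   # the single end-of-turn freeze/unfreeze decision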

We tried to write the simulator in a reusable way, so that it can handle human and AI agents, text-based and graphical interfaces, local and networked agents, and random or deterministic behavior.  Because we knew the requirements from the beginning, we found it easier to design a properly functioning system.  In fact, many of our issues matching the original game came from buggy behavior in the original game that has since become an important mechanic.

The core of the simulator is an imperative interface, allowing us to apply an action to the current state of the game (player.py and tavern.py).  Events are handled through callbacks, instead of an event queue, which gives us nice stack traces.  On top of this core is layered an abstraction of the agent, which has objects representing actions, and hosts, which allow a set of agents to play a game (round_robin_host.py, host.py).  The randomness used by the game is abstracted out, allowing us to choose deterministic outcomes for unit tests, have random outcomes while playing the game, and have repeatable random outcomes for replays.  At some point, we switched our agent interface to use async, to add support for playing games over the network (play_tcp_game.py).
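
As an illustration of the randomness abstraction, the idea is roughly the following (class and method names here are hypothetical, not the simulator's actual interface):

    import random

    class Randomizer:
        """Hypothetical interface: every random decision the game makes goes through a method like this."""
        def select_drawn_card(self, cards):
            raise NotImplementedError

    class SeededRandomizer(Randomizer):
        """Repeatable randomness for normal play and replays."""
        def __init__(self, seed):
            self._rng = random.Random(seed)

        def select_drawn_card(self, cards):
            return self._rng.choice(cards)

    class ScriptedRandomizer(Randomizer):
        """Deterministic outcomes for unit tests: hands back pre-chosen results in order."""
        def __init__(self, outcomes):
            self._outcomes = list(outcomes)

        def select_drawn_card(self, cards):
            return self._outcomes.pop(0)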
 
Once we had an initial framework set up for the simulator, we began to work on the RL parts of our code at the same time as we expanded the set of cards and heroes implemented in our simulator.  ethansaxenian took leadership over the simulator at this point, and implemented the vast majority of cards and heroes, keeping on top of patches as they were released (currently only one hero and one card are unimplemented).  More often than not, our simulator could handle new cards without any changes to the core implementation, since we had an event system that allowed cards to react to events in a custom way (card_pool.py).  However, some of the time a card would introduce a new mechanic which required adding new functionality to our core simulator.
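
To give a flavor of how an event system like this keeps most cards self-contained, here is an illustrative sketch.  The hook and event names are made up (the real cards live in card_pool.py), and the card text and stats are from memory:

    from dataclasses import dataclass
    from enum import Enum, auto

    class MonsterType(Enum):
        DEMON = auto()
        MECH = auto()

    class MonsterCard:
        monster_type = None
        attack = 0
        health = 0

        def handle_event(self, event, context):
            """Default: ignore all events.  Most cards only override this when they need to react."""
            pass

    @dataclass
    class CardPlayedEvent:
        card: MonsterCard

    class WrathWeaver(MonsterCard):
        attack, health = 1, 3

        def handle_event(self, event, context):
            # "After you play a Demon, take 1 damage and gain +2/+2" (card text from memory).
            if isinstance(event, CardPlayedEvent) and event.card.monster_type is MonsterType.DEMON:
                context.owner.health -= 1
                self.attack += 2
                self.health += 2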
 
Note that sometimes the behavior of the game was a bit arcane.  Luckily, we had an expert to consult, and we could run experiments to determine the correct behavior when stumped.

We also implemented several heuristic agents before beginning our RL journey, to serve as baselines.  These bots did simple things like assigning a score to each card based on attack, toughness, monster type, etc., and then always buying the best card, selling the worst card, rerolling, and upgrading (a sketch of one such bot follows).  This served as a good testing ground for the bot interface to our game.  We also created a small ladder using ELO (later trueskill) to rank these bots against each other, and to compete a little over who could write the best bot.
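
A sketch of one such baseline, reusing the illustrative SummonAction/EndTurnAction classes from the earlier sketch.  The scoring weights and attribute names are made up, not the simulator's real interface:

    from dataclasses import dataclass

    @dataclass
    class BuyAction:            # extends the illustrative action types from the earlier sketch
        store_index: int

    @dataclass
    class TavernUpgradeAction:
        pass

    def card_score(card):
        """Toy scoring heuristic: bigger stats and a higher tier are better (weights are made up)."""
        return card.attack + card.health + 0.5 * card.tier

    class GreedyStatsBot:
        """Buy the best-scoring affordable card, play the best card in hand, otherwise upgrade or end the turn."""

        def choose_action(self, player):
            if player.coins >= 3 and player.store and len(player.hand) < 10:
                best = max(range(len(player.store)), key=lambda i: card_score(player.store[i]))
                return BuyAction(store_index=best)
            if player.hand and len(player.in_play) < 7:
                best = max(range(len(player.hand)), key=lambda i: card_score(player.hand[i]))
                return SummonAction(hand_index=best, targets=[])
            if player.coins >= player.tavern_upgrade_cost:
                return TavernUpgradeAction()
            return EndTurnAction(freeze=False)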

Learning RL

We started from almost zero background in RL.  I had some background with machine learning, but most contributors had none, or were new to programming altogether.   We held informal reading groups to learn the material, with much of the time spent passing information between group members rather than learning entirely new material (however, anyone who has tried it can attest that teaching is the best way to learn).

We started our studying with the AlphaStar paper, and basically built a "stack" of papers to review based on all the missing knowledge we needed.  Our most helpful resource was Lilian Weng's page on Policy Gradients: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html .  It took days of staring at the equations and thinking about them before they started to make sense.  Because we had no time limit, we took our time and recursively looked up anything we didn't understand, rather than pushing onward while confused.

Our basic conclusion after reading most of the AlphaStar and Dota2 papers was that they use quite similar methods, and (as has been remarked in the popular press) the methods are not especially complicated.  We expected that AlphaStar would be similar to AlphaZero, but it really only shares the name.  Both AlphaStar and OpenAI's Dota2 bot implement a basic policy gradient algorithm with "near-policy" off-policy corrections (V-TRACE/IMPALA and PPO, respectively), with additional "stuff" added on top (supervised learning/"The League" and reward shaping/domain randomization, respectively).  (Of course this is an oversimplification; e.g. AlphaStar also had UPGO.)  This made it clear to us that our approach would probably be based on V-TRACE or PPO as well.  We chose PPO, since it seems to have gotten more community adoption.

As we were popping papers off the stack, I implemented basic REINFORCE, A2C, and PPO algorithms.  The REINFORCE and A2C implementations were mostly abandoned, though (yes, yes, I know it is bad form to have a single environment and a single learning algorithm that only work with each other).  I chose Pytorch for the implementation, because I had heard great things about it, although I had only used Tensorflow previously.  Yes, it is better than Tensorflow (mainly because there aren't 5 confusing ways to do the same thing).

Implementing RL

Our initial plan was to plug the simulator into an existing, verified RL implementation.  This is what everyone suggests, and it is the best way to get a working implementation to build on, without wasting your time debugging mistakes that everyone else has already made.  However, it was too much fun to implement it on our own.

We knew based on the AlphaStar paper that we wanted to match the structure of our neural net to the structure of the game. We knew feeding our observations into a fully connected MLP was not going to cut it, since it would lose all understanding of which actions corresponded to which observations,  and also lose the ability to generalize between cards in different positions of your hand, or in your hand vs on the board.  Thus, Transformers!

We looked into OpenSpiel as a framework for our bot, but it did not seem to support this sort of representation out of the box.  (We later revisited open_spiel, but found that its model of randomness is not nicely compatible with ours.  We will probably see if PettingZoo is more compatible.)

Our PPO implementation was improved incrementally, but the main pieces that seemed to give it life were implementing GAE, and implementing a Transformer network instead of fully connected layers (though this may also have been because it coincided with a round of extensive debugging).  John Schulman's talk "Nuts and Bolts of Deep RL Experimentation" was quite helpful in giving us ideas of what metrics to monitor.  I also want to give a shout-out to "Implementation Matters in Deep RL: A Case Study on PPO and TRPO" and "What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study" for giving better details on how to implement PPO, and nice hyperparameter information.
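
For concreteness, here is a textbook-style Pytorch sketch of those two pieces (GAE and the clipped PPO objective).  This is a generic reference implementation, not a copy of our actual code:

    import torch

    def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation over a single trajectory (1-D tensors of length T)."""
        advantages = torch.zeros_like(rewards)
        gae = 0.0
        next_value = 0.0
        for t in reversed(range(len(rewards))):
            not_done = 1.0 - dones[t]
            delta = rewards[t] + gamma * next_value * not_done - values[t]
            gae = delta + gamma * lam * not_done * gae
            advantages[t] = gae
            next_value = values[t]
        return advantages

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        """Clipped PPO surrogate objective, returned as a loss to minimize."""
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()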

I also implemented a Tensorboard plugin to visualize the gameplay of the bot.  It is actually fairly generic, and other people may find it reusable, since it just embeds vega charts into Tensorboard using vega-embed.  The main logic of the visualization is encoded in a vega-lite plot, built with Altair in the Python code.
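
As a toy example of the pattern (not one of the plugin's actual charts): a chart is defined with Altair and serialized to a vega-lite spec, which vega-embed then renders inside Tensorboard.

    import altair as alt
    import pandas as pd

    # Toy data standing in for per-step logs of, say, the value function.
    df = pd.DataFrame({"step": range(20), "value": [0.05 * i for i in range(20)]})

    chart = (
        alt.Chart(df)
        .mark_line()
        .encode(x="step:Q", y="value:Q")
        .interactive()  # pan/zoom comes for free from vega-lite
    )
    vega_lite_spec = chart.to_json()  # this JSON is what gets handed to vega-embed in the browser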

This plugin allows us to interactively visualize nearly everything about a game: all the actions of the bot, the observations at each step, the policy and value function, and other numbers that are helpful for debugging, such as the advantage function.  The visualization has grown with the game, and with our PPO implementation.  It has been crucial for understanding what the bot is learning, and how our learning algorithm is working.

My takeaway from debugging this system is that it is *not* black magic, if you understand conceptually what every part is supposed to be doing.  It is only impossible to debug if you're not sure what each part is supposed to do, and you don't have metrics telling you what it is doing.  Visualizing everything, especially all the metrics mentioned in https://andyljones.com/posts/rl-debugging.html , is crucial, and you should have an expectation of what each of those metrics should look like when the system is properly functioning.  To be clear though, I am not claiming that our implementation is bug-free.  The next steps for the project are planned around debugging and validating our implementation, not adding functionality.

Our current setup basically ignores the multi-agent rock-paper-scissors nature of the game.  We just run vanilla PPO, using a pool of past agents as opponents.  (We actually usually set the pool size to 7, so that there is no noise in the rewards from varying opponent difficulty.  Using something more sophisticated will be necessary if we want to train against a larger variety of strategies.)  Whenever our agent's trueskill score is outside the confidence interval of the next best agent, we replace another agent in the pool with a copy of the current learning bot.
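
In rough Python, the pool update looks something like the following (using the trueskill package).  The 3-sigma interval and the helper names are assumptions for illustration, not our exact code:

    import trueskill

    def maybe_snapshot_learner(learner_rating, pool, make_snapshot):
        """pool: dict mapping agent name -> (frozen agent, trueskill.Rating).
        make_snapshot: hypothetical callable that returns a frozen copy of the learner."""
        # The "next best" agent is the strongest member of the opponent pool.
        best_name = max(pool, key=lambda name: pool[name][1].mu)
        best_rating = pool[best_name][1]
        # Replace a pool member once the learner's rating interval clears the best pool member's.
        if learner_rating.mu - 3 * learner_rating.sigma > best_rating.mu + 3 * best_rating.sigma:
            worst_name = min(pool, key=lambda name: pool[name][1].mu)
            pool[worst_name] = (make_snapshot(), trueskill.Rating())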

Network Architecture

I should make a network architecture diagram.  If someone bugs me about it I will.  But this text description will hopefully be an approximate replacement:

We encode each card with a fixed set of features (attack, toughness, type, etc), and the player (health, coins, tavern tier, etc) as another set of features.  Since we want to use a Transformer network (in a lazy way), we want to map the player features and the card features into a common vector space.  In our case we use a single linear layer to map to a common embedding of dimension D (with a single indicator dimension which is 1 only for the player embedding).  Then we apply N Post-Norm Transformer layers (inspired by GTrXL, but we were too lazy to implement its gating mechanism as well).

At the output, we have 24 (7 in store, 10 in hand, 7 on board) card representations of dimension D, and 1 player representation of dimension D.  We apply a linear layer to each card representation to get the logits for that card's actions, and a linear layer to the player representation to get the logits for the global actions (i.e. any actions not associated with a specific card).  These logits are then fed into a final softmax layer to calculate the action probabilities.  For the value network, we use the same architecture, except with a single linear layer on only the player's representation.  We do support sharing the policy and value network, but have mostly kept them separate.
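
A compressed Pytorch sketch of this architecture follows.  All sizes are placeholders, the stock nn.TransformerEncoder (post-norm by default, and requiring a reasonably recent Pytorch for batch_first) stands in for our custom layers, and the policy and value heads share a trunk here for brevity, which we mostly do not do:

    import torch
    import torch.nn as nn

    class BattlegroundsNet(nn.Module):
        """Sketch of the architecture described above; not our exact implementation."""

        def __init__(self, card_feats=32, player_feats=16, d_model=64,
                     n_layers=2, n_heads=4, card_actions=4, global_actions=8):
            super().__init__()
            # Map cards and the player into a common embedding; the final dimension is an
            # indicator that is 1 only for the player token.
            self.card_proj = nn.Linear(card_feats, d_model - 1)
            self.player_proj = nn.Linear(player_feats, d_model - 1)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.card_head = nn.Linear(d_model, card_actions)      # per-card action logits
            self.global_head = nn.Linear(d_model, global_actions)  # actions not tied to a card
            self.value_head = nn.Linear(d_model, 1)                # value from the player token only

        def forward(self, cards, player):
            # cards: (batch, 24, card_feats); player: (batch, player_feats)
            c = self.card_proj(cards)
            p = self.player_proj(player).unsqueeze(1)
            c = torch.cat([c, torch.zeros_like(c[..., :1])], dim=-1)   # indicator = 0 for cards
            p = torch.cat([p, torch.ones_like(p[..., :1])], dim=-1)    # indicator = 1 for the player
            tokens = self.encoder(torch.cat([p, c], dim=1))
            player_repr, card_repr = tokens[:, 0], tokens[:, 1:]
            logits = torch.cat([self.global_head(player_repr),
                                self.card_head(card_repr).flatten(start_dim=1)], dim=-1)
            return torch.log_softmax(logits, dim=-1), self.value_head(player_repr).squeeze(-1)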

We also optionally concatenate the player representation to each card representation, which makes the transformer attention mechanism less important for propagating information about the player to the action choices for a specific card.  We can even have a "zero layer transformer", a simplified version of our network that eliminates the transformer layers entirely while keeping the card-level actions aware of the player's state.

Note that while our simulator supports actions which target multiple cards (e.g. playing a card with a battlecry), the encoding for these actions is not yet supported for our bot, so our bot is restricted to targeting the first minion on the board with battlecries.  This is a significant restriction and eliminates many strategies from viability.

For the rearrange actions, we use a Plackett-Luce distribution.  This assigns a score to each card (computed by a linear layer on top of the card representations output by the transformer layers), and a rearrangement is sampled by iteratively choosing the next card with probability proportional to the exponential of its score.  We have implemented the Plackett-Luce distribution in a way that supports batched inference over permutations of different sizes.  This is important because the size of the permutation depends on the number of cards on the board, and our bot needs to support batched inference.  I have offered our implementation as a contribution to Pytorch: https://github.com/pytorch/pytorch/pull/50362.
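
For the curious, here is a simplified sketch of sampling and scoring a Plackett-Luce permutation, without the variable-size batching that our real distribution handles:

    import torch

    def sample_plackett_luce(scores):
        """Sample permutations from a Plackett-Luce distribution via the Gumbel trick.
        scores: (batch, n) unnormalized log-scores; returns (batch, n) permutations."""
        gumbel = -torch.log(-torch.log(torch.rand_like(scores).clamp_min(1e-20)))
        return (scores + gumbel).argsort(dim=-1, descending=True)

    def plackett_luce_log_prob(scores, perm):
        """log P(perm) = sum_i [ s_{perm_i} - logsumexp(s_{perm_i}, ..., s_{perm_n}) ]."""
        s = scores.gather(-1, perm)
        # Reverse cumulative logsumexp gives, at position i, the logsumexp over the not-yet-chosen cards.
        denom = torch.logcumsumexp(s.flip(-1), dim=-1).flip(-1)
        return (s - denom).sum(dim=-1)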

Results

Our bot is able to learn the basic strategy of the game: buy cards with good stats, play cards with good stats, sell cards with bad stats, upgrade the tavern, and reroll otherwise.  However, it is as yet unable to learn more advanced strategies that rely on powerful combos.  Here is an example final board, showing its ability to construct a powerful board, but its inability to discover a good strategy.

[{MythraxTheUnraveler 35/66 (t5)}, {SaltyLooter 12/12 (t3)}, {KalecgosArcaneAspect 8/16 (t6)}, {SneedsOldShredder 12/15 (t5), [deathrattle-0], [golden]}, {TheTideRazor 9/4 (t6), [deathrattle-0]}, {Amalgadon 10/10 (t6), [battlecry], [windfury]}, {Murozond 6/6 (t5), [battlecry]}]
 
It has built its board around MythraxTheUnraveler, which gets buffed based on the variety of minion types the player has on their board.  It is natural to see how this would be a strong card for a bot that buys cards without much of a long-term strategy.  However, note that if one committed to the MythraxTheUnraveler strategy, much higher stat totals could be achieved by fully optimizing for minion diversity, and by getting a golden Mythrax.  (Also note that since the bot can't use battlecries effectively, this may be strongly affecting the strategies it learns.)

The bot is able to reach 48 trueskill after a couple of days of training (compared to 20 trueskill for a random bot), but seems to plateau.  At this level of skill, our bot is able to beat all of our handcrafted agents.

However, at this point we are unsure if the bottleneck is bugs in our implementation, or the need for more computational power.  The level at which our skill plateaus seems to be affected by batch size (which I like to call experience buffer size, to reduce confusion) and learning rate.  We may just be hitting the computational limits of our setup.

Our Cloud Provider

Our cloud provider only offers a single machine: it has an FX-6300 processor, a GTX-980 graphics card, and until recently had only 8 GB of RAM.  It has frequent downtime, whenever the provider wishes to play video games.  The upside is that it's free!  This cloud provider is, of course, myself :)  The other contributor who trains the bot doesn't even have a graphics card.  However, before you beat down my door telling me to use vast.ai, let me first tell you that we are not even using all of the resources available to us on our current host.

Our initial implementation was single threaded. Based on profiling, our bottleneck was not the simulator, but actually the inference.  (Think: we are playing 8 versions of the agent vs each other, so each action requires a full network inference).
 
Adding parallelism actually turned out to be a bit more difficult than expected.  Our first implementation used multiprocessing to distribute the game playing across multiple independent processes.  We thought for sure this would provide a speedup, at least for CPU training.  However, we found it actually did not!  It turns out, Pytorch was already secretly parallelizing our matrix multiplications across threads, so adding more processes on top of that just spawned a huge number of threads contending for the same limited number of cores.  With CUDA, we also found no speedup, because the CUDA operations then became the bottleneck!  (Some sort of locking on CUDA, I'm guessing.)

In order to get a real speedup, we needed the inference to be batched during game playing, the same way it is batched when performing the optimization steps.  Thus I developed our current implementation, which uses multithreading (not multiprocessing), since that makes it easy to pass around CUDA tensors and other information.  The idea is that we have multiple threads playing games, each sending requests to a single thread which performs the batched CUDA inference.  Since there are actually 8 different models (one for each agent in the game), only certain requests can be batched together, so we have a queuing system where requests to run inference on the same model are batched together.  This means that the inference batch size will be ~1/8th of the number of games being played simultaneously (turns are not of fixed length, so the different games quickly become unsynchronized).  We can fit up to 256 simultaneous games in our 4 GB of GPU memory before CUDA OOMs.  (We could probably fit more, but that would require pushing more work to the CPU, and since we are using multithreading instead of multiprocessing, the CPU would become the new bottleneck.)
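
The shape of that queuing system is roughly the following.  This is a simplified sketch (one request queue per model, a single output tensor per forward pass, no timeouts), not our actual code:

    import queue
    import threading
    import time
    import torch

    class BatchedInferenceThread:
        """Sketch: game-playing threads submit observations; one thread batches requests
        per model and runs the forward passes on the GPU."""

        def __init__(self, models, max_batch=32, device="cuda"):
            self.models = models                                     # name -> nn.Module
            self.device = device
            self.max_batch = max_batch
            self.queues = {name: queue.Queue() for name in models}   # only same-model requests batch
            threading.Thread(target=self._loop, daemon=True).start()

        def infer(self, model_name, obs):
            """Called from a game thread; blocks until the batched result is ready."""
            done, result = threading.Event(), {}
            self.queues[model_name].put((obs, result, done))
            done.wait()
            return result["out"]

        def _loop(self):
            while True:
                idle = True
                for name, q in self.queues.items():
                    batch = []
                    while len(batch) < self.max_batch:
                        try:
                            batch.append(q.get_nowait())
                        except queue.Empty:
                            break
                    if not batch:
                        continue
                    idle = False
                    obs = torch.stack([item[0] for item in batch]).to(self.device)
                    with torch.no_grad():
                        out = self.models[name](obs).cpu()
                    for (_, result, done), row in zip(batch, out):
                        result["out"] = row
                        done.set()
                if idle:
                    time.sleep(0.001)  # avoid spinning when no requests are pending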

This setup is a bit faster, but is still quite slow overall.  The amortized time to play a single game is 3 seconds (I can hear andyljones crying faintly in the background).  After a bit of struggle interpreting the py-spy profile, I realized that our new bottleneck is the GIL.  So unfortunately we must now move to multiprocessing.  Pytorch distributed RPC looks great and is exactly what we need, except it doesn't support CUDA tensors.  If anyone has suggestions about what to do instead, I am happy to hear them :)

The Future

We have a huge number of possible areas to improve on, and areas to investigate.  The nice thing about a project with no external pressure is we can work on whatever seems fun at the time.  However, there are a few things that I have in the back of my mind that we will hopefully do at some point.

  1. Distributed training.  It would be cool to pool our machines together to train in a distributed way.  Yes, yes, we could throw money at a cloud provider, but only if our compact of humble machines, united in brotherhood to defeat Hearthstone, proves insufficient.
  2. Integrating with other environments/algorithms.  We definitely should compare our implementation to another well-tested, respected one, and iron out all the bugs that way.  It seems likely to be PettingZoo (and not because of all the advocacy of its author) (okay, maybe a little bit because of that).  We should also try applying our PPO implementation to other environments, which will be another test of whether anything is broken.  I think we may have something to contribute as well, in terms of the generic factored representation of the observation/action space that we built for our transformer net.
  3. Allow multi-card actions.  Our bot should be able to use battlecries properly!
  4. Tool for investigating our models.  We can look at our tensorboard plugin to see how the policy and value change over time.  But really we should be able to easily experiment with how our policy and value functions would change in response to specific changes in the observations, to really probe the depths of its knowledge/stupidity.
  5. Allow our bot to remember things.  For higher levels of play, it is important to remember your opponents' boards, because it can help you prepare to fight them next time.  We likely should bake this capability into our bot interface.  There is also more intangible information, such as which cards have appeared when rerolling, that I am not sure is important to remember.  Using an LSTM might be necessary.
  6. Agent League.  Like AlphaStar, once we get a strong baseline agent, we will need to start training it against a variety of strategies to make it stronger.  We may use an exploiter league like AlphaStar's, and possibly incentivize different bots to specialize in specific strategies.  Maybe domain randomization?
  7.  More visualizations.  I think it should be possible to go even further in debugging the learning process than what we already have.  Looking at the changes in action probabilities in specific states during learning, the magnitude of parameter changes, the effects of different observations on the value function, etc.
  8. A GUI for the game.  It would be cool to have a website where you could log in and play against the bot in a web based interface.
  9. Rewrite in Rust (if our simulator becomes the bottleneck, throw it away and rewrite it in Rust).  It's what all the cool kids are doing.
  10. Search.  We are using methods that work for environments that cannot be simulated.  But our environment can be simulated!  We can bring our environment back to the same state and explore other possible actions.  We should be able to exploit this!  Our environment is sometimes deterministic (like chess) and sometimes stochastic (like Dota).  I feel like there is also some math here to be discovered.  There should be some principled way to decide the tradeoff between simulating the same states repeatedly vs simulating new states.  I can't seem to find a paper discussing this, but there should be one!  If it doesn't exist, we could write it!

 

Acknowledgements

https://github.com/ethansaxenian
https://github.com/jeremysalwen (me)
https://github.com/jackrobison
https://github.com/JDBumgardner
https://github.com/AdamSalwen
https://github.com/jazzcs
https://github.com/dianarvp
https://github.com/distilleddata
https://github.com/davityourway
https://github.com/sofiacf
 
Shout out to /r/reinforcementlearning and the associated discord channel for being a great resource.  Special thanks to @swift_truth for the suggestion to write up this summary.

Please don't hesitate to contact me by any means if you are curious or would like to contribute!  I am active on the RL discord channel.
