Training Mario with Reinforcement Learning

Date Created: November 12, 2024

Date Modified:

Project header image - Mario

One lovely day, I asked myself: how can I make a computer learn to play Mario? Well, I did just that, and I embarked on this journey to understand reinforcement learning (RL) better. This blog post documents my experiments with training an AI agent to play Super Mario Bros using a Double Deep Q-Network (DDQN).

What is Reinforcement Learning?

Reinforcement learning (RL) is a type of machine learning where an agent learns by interacting with its environment to achieve a specific goal. The agent takes actions and receives rewards or penalties based on the effectiveness of those actions. Over time, the agent uses this feedback to adjust its behavior, aiming to maximise cumulative rewards.

RL is commonly used in gaming, robotics, finance, and any other setting where sequential decision-making under uncertainty is required. For example, an RL agent can learn to play video games by trying different strategies and learning from the outcomes. A famous example is AlphaGo, which used RL to beat world champions in the game of Go. Powerful agents like AlphaStar in StarCraft and OpenAI Five in Dota 2 have also reached superhuman levels of play.

In the case of Mario, the agent's actions are moving left or right, jumping, or shooting fireballs.

The basic elements of reinforcement learning include:

  1. Agent: The learner or decision-maker that interacts with the environment.
  2. Environment: The setting in which the agent operates.
  3. Action: Choices the agent can make.
  4. State: The current situation or context in which the agent finds itself.
  5. Reward: Feedback given to the agent to indicate success or failure.
Agent-environment interaction loop

"For any given state, an agent can choose to do the most optimal action (exploit) or a random action (explore)." This is something that agent has to learn to make better decision. This trade-off between exploration and exploitation is a key challenge in reinforcement learning.

Initial Setup

Note: This blog follows the instructions from this tutorial. I will discuss and modify things along the way to make sense of both the article and the code.

Setting up the environment was quite the adventure. If you've worked with Python packages before, you know the usual suspects - version conflicts, deprecation warnings, and the occasional "this doesn't work like it used to" moments.

After a few hours of debugging and package juggling, I finally got everything working together.

Listing every dependency I used would be a bit much, but here are the main ones:


pytorch=2.4.1=py3.8_cuda12.4_cudnn9_0
torchrl=0.5.0
torchvision=0.20.0
gym=0.26.1
gym-super-mario-bros=7.4.0
numpy=1.24.4
matplotlib-base=3.7.3
        

You can see that I used CUDA and cuDNN for GPU acceleration. To find the right versions, I recommend checking here, the official PyTorch website, or the miniconda conda-forge channel for up-to-date packages. Once you have the right versions, it will look something like this:

CUDA device confirmation
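For reference, a quick way to confirm that PyTorch can actually see the GPU is a snippet like this (the exact output depends on your machine):

import torch

# Check whether CUDA is available and which device PyTorch will use
use_cuda = torch.cuda.is_available()
print(f"Using CUDA: {use_cuda}")
if use_cuda:
    print(f"Device: {torch.cuda.get_device_name(0)}")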

Initialise Environment

In Mario, the tubes, mushrooms, and so on are components of the environment. This is the game world the agent interacts with, taking actions and receiving rewards based on its performance.


import gym
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace

# Initialise Super Mario environment (in v0.26 change render mode to 'human' to see results on the screen)
if gym.__version__ < '0.26':
    env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0", new_step_api=True)
else:
    env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0", render_mode='rgb', apply_api_compatibility=True)

# Define the movement options for Super Mario
MOVEMENT_OPTIONS = [
    ["right"],        # Move right
    ["right", "A"],   # Jump right
]

# Apply the wrapper to the environment
env = JoypadSpace(env, MOVEMENT_OPTIONS)

env.reset()
next_state, reward, done, trunc, info = env.step(action=0)
print(f"{next_state.shape},\n {reward},\n {done},\n {info}")
        

This code snippet initialises the Super Mario environment, defines the movement options, and applies the JoypadSpace wrapper, which helps the agent interact with the environment by limiting it to a defined set of actions. As a test, the code prints out the next state, reward, done flag, and info after taking the first action in the environment.

(240, 256, 3),
0.0,
False,
{'coins': 0, 'flag_get': False, 'life': 2, 'score': 0, 'stage': 1, 'status': 'small', 'time': 400, 'world': 1, 'x_pos': 40, 'y_pos': 79}

When you call env.step(action=0), you're telling Mario to perform action 0, which, according to the MOVEMENT_OPTIONS list, is moving right. Of course, you can change the action to 1 for a jump to the right. The function returns the next state (a 240x256x3 image), the reward (0.0 in this case), a done flag (False, meaning the episode is not over), and some additional information about the environment. This is roughly what Mario sees (for illustration purposes):

Mario screen

Think of it like pressing the right button for a split second, checking whether you got a reward and whether the game is over, and then receiving the new state of the game. This is a single step in the game; the interaction loop continues until the game is over or the AI beats the level.
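As a minimal sketch (using the env object created above and a random action in place of a trained policy), the full interaction loop is just:

# Keep stepping until the episode ends (Mario dies, time runs out, or the flag is reached)
env.reset()
done = False
while not done:
    action = env.action_space.sample()              # placeholder: random action, no learning yet
    next_state, reward, done, trunc, info = env.step(action)
    done = done or trunc                            # treat truncation (time limit) as the end
env.close()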

Pre-process the Environment

In the previous output, the next state was a 240x256x3 image which is returned by the environment. Often, this is too much information for the agent to process directly. Mario does not need to see the entire screen to make decisions. Instead, we will apply wrappers to the environment to pre-process the images and make them more manageable for the agent.

Environment wrapper Frame transformation

The final output is smaller by almost 85%, which means faster processing and less memory usage. This is much simpler than what a human sees, but contains all the essential information needed to learn how to play the game effectively.
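The tutorial chains several wrappers to get there (including a custom frame-skipping one); a simplified sketch using only the standard gym wrappers, which gray-scale, resize to 84x84, and stack 4 frames, looks roughly like this:

from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

# Gray-scale: 240x256x3 RGB frame -> 240x256 single-channel frame
env = GrayScaleObservation(env)
# Downscale each frame to 84x84 pixels
env = ResizeObservation(env, shape=84)
# Stack 4 consecutive frames so the agent can perceive motion
env = FrameStack(env, num_stack=4)
# Observations are now 4x84x84 instead of 240x256x3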

Replay Buffer

A replay buffer is like a "memory bank" that stores the agent's experiences while it plays the game. Each experience consists of: The current state (what Mario sees), the action taken, the reward received, the next state, and whether the game ended (done flag). Something like this:

Replay buffer structure

In this tutorial, the author uses torchrl's TensorDictReplayBuffer backed by LazyMemmapStorage. The replay buffer stores experiences (state, action, reward, next state, done) and samples mini-batches for training the agent. However, I failed to get it running:

OSError: [WinError 1455] The paging file is too small for this operation to complete

Apparently, this is a Windows-specific error that occurs when the system tries to allocate more virtual memory than is available. Since I didn't want to mess with the system too much and I don't have the money for more RAM yet, I decided to write a simpler replay buffer implementation, which I called SimpleReplayBuffer.

SimpleReplayBuffer                                 | TensorDictReplayBuffer with LazyMemmapStorage
Uses a simple Python deque for storage             | More sophisticated memory management
Implements basic prioritised experience replay     | Uses memory mapping for large datasets
Straightforward memory management                  | More complex data structures
Less feature-rich but more robust                  | More features but more potential points of failure
Memory usage comparison

My SimpleReplayBuffer does 2 key things (see the sketch after this list):

  1. push (Storing Experiences):
    • It takes a snapshot of what happened during one step of Mario's gameplay.
    • It calculates how important this memory is (priority = |reward| + epsilon).
    • It stores the experience in a deque and the priority in a separate deque.
  2. sample (Prioritised Sampling):
    • Think of this like Mario "remembering" past experiences to learn from. It takes in how many memories to recall (batch_size) and prefers to remember important moments (high rewards).
    • If picking important memories fails (due to zero probabilities), it falls back to random memories.
    • Then, it converts the chosen memories into a format suitable for learning (GPU tensors).
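Here is a trimmed-down sketch of the idea. This is not my exact class, only an illustration of the push/sample mechanics described above:

from collections import deque

import numpy as np
import torch


class SimpleReplayBuffer:
    def __init__(self, capacity, device="cpu", epsilon=0.01):
        self.buffer = deque(maxlen=capacity)       # experiences
        self.priorities = deque(maxlen=capacity)   # one priority per experience
        self.device = device                       # pass "cuda" to get GPU tensors back
        self.epsilon = epsilon                     # keeps zero-reward experiences sampleable

    def push(self, state, action, reward, next_state, done):
        # Store one step of gameplay together with its priority
        self.buffer.append((state, action, reward, next_state, done))
        self.priorities.append(abs(reward) + self.epsilon)

    def sample(self, batch_size):
        # Prefer experiences with a high absolute reward; fall back to uniform sampling
        priorities = np.array(self.priorities, dtype=np.float64)
        total = priorities.sum()
        if total > 0:
            indices = np.random.choice(len(self.buffer), batch_size, p=priorities / total)
        else:
            indices = np.random.choice(len(self.buffer), batch_size)
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return tuple(torch.tensor(x, device=self.device)
                     for x in (states, actions, rewards, next_states, dones))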

Even though it serves essentially the same purpose as TensorDictReplayBuffer, SimpleReplayBuffer is actually the better fit for my current setup. It is lightweight, easy to understand, and doesn't require any special dependencies. Plus, it is easier to debug and modify if needed.

It's-a me, Mario!

It's time to create the Mario-playing agent. Basically, Mario should be able to:

Mario agent overview

Logging

For tracking and visualising the training process, a MetricLogger class is created. It is a comprehensive logging system used to track, save, and visualise the agent's performance during training.

Metric logger diagram

In essence, these metrics (episode reward, episode length, average loss, and average Q-value) help me understand how the agent's learning is progressing.
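As a rough sketch of the idea (this is not the tutorial's actual MetricLogger, just a minimal stand-in), logging can be as simple as appending per-episode numbers and plotting them with matplotlib:

import matplotlib.pyplot as plt

# Minimal logger: collect per-episode metrics, then plot them after training
episode_rewards = []
episode_lengths = []

def log_episode(total_reward, length):
    episode_rewards.append(total_reward)
    episode_lengths.append(length)

def save_reward_plot(path="reward_plot.jpg"):
    plt.plot(episode_rewards)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.savefig(path)
    plt.close()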

Evaluation

I have made an evaluation script in order to watch MarioNet play. After all, the goal is to see Mario beat the game by itself, right?

Mario evaluation flow

The script runs in human-visible mode (render_mode="human") and records video, making it easy to analyse the agent's behaviour.

The evaluation process is somewhat similar to the training process. The eval script still initialises the environment and MarioNet, but it loads the checkpoint produced by the training process earlier. Then it runs the gameplay loop with the model in control and records its performance. This is where you can see how well Mario is performing and whether he's improving over time.

Mario Gameplay - Recorded on November 17, 2024

As you can see, he's stuck. But that's okay, it's part of the learning process. The agent will learn from these experiences and improve over time.

The script tracks several metrics to characterise Mario's playing style and identify areas for improvement: total steps taken, final x-position reached, chosen actions, whether Mario reached the flag, and average speed (pixels/step).
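As a hedged sketch of that loop (assuming the wrapped env from earlier and a trained agent object mario whose act() method returns the network's chosen action; details depend on your gym API version):

# Run one evaluation episode and collect the metrics listed above
state = env.reset()
done = False
steps = 0
actions_taken = []

while not done:
    action = mario.act(state)                       # greedy action from the trained network
    state, reward, done, trunc, info = env.step(action)
    actions_taken.append(action)
    steps += 1
    done = done or trunc                            # time limit also ends the episode

print(f"Steps taken: {steps}")
print(f"Final x-position: {info['x_pos']}")
print(f"Reached flag: {info['flag_get']}")
print(f"Average speed: {info['x_pos'] / steps:.2f} pixels/step")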

Training Process and Parameter Optimisation

Training a reinforcement learning agent to play Super Mario Bros is computationally intensive. With my RTX 3060 Ti 8GB GPU setup, I had to make several compromises and optimisations to make the training feasible. Here's what I learned from two different training approaches.

Training Approach Results
First Training Run: Fast but Limited
  • Duration: 1-2 hours
  • Episodes: 1000
  • Action Space: 2 actions (only jump right and move right)
  • Parameters: More aggressive learning rates and exploration decay
Second Training Run: Longer but More Complex
  • Duration: 10-11 hours
  • Episodes: 5000
  • Action Space: 6 actions (no move, right, run right, jump right, run+jump right, charge)
  • Parameters: More conservative settings for stability

# First Run (Aggressive)
batch_size = 32                      # Smaller batch size for faster, noisier updates
exploration_rate_decay = 0.99999975  # Slower decay, so the agent keeps exploring longer
gamma = 0.90                         # Lower discount factor, favours short-term rewards
learning_rate = 0.00025              # Lower learning rate than the second run
SimpleReplayBuffer(100000, ...)      # Larger buffer to hold more experiences

# Second Run (Conservative)
batch_size = 256                     # Larger batch size for more stable learning
exploration_rate_decay = 0.99999     # Faster decay, shifts to exploitation sooner
gamma = 0.95                         # Higher discount factor, values long-term rewards more
learning_rate = 0.0005               # Higher learning rate per update
SimpleReplayBuffer(50000, ...)       # Smaller buffer for less memory usage
    
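For context, the piece that ties gamma, the batch size, and the learning rate together is the Double DQN update from reference [1]: the online network picks the best next action, the target network evaluates it, and gamma discounts that value. Here is a hedged sketch of the loss computation, assuming online_net and target_net are the two Q-networks and the batch tensors come from the replay buffer:

import torch
import torch.nn.functional as F

def ddqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.95):
    # Q(s, a) from the online network for the actions that were actually taken
    q_values = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online network selects the best next action...
        best_next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network evaluates that action
        next_q = target_net(next_states).gather(1, best_next_actions).squeeze(1)
        # TD target: reward plus discounted future value (zero if the episode ended)
        td_target = rewards + gamma * next_q * (1 - dones.float())

    # Huber loss between the current estimate and the TD target
    return F.smooth_l1_loss(q_values, td_target)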

After messing around, here is what I concluded:

I will certainly continue to experiment with different hyperparameters and training strategies to improve the agent's performance. My first idea is to take the 2-action model and train it over smaller numbers of episodes. I will also try more extreme parameters to force the model to converge faster, since my hardware is somewhat limited.

I will share more findings in part 2 soon enough. In the meantime, you can check out a more concise write-up of this project or my other projects.

References

  1. Van Hasselt, Hado & Guez, Arthur & Silver, David. (2015). Deep Reinforcement Learning with Double Q-Learning. Proceedings of the AAAI Conference on Artificial Intelligence. 30. http://arxiv.org/pdf/1509.06461.pdf
  2. Feng, Yuansong & Subramanian, Suraj & Wang, Howard & Guo, Steven. Train a Mario-playing RL Agent. Available at: https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html
  3. OpenAI Spinning Up tutorial: https://spinningup.openai.com/en/latest/
