# **Abstract**

In 2013, DeepMind published a paper called "Playing Atari with Deep
Reinforcement Learning" introducing an algorithm called the Deep
Q-Network (DQN), which revolutionized the field of reinforcement
learning. For the first time, they brought together deep learning and
Q-learning and showed impressive results applying deep reinforcement
learning to Atari games, with their agents reaching or surpassing
human-level performance on several of the games they were trained on.

A Deep Q-Network uses a deep neural network to estimate the q-value of
each action, allowing the policy to select the action with the maximum
q-value. Using a deep neural network to obtain q-values is immensely
superior to q-table look-ups and widened the applicability of
q-learning to more complex reinforcement learning environments.

While revolutionary, the original version of DQN had a few problems,
especially its slow and inefficient learning process. Over the past 9
years, a few improved versions of DQN have become popular. This project
is an attempt to study the effectiveness of a few of these DQN flavors,
understand the problems they solve, and compare their performance in
the same reinforcement learning environment.
# Deep Q-Networks and their flavors

- **Vanilla DQN**

  The vanilla (original) DQN uses two neural networks: the **online**
  network and the **target** network. The online network is the main
  neural network that the agent uses to select the best action for a
  given state. The target network is usually a (periodically updated)
  copy of the online network. It is used to get the "target" q-values
  for each action of a particular state: during the learning phase,
  since we don't have actual ground truths for future q-values, the
  q-values from the target network are used as labels to optimize the
  online network (a code sketch of these target computations follows
  this list).

  The target network calculates the target q-values using the
  following Bellman equation: \[\begin{aligned}
  Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})
  \end{aligned}\] where

  \(Q(s_t, a_t)\) = the target q-value (ground truth) for a past
  experience in the replay memory

  \(r_{t+1}\) = the reward that was obtained for taking the chosen
  action in that particular experience

  \(\gamma\) = the discount factor for future rewards

  \(\max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})\) = the q-value of the best
  action (according to the target network) for the next state of that
  particular experience
- **Double DQN**

  One of the problems with vanilla DQN is the way it calculates its
  target values (ground truths). We can see from the Bellman equation
  above that the target network uses the **max** q-value directly in
  the equation. This is found to almost always overestimate the
  q-value, because the **max** function introduces a maximization bias
  into our estimates: max returns the largest value even when that
  value is an outlier, which skews our estimates upward.

  Double DQN solves this problem by changing the original algorithm as
  follows:

  1. Instead of using the **max** function, first use the online
     network to estimate the best action for the next state.
  2. Calculate the target q-values for the next state for each possible
     action using the target network.
  3. From the q-values calculated by the target network, use the
     q-value of the action chosen in step 1.

  This can be represented by the following equation: \[\begin{aligned}
  Q(s_t, a_t) = r_{t+1} + \gamma \, Q_{target}(s_{t+1}, a'_{t+1})
  \end{aligned}\] where \[\begin{aligned}
  a'_{t+1} = \operatorname*{argmax}_{a \in A} Q_{online}(s_{t+1}, a)
  \end{aligned}\]
- **Dueling DQN**

  The Dueling DQN algorithm was an attempt to improve upon the original
  DQN algorithm by changing the architecture of the neural network used
  in deep Q-learning. The Dueling DQN architecture splits the last
  layer of the DQN into two parts, a **value stream** and an
  **advantage stream**, whose outputs are combined in an aggregating
  layer that produces the final q-values. One of the main problems with
  the original DQN algorithm was that the q-values of the different
  actions were often very close, so selecting the action with the
  maximum q-value might not always be the best choice. The Dueling DQN
  attempts to mitigate this by using the advantage, which is a measure
  of how much better an action is compared to the other actions in a
  given state. The value stream, on the other hand, learns how good or
  bad it is to be in a specific state, e.g. moving straight towards an
  obstacle in a racing game, or being in the path of a projectile in
  Space Invaders. Separating the estimate into value and advantage
  streams, instead of predicting a single q-value directly, helps the
  network generalize better.
  Fig: The Dueling DQN architecture (image taken from the original
  paper by Wang et al.)
  The q-value in a Dueling DQN architecture is given by
  \[\begin{aligned}
  Q(s_t, a_t) = V(s_t) + A(s_t, a_t)
  \end{aligned}\] where

  \(V(s_t)\) = the value of the current state (how advantageous it is
  to be in that state)

  \(A(s_t, a_t)\) = the advantage of taking action \(a_t\) in that state

  In practice, the aggregating layer also subtracts the mean advantage
  over all actions from \(A(s_t, a_t)\), as in the original paper, so
  that the value and advantage streams are identifiable.
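To make the three variants concrete, here is a minimal PyTorch sketch of the two target computations and a dueling head. This is an illustration rather than my actual training code: the fully connected body, the layer sizes, `gamma=0.99` and the batch layout are assumptions (for the image-based games the body would be convolutional).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DuelingQNet(nn.Module):
    """Dueling head: a shared body followed by separate value and advantage streams."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)              # V(s)
        self.advantage = nn.Linear(128, n_actions)  # A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.body(obs)
        v = self.value(h)        # shape (batch, 1)
        a = self.advantage(h)    # shape (batch, n_actions)
        # Subtract the mean advantage so the value/advantage split is identifiable.
        return v + a - a.mean(dim=1, keepdim=True)


@torch.no_grad()
def dqn_target(reward, next_obs, done, target_net, gamma=0.99):
    """Vanilla DQN target: the target network both selects and evaluates the next action."""
    next_q = target_net(next_obs).max(dim=1).values
    return reward + gamma * next_q * (1.0 - done)


@torch.no_grad()
def double_dqn_target(reward, next_obs, done, online_net, target_net, gamma=0.99):
    """Double DQN target: the online network selects the action, the target network evaluates it."""
    best_action = online_net(next_obs).argmax(dim=1, keepdim=True)
    next_q = target_net(next_obs).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q * (1.0 - done)


def td_loss(batch, online_net, target_net, gamma=0.99):
    """Huber loss between the online network's q-values and the (Double) DQN target."""
    obs, action, reward, next_obs, done = batch
    q_pred = online_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    q_target = double_dqn_target(reward, next_obs, done, online_net, target_net, gamma)
    return F.smooth_l1_loss(q_pred, q_target)
```

For vanilla or Dueling DQN, the same `td_loss` would simply call `dqn_target` instead of `double_dqn_target`; the dueling variant only changes the network class, not the loss.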
# About the project

My original goal for the project was to train an agent using DQN to
play **Airstriker Genesis**, a space-shooting game, and then evaluate
the same agent's performance on another similar game called
**Starpilot**. Unfortunately, I was unable to train a decent enough
agent on the first game, which made it meaningless to evaluate its
performance on yet another game.

Because I still want to do the original project some time in the
future, I thought it would be better to prepare for it by first
learning in depth how Deep Q-Networks work, what their shortcomings
are and how they can be improved. For this reason, and because of time
constraints, I changed my project for this class to a comparison of
various DQN versions.
# Dataset

I used the excellent [Gym](https://github.com/openai/gym) library to
run my environments. A total of 9 agents were trained: 1 on Airstriker
Genesis, 4 on Starpilot and 4 on Lunar Lander.

| **Game** | **Observation Space** | **Action Space** |
| :--- | :--- | :--- |
| Airstriker Genesis | RGB values of each pixel of the game screen (255, 255, 3) | Discrete(12), representing each button on the console controller. Since only three of those buttons are used in the game, the action space was reduced to 3 during training (Left, Right, Fire). |
| Starpilot | RGB values of each pixel of the game screen (64, 64, 3) | Discrete(15), representing each of the button combos (Left, Right, Up, Down, Up + Right, Up + Left, Down + Right, Down + Left, W, A, S, D, Q, E, Do nothing). |
| Lunar Lander | 8-dimensional vector: (X-coordinate, Y-coordinate, Linear velocity in X, Linear velocity in Y, Angle, Angular velocity, Boolean (Leg 1 in contact with ground), Boolean (Leg 2 in contact with ground)) | Discrete(4): (Do nothing, Fire left engine, Fire main engine, Fire right engine). |
**Environment/Libraries**:
Miniconda, Python 3.9, Gym, PyTorch, NumPy and TensorBoard on my
personal MacBook Pro (M1).
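For reference, the three environments can be created roughly as follows. This is a sketch, not my exact setup code: the environment IDs and the `reset`/`step` return values depend on the installed versions of gym, procgen and gym-retro (this sketch assumes the older 4-tuple `step` API).

```python
import gym    # classic control / Box2D environments such as Lunar Lander
import retro  # gym-retro, which bundles Airstriker-Genesis

# Lunar Lander ships with Gym (requires the Box2D extra).
lunar = gym.make("LunarLander-v2")

# Starpilot is part of procgen, which registers its environments with Gym.
starpilot = gym.make("procgen:procgen-starpilot-v0")

# Airstriker Genesis comes bundled with gym-retro.
airstriker = retro.make(game="Airstriker-Genesis")

# Standard interaction loop (older Gym API).
obs = lunar.reset()
obs, reward, done, info = lunar.step(lunar.action_space.sample())
```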
# ML Methodology

Each agent was trained using DQN or one of its flavors. All agents for
a particular game were trained with the same hyperparameters, with only
the underlying algorithm differing. The following metrics were used to
evaluate each agent (a small logging sketch follows this list):

- **Epsilon value over each episode**: the exploration rate at the end
  of each episode.
- **Average q-value over the last 100 episodes**: the average q-value
  (of the chosen action) over the last 100 episodes.
- **Average length over the last 100 episodes**: the average number of
  steps taken per episode.
- **Average loss over the last 100 episodes**: the average loss during
  learning over the last 100 episodes (a Huber loss was used).
- **Average reward over the last 100 episodes**: the average reward the
  agent accumulated over the last 100 episodes.
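Since these metrics are per-episode values or 100-episode moving averages, logging them to TensorBoard can look roughly like the sketch below. The run name, tag names and the `log_episode` helper are illustrative assumptions, not my exact code.

```python
from collections import deque

import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/dqn-lunarlander")  # hypothetical run name

# Rolling 100-episode windows for the averaged metrics.
rewards_100 = deque(maxlen=100)
lengths_100 = deque(maxlen=100)
losses_100 = deque(maxlen=100)
qvalues_100 = deque(maxlen=100)


def log_episode(episode, epsilon, reward, length, mean_loss, mean_q):
    """Record one episode's metrics; called at the end of every training episode."""
    rewards_100.append(reward)
    lengths_100.append(length)
    losses_100.append(mean_loss)
    qvalues_100.append(mean_q)

    writer.add_scalar("epsilon", epsilon, episode)
    writer.add_scalar("avg_reward_100", np.mean(rewards_100), episode)
    writer.add_scalar("avg_length_100", np.mean(lengths_100), episode)
    writer.add_scalar("avg_loss_100", np.mean(losses_100), episode)
    writer.add_scalar("avg_q_100", np.mean(qvalues_100), episode)


# Example call with made-up numbers for a single episode:
log_episode(episode=0, epsilon=1.0, reward=-150.0, length=92, mean_loss=0.8, mean_q=0.1)
writer.flush()
```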
## Preprocessing

For the Airstriker and the Starpilot games (a wrapper sketch follows
at the end of this section):

1. Changed each frame to grayscale

   Since color shouldn't matter to the agent, I decided to convert the
   RGB image to grayscale.

2. Changed the observation space shape from (height, width, channels)
   to (channels, height, width) to make it compatible with PyTorch

   PyTorch expects a channels-first layout, which differs from the
   direct output of the gym environment, so I had to reshape each
   observation to match PyTorch's scheme (this took me a very long
   time to figure out, but I had an "Aha!" moment when I remembered
   you saying something similar in class).

3. Frame stacking

   Instead of processing 1 frame at a time, process 4 frames at a
   time, because a single frame does not carry enough information
   (e.g. the direction of motion) for the agent to decide what action
   to take.

For Lunar Lander, since the rewards change very drastically (sudden
+100, -100 or +200 rewards), I experimented with reward clipping
(clipping the rewards to the \[-1, 1\] range), but this didn't seem to
make much difference in my agent's performance.
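These preprocessing steps map fairly directly onto standard Gym wrappers. The sketch below is a simplified illustration rather than my exact pipeline: the `ChannelsFirst` wrapper is a hypothetical helper of my own, and the `keep_dim` choice and frame-stack handling depend on the gym version.

```python
import gym
import numpy as np
from gym.wrappers import FrameStack, GrayScaleObservation, TransformReward


class ChannelsFirst(gym.ObservationWrapper):
    """Transpose (H, W, C) observations to PyTorch's (C, H, W) layout."""

    def __init__(self, env):
        super().__init__(env)
        h, w, c = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(c, h, w), dtype=np.uint8)

    def observation(self, obs):
        return np.transpose(obs, (2, 0, 1))


def make_starpilot():
    env = gym.make("procgen:procgen-starpilot-v0")
    env = GrayScaleObservation(env, keep_dim=True)  # (64, 64, 3) -> (64, 64, 1)
    env = ChannelsFirst(env)                        # -> (1, 64, 64)
    env = FrameStack(env, num_stack=4)              # stacks the last 4 frames
    return env


def make_lunarlander(clip_rewards=False):
    env = gym.make("LunarLander-v2")
    if clip_rewards:
        # Reward clipping experiment: squash rewards into [-1, 1].
        env = TransformReward(env, lambda r: float(np.clip(r, -1.0, 1.0)))
    return env
```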
# Results

- **Airstriker Genesis**

  The loss went down until about episode 5200, but after that it
  stopped decreasing. Consequently, the average reward the agent
  accumulated over the last 100 episodes pretty much plateaued after
  about 5000 episodes. On closer inspection, I noticed that my
  exploration rate at the end of the 7000th episode was still about
  0.65, which means the agent was taking random actions more than half
  of the time. In hindsight, I feel I should have trained longer, at
  least until the epsilon value (exploration rate) had fully decayed
  to 5%.
- **Starpilot**

  I trained DQN, Double DQN, Dueling DQN and Dueling Double DQN
  versions for this game to compare the different algorithms.

  From the graph of mean q-values, we can tell that the vanilla DQN
  versions indeed give higher q-values while their Double DQN
  counterparts give lower values, which makes me think that my
  implementation of the Double DQN algorithm was OK. I had expected
  the Double and Dueling versions to start accumulating higher rewards
  much earlier, but since the average reward was roughly the same for
  all the agents, I could not notice any stark differences between the
  performance of each agent.
- **Lunar Lander**

  Since I did not gain much insight from the agents in the Starpilot
  game, I thought I was not training long enough. So I tried training
  the same agents on Lunar Lander, which is a comparatively simpler
  game with a smaller observation space and one that a DQN algorithm
  should be able to converge on fairly quickly (based on comments by
  other people in the RL community).
  The results for this were interesting. Although I did not find any
  vast difference between the different variations of the DQN
  algorithm, I found that the performance of my agents suddenly got
  worse at around 300 episodes. Upon researching why this may have
  happened, I learned that DQN agents can suffer from **catastrophic
  forgetting**, i.e. after training extensively, the network suddenly
  forgets what it has learned in the past and starts performing worse.
  Initially, I thought this might have been the case here, but since I
  had not trained for very long, and because all models started
  performing worse at almost exactly the same episode number, I think
  this is more likely a problem with my code or with a hyperparameter
  I used.

  Upon checking what the agent was doing in the actual game, I found
  that it was playing it very safe: constantly hovering in the air and
  not attempting to land the spaceship (the goal of the agent is to
  land between the yellow flags). I thought that penalizing the agent
  for taking too many steps in an episode might work, but that didn't
  help either.
# Problems Faced

Here are a few of the problems that I faced while training my agents:

- Understanding the various hyperparameters in the algorithm. DQN has
  a lot of moving parts, and thus tuning each parameter was a
  difficult task. There were about 8 different hyperparameters (some
  correlated) that affected the agent's training performance. I
  struggled with understanding how each parameter impacted the agent
  and with figuring out how to find good values for them. I ended up
  tuning them by trial and error.
- I got stuck for a long time figuring out why my convolutional layer
  was not working. I didn't realize that PyTorch puts the channels in
  the first dimension, and because of that, I was passing huge numbers
  like 255 (the height of the image) into the input-channel dimension
  of a Conv2d layer (see the sketch after this list).
- I struggled with knowing how long is long enough to decide that a
  model is not working. I trained a model on Airstriker Genesis for 14
  hours only to realize later that I had set a parameter incorrectly
  and had to retrain from scratch.
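To illustrate the channels-first pitfall: with PyTorch's (C, H, W) layout, `in_channels` of the first `Conv2d` must be the number of stacked frames/channels, not the image height. The kernel size and stride below are the classic DQN choices and are assumptions, not my exact settings.

```python
import torch
import torch.nn as nn

# Observations are 4 stacked grayscale 255x255 frames in (C, H, W) layout.
obs = torch.zeros(1, 4, 255, 255)  # (batch, channels, height, width)

# Wrong: treating the image height as the channel dimension.
# conv = nn.Conv2d(in_channels=255, out_channels=32, kernel_size=8, stride=4)

# Right: in_channels is the number of stacked frames.
conv = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4)
print(conv(obs).shape)  # torch.Size([1, 32, 62, 62])
```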
# What Next?

Although I didn't get a final working agent for any of the games I
tried, I feel like I have learned a lot about reinforcement learning,
especially about deep Q-learning. I plan to improve upon this further,
and hopefully get an agent to go far in at least one of the games.
Next time, I will start by debugging my current code to see whether I
have any implementation mistakes. Then I will train the agents a lot
longer than I did this time and see if that works. While learning
about the different flavors of DQN, I also learned a little about
NoisyNet DQN, Rainbow DQN and Prioritized Experience Replay. I
couldn't implement these for this project, but I would like to try
them out some time soon.
# Lessons Learned

- Reinforcement learning is a very challenging problem. It takes a
  substantially large amount of time to train, it is hard to debug,
  and it is very difficult to tune the hyperparameters just right. It
  is quite different from supervised learning in that there are no
  actual labels, which makes optimization very difficult.
- I tried training agents on the Airstriker Genesis and the procgen
  Starpilot games using just the CPU, but this took a very long time.
  This is understandable because the inputs are images, and using a
  GPU would obviously have been better. Next time, I will definitely
  try using a GPU to make training faster.
- When faced with the problem of my agent not learning, I went into
  research mode and got to learn a lot about DQN and its improved
  versions. I am not a master of the algorithms yet (I have yet to get
  an agent to perform well in a game), but I feel like I understand
  how each version works.
- Rather than just following someone's tutorial, also reading the
  actual papers for a particular algorithm helped me understand the
  algorithm better and implement it.
- Doing this project reinforced in me that I love the concept of
  reinforcement learning. It has made me even more interested in
  exploring the field further and learning more.
# References / Resources

- [Reinforcement Learning (DQN) Tutorial, Adam Paszke](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
- [Train a Mario-playing RL Agent, Yuansong Feng, Suraj Subramanian, Howard Wang, Steven Guo](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html)
- [About Double DQN, Dueling DQN](https://horomary.hatenablog.com/entry/2021/02/06/013412)
- [Dueling Network Architectures for Deep Reinforcement Learning (Wang et al., 2015)](https://arxiv.org/abs/1511.06581)

*(The final source code for the project can be found [here](https://github.com/00ber/ml-reinforcement-learning).)*