Reinforcement learning coupled with deep learning based function approximation has been an exciting area over the past couple years. The appeal of learning methods which can effectively learn to search an action/reward environment and derive a good policy based on experience and random exploration is quite significant for a wide range of applications. Vlad Minh’s original DeepMind Deep-Q Networks (DQN) paper demonstrating raw-pixel based learning on Atari games was an awesome demonstration of what was possible, and there have been tons of improvements and other interesting applications by others since then.
Since then, who hasn’t wanted to play around with RL on the handful of Atari games and their own domain specific automation tasks? DeepMind released their code for this experiment along with the Nature paper, but it was frustratingly in Lua/Torch and as the paper stated, takes quite long (~30 days?) to learn Atari games to a high level of skill. Since then I’ve become quite fond of working with Keras, Theano, and TensorFlow on a range of ML problems — the workflow and simplicity of python/numpy/tensor algorithm definition and Cuda cross-compilation is just too attractive and productive for me to want to work in anything else right now, so naturally I wanted to leverage these same tools in the RL space. I started looking into DQN and other RL algorithm implementations available and found a handful of helpful examples, but no particularly featureful, fast or satisfying projects which were designed to be easily applied to your own environments, which got me thinking about standardizing an interface to environments so that learners could be easily applied to a wide class of problems. Shortly after this though occured to me, OpenAI published their GYM software and online environment scoreboard — pretty much solving this problem and providing a wide range of environmental learning tasks already integrated into a relatively simple reinforcement learning environment API. This was great, I started playing with DQN implementations leveraging Keras on top of Gym and KeRLym (Keras+RL+Gym) was the result.
The initial results from kerlym were relatively frustrating, DQN tuning is hard and implementing the algorithms is error prone. The Atari simulator isn’t the fastest, and it takes quite a while to sequentially play enough games to generate a significant amount of experience. So then there’s been a good bit of work recently in asynchronous methods for RL, running lots of agents in parallel to each run their own episodes and share model parameters and gradients. Corey Lynch published an awesome implementation of async-rl using Keras and Gym-based Atari games which I spent a good bit of time playing with. The result was I refactored kerlym significantly to leverage a lot of the async-dqn techniques demonstrated there.
With the new asynchronous DQN implementation, frame-diff’ing, an atari frame pre-processor Andrej Karpathy used recently in his blog post about RL, I finally had a somewhat effective learner that I could set loose on Atari games and see a gradual improvement of total reward (score per game) take form over the course of several hours. Below is an example of ~64k episodes of Breakout running on kerlym with diagnostic plots enabled to monitor training performance.
At this point I finally have some confidence in the correctness of this agent implementation, but there are still countless hyper-parameters which can be tuned and significantly effect performance.
There are a couple of directions I hope to go at this point:
- Implementing additional agents to compare training performance: I love the speedup of asynchronous/concurrent training, and I’m impatient for multi-day RL tests, so I would really love to add working asynchronous Policy Gradient (PG), TRPO, and A3C agents which can be easily interchanged and tested.
- Exploring applications of these learning agents to new domains: What other tasks can we readily train and execute using the learning models we have at this point? Being an applied person, I kind of want to throw DQNs at every task under the sun at this point and see what works, the goal of kerlym is largely to make this easy to do. I’ve started building out-of-tree gym environments for various tasks, such as the tone search task described here, and its exciting to think of the possibilities applying this this to a number of radio domain tasks.
For now, its hard to stop watching kerlym play Breakout and Pong over and over, slowly improving.