Deep reinforcement learning - the fusion of trial-and-error learning and function-approximating neural networks - is one of the hottest areas of machine learning research right now and is the subject of much excitement, largely, I believe, because of how it resembles the endgame of AI research, artificial general intelligence, in a way that neither supervised nor unsupervised learning does. There is, however, a prevailing attitude that RL is not ready to be put to use in practical scenarios and instead belongs solely in the laboratories of universities and tech giants, conquering toy challenges and video games one at a time until it is ready to emerge. While the many present shortcomings of Deep RL provide good evidence for this viewpoint (some of which I will discuss later in the series), I think Deep RL is ready to tackle many real-world challenges and getting hobbyists/companies involved sooner rather than later would accelerate development. My immediate purposes for writing are to explain what reinforcement learning is and to kick off my post series about the major RL algorithms, but ultimately I want to encourage others to begin hacking away with DRL and try applying it to real-world problems.

What is it?

Reinforcement learning algorithms attempt to attain a goal by taking actions in their environment, assessing their performance and altering their behavior. The performance assessment comes in the form of a reward signal derived from the environment. This could be the score of Pong game, the time since a humanoid robot last fell over, or a simple binary measure of whether a self-driving car has taken its passenger to the destination successfully or not. In order to maximize reward, any approach to reinforcement learning must have some structure to choose the correct action. In deep reinforcement learning, this is one or more neural networks. Neural networks are ideal because of their ability to generalize in complex, high-dimensional environments. Depending on the approach taken, they can take in the current state of the environment and output either an action to take or the desirability of a certain state.

In this post I use reinforcement learning (RL) and deep reinforcement learning (DRL) interchangeably, however, they refer to slightly different concepts. As Sutton and Barto put it:

Reinforcement learning is like many topics with names ending in -ing, such as machine learning, planning, and mountaineering, in that it is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods.

Deep reinforcement learning is one such type of solution method that utilizes neural networks. Other RL solutions exist, including dynamic programming and tabular reinforcement learning, which uses lookup tables to record the reward associated with encountered states instead of neural networks.


Modern reinforcement learning is not without its shortcomings. The foremost among these is sample efficiency - the number of times that an algorithm must observe a state, take an action, and improve is currently crippling for many use cases. Atari games that take humans minutes to pick up take state-of-the-art DRL algorithms millions of frames to master $^{1}$. In addition, reinforcement learning algorithms assume that the environment is a Markov decision process. This means that they assume that the optimal action to be taken in a certain state can be determined from a single observation. This poses a problem for many real-life problems that people would want to solve with RL. While recurrent and convolutional neural networks can help, they come at the cost of even worse sample efficiency.

What’s next?

While I have experience working on a deep reinforcement learning-powered product, there are many areas in which my knowledge is lacking. In writing this post series, I hope to fill in some of those gaps. In the next post, I plan on elaborating upon the standard formulation of reinforcement learning (as described in the beginning of nearly every RL paper) and covering the major traits that differentiate approaches to DRL. I will be using Sutton and Barto’s Reinforcement Learning as my primary source and I recommend that anyone who is interested pick up a copy or read the free online version.