Computable AI - Fundamentalshttps://computable.ai/2019-11-27T00:00:00-05:00A Machine Intelligence BlogDeep RL Fundamentals #1: Elements of RL2019-11-27T00:00:00-05:002019-11-27T00:00:00-05:00Andrew Farabowtag:computable.ai,2019-11-27:/articles/2019/Nov/27/deep-rl-fundamentals-1-elements-of-rl.html<p>A breakdown of the various parts of the reinforcement learning problem and algorithms that solve them</p> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>It's a running joke on our team that pretty much every paper on reinforcement learning has some canned description of what RL is in their first or second section. Below is Lillicrap et al’s take on the classic formula from their paper on Deep Deterministic Policy Gradients.</p> <blockquote><p>We consider a standard reinforcement learning setup consisting of an agent interacting with an environment E in discrete timesteps. At each timestep $t$ the agent receives an observation $x_t$, takes an action at and receives a scalar reward $r_t$. In all the environments considered here the actions are real-valued $a_t ∈ R_N$. In general, the environment may be partially observed so that the entire history of the observation, action pairs $s_t = (x_1, a_1, ..., a_{t-1}, x_t)$ may be required to describe the state. Here, we assumed the environment is fully-observed so $s_t = x_t$. An agent’s behavior is defined by a policy, $π$, which maps states to a probability distribution over the actions $π : S → P(A)$. The environment, $E$, may also be stochastic. We model it as a Markov decision process with a state space $S$, action space $A = R_N$ , an initial state distribution $p(s1)$, transition dynamics $p(s_{1+1}|s_t, a_t)$, and reward function $r(s_t, a_t)$.</p> </blockquote> <p>Not particularly informative for someone without a background in RL, is it? In this post I’m going to break down the various elements into chunks which will (hopefully) be easier to digest.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="State">State<a class="anchor-link" href="#State">&#182;</a></h2><p>The state is a numerical representation of the environment at a given timestep, modeled as a Markov Decision Process. An MDP is a way of representing processes with discrete time steps and events that are partially random and partially controlled by a decision maker. In the context of RL, the environment may be a video game, a physical robot and its surroundings, a software system, a smart home, or any complex system we might want to manipulate. The state is technically not always the same as the observation, which is the part of the state that is observable to the agent. The two terms are often used interchangeably and in many problem spaces they refer to the same thing.When only part of the observation is observable, the environment is called a Partially Observable Markov Decision Process (POMDP). Without extra enhancements (usually RNNs), standard MDPs are much easier than POMDPs to obtain good results from.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="Policy">Policy<a class="anchor-link" href="#Policy">&#182;</a></h2><p>The policy is the primary decision making mechanism in reinforcement learning. It is responsible for choosing an action to take based on the current state of the environment. In deep reinforcement learning, the policy can either be neural network or a simple max function over the values (which I elaborate upon below) of each possible action. Policies are often stochastic, meaning that they give probabilities that each possible action is the correct one and choose randomly based on those probabilities.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="Reward">Reward<a class="anchor-link" href="#Reward">&#182;</a></h2><p>The reward is a single-number measure of how good or bad an agent is doing. The goal of an RL agent is to maximize the total amount of reward received. Obtaining a reward signal that is easy to learn from and highly correlated with what you want your agent to achieve can be one of the most significant challenges of applying RL. On the simpler end of the spectrum are problems like Atari games, where a consistent numerical reward (the score) can be obtained by accessing the game’s memory. A more complicated example would be a simulation in which a robotic arm must grasp an object and move it to a specific location. It would be tempting to provide a reward of 1 if the goal was achieved and 0 otherwise, however, this kind of “sparse reward signal” is extremely difficult to learn from. In practice, researchers may engineer a reward function for each task, deciding on metrics that should be encouraged or penalized because they are correlated with eventual success at the task.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p><br></p> <p><img src="https://spinningup.openai.com/en/latest/_images/rl_diagram_transparent_bg.png" alt="From OpenAI&#39;s Spinning Up"></p> <p>Interaction between the pieces we've covered so far. Image is from OpenAI's Spinning Up Series</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="Value-Function">Value Function<a class="anchor-link" href="#Value-Function">&#182;</a></h2><p>The value of a state is simply the maximum expected reward that can be obtained from that state, and the value function (is usually a neural network that) predicts the value given a state or state-action pair. In order to avoid learning extremely shortsighted behavior, value functions consider not just the immediate reward but all possible rewards that can be achieved after.</p> <p>Imagine trying to learn how to perform a basic task, say, making and eating a sandwich, with total naivety and without the benefit of foresight. When faced with the choice between eating all of the bread plain or making it into a sandwich and eating that, most people would choose the latter (I hope). However, without the ability to take into account what happens in the future (eating a completed sandwich), in the moment the choice becomes one between eating the bread and not eating the bread. If eating the bread provides a reward and not eating the bread provides none, one would choose to eat the bread and never find out what happens if you postpone eating to create something else. This is the problem a value function tries to fix, by combining the predicted reward for an action with the estimated maximum possible future reward that can be obtained from a given state. In the case of the sandwich maker, eating the bread is what is called a “terminal state.” No future rewards are possible, since you have eaten all the bread and cannot make a sandwich. Not eating the bread, on the other hand, opens up the possibility of making a sandwich. While the immediate reward from this decision is zero, the value takes into account the highest possible reward that can be obtained in the future. Thus, the value takes into account the reward from eating the sandwich.</p> <p>This idea is captured by the Bellman Equation, where $Q(s, a)$ equals the value of an action $a$ taken from a particular state $s$ and $r(s, a)$ equals the immediate reward for the action:</p> <p>$Q^*(s,a) = E[{r(s,a) + \gamma \max_{a'} Q^*(s',a')}]$</p> <p>The equation is recursive in that $Q(s, a)$ is defined in terms of $Q(s’, a’)$, where $s’$ and $a’$ are the next state and action. When a terminal state is reached (the game is over), $Q(s_{final}, a_{final}) = r$ and the recursion terminates.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="Conclusion">Conclusion<a class="anchor-link" href="#Conclusion">&#182;</a></h2><p>I hope the paper quoted at the start of the post is easier to decipher now. The next post will cover classifications of environments/problems, the different types of DRL algorithms, and the trade-offs between the various approaches. In the process we will start to go into the specifics of how the abstract concepts of policy and value functions are actually implemented.</p> </div> </div> </div> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['$','$'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " linebreaks: { automatic: true, width: '95% container' }, " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')).appendChild(mathjaxscript); } </script> Deep RL Fundamentals #0: What is Deep RL and Why It's Worth Learning2019-08-14T00:00:00-04:002019-08-14T00:00:00-04:00Andrew Farabowtag:computable.ai,2019-08-14:/articles/2019/Aug/14/deep-rl-fundamentals-0-what-is-deep-rl-and-why-its-worth-learning.html<p>An introduction and statement of purpose for a series on the basics of deep reinforcement learning</p> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h3 id="Intro">Intro<a class="anchor-link" href="#Intro">&#182;</a></h3><p>Deep reinforcement learning - the fusion of trial-and-error learning and function-approximating neural networks - is one of the hottest areas of machine learning research right now and is the subject of much excitement, largely, I believe, because of how it resembles the endgame of AI research, artificial general intelligence, in a way that neither supervised nor unsupervised learning does. There is, however, a prevailing attitude that RL is not ready to be put to use in practical scenarios and instead belongs solely in the laboratories of universities and tech giants, conquering toy challenges and video games one at a time until it is ready to emerge. While the many present shortcomings of Deep RL provide good evidence for this viewpoint (some of which I will discuss later in the series), I think Deep RL is ready to tackle many real-world challenges and getting hobbyists/companies involved sooner rather than later would accelerate development. My immediate purposes for writing are to explain what reinforcement learning is and to kick off my post series about the major RL algorithms, but ultimately I want to encourage others to begin hacking away with DRL and try applying it to real-world problems.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h3 id="What-is-it?">What is it?<a class="anchor-link" href="#What-is-it?">&#182;</a></h3><p>Reinforcement learning algorithms attempt to attain a goal by taking actions in their environment, assessing their performance and altering their behavior. The performance assessment comes in the form of a reward signal derived from the environment. This could be the score of Pong game, the time since a humanoid robot last fell over, or a simple binary measure of whether a self-driving car has taken its passenger to the destination successfully or not. In order to maximize reward, any approach to reinforcement learning must have some structure to choose the correct action. In deep reinforcement learning, this is one or more neural networks. Neural networks are ideal because of their ability to generalize in complex, high-dimensional environments. Depending on the approach taken, they can take in the current state of the environment and output either an action to take or the desirability of a certain state.</p> <p>In this post I use reinforcement learning (RL) and deep reinforcement learning (DRL) interchangeably, however, they refer to slightly different concepts. As Sutton and Barto put it:</p> <blockquote><p>Reinforcement learning is like many topics with names ending in -ing, such as machine learning, planning, and mountaineering, in that it is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods.</p> </blockquote> <p>Deep reinforcement learning is one such type of solution method that utilizes neural networks. Other RL solutions exist, including dynamic programming and tabular reinforcement learning, which uses lookup tables to record the reward associated with encountered states instead of neural networks.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h3 id="Drawbacks">Drawbacks<a class="anchor-link" href="#Drawbacks">&#182;</a></h3><p>Modern reinforcement learning is not without its shortcomings. The foremost among these is sample efficiency - the number of times that an algorithm must observe a state, take an action, and improve is currently crippling for many use cases. Atari games that take humans minutes to pick up take state-of-the-art DRL algorithms millions of frames to master $^{1}$. In addition, reinforcement learning algorithms assume that the environment is a Markov decision process. This means that they assume that the optimal action to be taken in a certain state can be determined from a single observation. This poses a problem for many real-life problems that people would want to solve with RL. While recurrent and convolutional neural networks can help, they come at the cost of even worse sample efficiency.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h3 id="What&#8217;s-next?">What&#8217;s next?<a class="anchor-link" href="#What&#8217;s-next?">&#182;</a></h3><p>While I have experience working on a deep reinforcement learning-powered product, there are many areas in which my knowledge is lacking. In writing this post series, I hope to fill in some of those gaps. In the next post, I plan on elaborating upon the standard formulation of reinforcement learning (as described in the beginning of nearly every RL paper) and covering the major traits that differentiate approaches to DRL. I will be using Sutton and Barto’s Reinforcement Learning as my primary source and I recommend that anyone who is interested pick up a copy or <a href="http://incompleteideas.net/book/the-book-2nd.html">read the free online version</a>.</p> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>$^{1}$ <a href="https://arxiv.org/abs/1710.02298">Rainbow: Combining Improvements in Deep Reinforcement Learning</a></p> </div> </div> </div> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['$','$'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " linebreaks: { automatic: true, width: '95% container' }, " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')).appendChild(mathjaxscript); } </script>