<p>Computable AI - Miscellany: A Machine Intelligence Blog (https://computable.ai/)</p>
<p>Equivalence between Policy Gradients and Soft Q-Learning, by Braden Hoagland, 2019-08-12 (/articles/2019/Aug/12/equivalence-between-policy-gradients-and-soft-q-learning.html)</p>
<p>Inspecting the gradients of entropy-augmented policy updates to show their equivalence</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">¶</a></h1><p>This article will dive into a lot of the math surrounding the gradients of different maximum-entropy RL methods. In practice we usually work in the space of objective functions: with both policy gradients and Q-learning, we'll form an objective function and allow an autodiff library to calculate the gradients for us. We never have to see what's going on behind the scenes, which has its pros and cons. A benefit is that working with objective functions is much easier than calculating gradients by hand. On the other hand, it's easy to lose sight of what's really going on when we work at such an abstract level.</p>
<p>This abstraction issue is tackled in the paper <code>Equivalence Between Policy Gradients and Soft Q-Learning</code> (<a href="https://arxiv.org/abs/1704.06440">https://arxiv.org/abs/1704.06440</a>), and I think it provides some pretty eye-opening insights into what the most common RL algorithms are really doing. I'll be working off of version 4 of the paper from Oct. 2018, the most recent version of the paper at the time of writing.</p>
<p>First I'll walk through some of the basic definitions in the max-entropy RL setting, then I'll pick out the most important bits of math from the paper that show how entropy-augmented Q-learning is really just a policy gradient method.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Maximum-Entropy-RL-and-the-Boltzmann-Policy">Maximum Entropy RL and the Boltzmann Policy<a class="anchor-link" href="#Maximum-Entropy-RL-and-the-Boltzmann-Policy">¶</a></h1><p>In standard RL, we try to maximize expected cumulative reward $\mathbb{E}[\sum_t r_t]$. In the max-entropy setting, we augment this reward signal with an entropy bonus. The expected cumulative reward of a policy $\pi$ is commonly denoted as $\eta(\pi)$</p>
\begin{align*}
\eta(\pi) &= \mathbb{E} \Big[ \sum_t (r_t + \alpha \mathcal{H}(\pi)) \Big] \\
&= \mathbb{E} \Big[ \sum_t \big( r_t - \alpha \log\pi(a_t | s_t) \big) \Big]
\end{align*}<p>where $\pi$ is our current policy and $\alpha$ weights how important the entropy is in our reward definition. This intuitively makes the reward seem higher when our policy exhibits high entropy, allowing it to explore its environment more extensively. A key component of this augmented objective is that the entropy is <em>inside</em> the sum. Thus an optimal policy will not only try to act with high entropy <em>now</em>, but will act in such a way that it finds highly-entropic states in the <em>future</em>.</p>
<p>The paper uses slightly different notation, opting to use KL divergence (AKA "relative entropy") instead of just entropy. This uses a reference policy $\bar{\pi}$, which can be thought of as an old, worse policy that we wish to improve on</p>
\begin{align*}
\eta(\pi) &= \mathbb{E} \Big[ \sum_t \big( r_t - \alpha \log\pi(a_t|s_t) + \alpha \log\bar{\pi}(a_t|s_t) \big) \Big] \\
&= \mathbb{E} \Big[ \sum_t \big(r_t - \alpha D_{KL}(\pi \,\Vert\, \bar{\pi}) \big) \Big]
\end{align*}<p>In the max-entropy setting, optimal policies are stochastic and proportional to the exponential of the optimal Q-function. This can be expressed formally as</p>
$$ \pi^* \propto e^{Q^*(s,a)} $$<p>If this doesn't seem very intuitive, I would recommend a quick scan of the article <a href="https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/">https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/</a>. It offers a brief introduction to max-entropy RL (specifically for Q-learning) and some helpful intuitions as to why the above relationship is a good property for a policy to have.</p>
<p>To actually get a policy in this form, we'll change up the definition slightly</p>
$$
\pi = \frac{\bar{\pi} \, e^{Q(s,a) / \alpha}}{\mathbb{E}_{\bar{a}\sim\bar{\pi}} [e^{Q(s,\bar{a}) / \alpha}]}
$$<p>The numerator of this expression is simply stating that we want our new policy to be like our old policy, but slightly in the direction of $e^Q$. If $\alpha$ is higher (i.e. we want more entropy), we move less in the direction of $e^Q$. The denominator is a normalization constant that ensures that our entire expression is still a valid probability distribution (i.e. the sum over all possible actions comes out to 1).</p>
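<p>You may have noticed that the denominator of our policy plays the role of $e^{V/\alpha}$: in standard RL we have $V = \mathbb{E}_{a}[Q]$, and the max-entropy analogue replaces that plain average with a "soft" log-of-expected-exponential average. We'll define $V$ that way and use it to simplify our policy</p>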
\begin{align*}
V(s) &= \alpha \log \mathbb{E}_{a\sim\bar{\pi}} \big[ e^{Q(s,a)/\alpha} \big] \\
\pi &= \bar{\pi} \, e^{(Q(s,a) - V(s)) / \alpha}
\end{align*}<p>This new policy definition shows more directly that our policy is proportional to the exponential of the advantage. If our policy is proportional to $e^Q$, it should also be proportional to $e^A$, so this makes sense. From now on, we'll refer to this policy as the 'Boltzmann Policy' and denote it $\pi^B$.</p>
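<p>To make the two definitions above concrete, here is a minimal sketch for a single state with a handful of discrete actions. The Q-values, the uniform reference policy, and the function names are all made up for illustration; only the two formulas come from the text above.</p>

```python
import math


def soft_value(q_values, ref_probs, alpha):
    """V(s) = alpha * log E_{a ~ pi_bar}[exp(Q(s, a) / alpha)]."""
    # Subtract the max before exponentiating for numerical stability.
    q_max = max(q_values)
    expectation = sum(
        p * math.exp((q - q_max) / alpha) for q, p in zip(q_values, ref_probs)
    )
    return q_max + alpha * math.log(expectation)


def boltzmann_policy(q_values, ref_probs, alpha):
    """pi_B(a|s) = pi_bar(a|s) * exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_values, ref_probs, alpha)
    return [p * math.exp((q - v) / alpha) for q, p in zip(q_values, ref_probs)]


# Hypothetical Q-values for one state with three actions, uniform reference policy.
q_values = [1.0, 2.0, 0.5]
ref_probs = [1 / 3, 1 / 3, 1 / 3]

greedy_ish = boltzmann_policy(q_values, ref_probs, alpha=0.1)    # small alpha
exploratory = boltzmann_policy(q_values, ref_probs, alpha=10.0)  # large alpha
```

<p>With a small $\alpha$ nearly all of the probability lands on the highest-valued action, while a large $\alpha$ keeps the policy close to the reference policy. In both cases the probabilities sum to 1, which is exactly the job of the $V(s)$ term.</p>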
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-with-Boltzmann-Backups">Soft Q-Learning with Boltzmann Backups<a class="anchor-link" href="#Soft-Q-Learning-with-Boltzmann-Backups">¶</a></h1><p>From this point onward, there will inevitably be sections of math that seem to leave out non-trivial amounts of work. This is because I think this paper mainly benefits our intuitions about RL. The math proves these new intuitions, but by itself is hard to read. If you're curious and wish to go through all the derivations, I would highly recommend working through the full paper on your own. With that disclaimer out of the way, we can get started...</p>
<p>With normal Q-learning, we define our backup operator $\mathcal{T}$ as follows
$$
\mathcal{T}Q = \mathbb{E}_{r,s'} \big[ r + \gamma \mathbb{E}_{a'\sim\pi}[Q(s', a')] \big]
$$</p>
<p>In the max-entropy setting, we'll have to add in an entropy bonus to the reward signal and simplify accordingly</p>
\begin{align*}
\mathcal{T}Q &= \mathbb{E}_{r,s'} \Big[ r + \gamma \Big( \mathbb{E}_{a'\sim\pi}[Q(s', a')] - \alpha D_{KL} \big( \pi(\cdot|s') \;\Vert\; \bar{\pi}(\cdot|s') \big) \Big) \Big] \\
&= \mathbb{E}_{r,s'} \big[ r + \gamma \alpha \log \mathbb{E}_{a'\sim\bar{\pi}}[e^{Q(s',a')/\alpha}] \big]
\end{align*}<p>See equations 11 and 13 from the paper (which rely on equations 2-6) if you want to see exactly how that simplification works. To actually perform the optimization step $Q \gets \mathcal{T}Q$, we'll minimize the mean squared error between our current $Q$ and an estimate of $\mathcal{T}Q$. Our regression targets can be defined</p>
\begin{align*}
y &= r + \gamma \alpha \log \mathbb{E}_{a'\sim\bar{\pi}} \big[ e^{Q(s', a') / \alpha} \big] \\
&= r + \gamma V(s')
\end{align*}<p>Using Boltzmann backups instead of the traditional Q-learning backups is what transforms normal Q-learning into what's conventionally called "soft" Q-learning. That's really all there is to it.</p>
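<p>Here is a small sketch of that regression target next to the familiar max-based DQN target, for a discrete action space. The <code>done</code> flag for terminal transitions and all of the numbers are my own assumptions for the example, not something taken from the paper.</p>

```python
import math


def soft_value(q_values, ref_probs, alpha):
    """V(s) = alpha * log E_{a ~ pi_bar}[exp(Q(s, a) / alpha)]."""
    q_max = max(q_values)  # shift by the max for numerical stability
    expectation = sum(
        p * math.exp((q - q_max) / alpha) for q, p in zip(q_values, ref_probs)
    )
    return q_max + alpha * math.log(expectation)


def soft_q_target(reward, next_q_values, ref_probs, alpha, gamma, done=False):
    """Boltzmann backup target: y = r + gamma * V(s')."""
    if done:  # assumed convention: no bootstrapping past a terminal state
        return reward
    return reward + gamma * soft_value(next_q_values, ref_probs, alpha)


def hard_q_target(reward, next_q_values, gamma, done=False):
    """Max-based target as used by standard DQN, for comparison."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical transition: reward 1.0, three actions available in s'.
next_q = [1.0, 2.0, 0.5]
uniform = [1 / 3, 1 / 3, 1 / 3]

y_soft = soft_q_target(1.0, next_q, uniform, alpha=0.001, gamma=0.99)
y_hard = hard_q_target(1.0, next_q, gamma=0.99)
```

<p>As $\alpha$ shrinks, the soft target approaches the hard one from below (as long as the reference policy puts some probability on the best action), which is one way to see why the only code change needed is in how the target is computed.</p>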
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Policy-Gradients-and-Entropy">Policy Gradients and Entropy<a class="anchor-link" href="#Policy-Gradients-and-Entropy">¶</a></h1><p>I'm assuming you have a solid grasp of policy gradients if you're reading this article, so I'm gonna focus on how they usually aren't applied correctly in the max-entropy setting. PG methods are commonly augmented with an entropy term, like with the following example provided from the paper</p>
$$
\mathbb{E}_{t, s,a} \Big[ \nabla_\theta \log\pi_\theta(a|s) \sum_{t' \geq t} r_{t'} - \alpha \nabla_\theta D_{KL}\big (\pi_\theta(\cdot|s) \;\Vert\; \bar{\pi}(\cdot|s) \big) \Big]
$$<p>This update essentially tries to maximize reward-to-go plus an entropy bonus for only the <em>current</em> timestep. Following this gradient technically isn't what we want, even if it's common practice. What we really want is to maximize a sum over all rewards and entropies that our agent experiences from now into the future.</p>
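<p>The difference between the two is easy to see on a toy trajectory. The sketch below is undiscounted, uses made-up rewards and per-step KL values, and the function names are hypothetical. It only illustrates where the penalty sits relative to the sum, not the paper's full estimator.</p>

```python
def naive_signal(rewards, kl_penalties, alpha):
    """Reward-to-go, with a penalty for the current timestep only."""
    return [
        sum(rewards[t:]) - alpha * kl_penalties[t] for t in range(len(rewards))
    ]


def augmented_signal(rewards, kl_penalties, alpha):
    """Reward-to-go where every future step carries its own penalty."""
    augmented = [r - alpha * kl for r, kl in zip(rewards, kl_penalties)]
    return [sum(augmented[t:]) for t in range(len(augmented))]


rewards = [1.0, 0.0, 2.0]
kl_penalties = [0.5, 0.1, 0.3]  # hypothetical per-step KL values
alpha = 0.1

naive = naive_signal(rewards, kl_penalties, alpha)
augmented = augmented_signal(rewards, kl_penalties, alpha)
```

<p>The two signals agree on the final timestep and drift apart the further back you go, since the augmented version keeps accumulating penalties from every later step.</p>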
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-=-Policy-Gradient">Soft Q-Learning = Policy Gradient<a class="anchor-link" href="#Soft-Q-Learning-=-Policy-Gradient">¶</a></h1><p>The first of two conclusions that this paper comes to is that Soft Q-Learning and the Policy Gradient have exact first-order equivalence. Using the value function and Boltzmann policy definitions from earlier, we can derive the gradient of $\mathbb{E}_{s,a} \big[ \frac{1}{2} \Vert Q_\theta(s,a) - y \Vert^2 \big]$. The paper is able to produce the following expression</p>
$$
\mathbb{E}_{s,a} \Big[ \color{red}{-\alpha \nabla_\theta \log\pi_\theta(a|s) \Delta_{TD} + \alpha^2 \nabla_\theta D_{KL}\big( \pi_\theta(\cdot|s) \;\Vert\; \bar{\pi}(\cdot|s) \big)} + \color{blue}{\nabla_\theta \frac{1}{2} \Vert V_\theta(s) - \hat{V} \Vert^2} \Big]
$$<p>where $\Delta_{TD}$ is the discounted n-step TD error and $\hat{V}$ is the value regression target formed by $\Delta_{TD}$.</p>
<p>That's kind of a lot, but we can break it down pretty easily. The terms in red represent 1) the usual policy gradient and 2) an additional KL divergence gradient term. The red terms overall represent the gradient you get if you use a policy gradient algorithm with a KL divergence term as your entropy bonus (the actor loss in an actor-critic formulation). The term in blue is quite simply the gradient used to minimize the mean squared error between our current value estimates and our value targets (the critic loss in an actor-critic formulation).</p>
<p>Don't forget that we never explicitly tried to calculate these terms. They came about naturally as an effect of minimizing mean squared error of our Q function and a Boltzmann backup target.</p>
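<p>One step behind this decomposition can be sketched directly from the Boltzmann policy definition from earlier. This is only the first move, not the paper's full derivation, and it treats the target $y$ as a constant. Rearranging $\pi_\theta = \bar{\pi} \, e^{(Q_\theta - V_\theta)/\alpha}$ and differentiating gives</p>
\begin{align*}
Q_\theta(s,a) &= V_\theta(s) + \alpha \log \frac{\pi_\theta(a|s)}{\bar{\pi}(a|s)} \\
\nabla_\theta Q_\theta(s,a) &= \nabla_\theta V_\theta(s) + \alpha \nabla_\theta \log\pi_\theta(a|s) \\
\nabla_\theta \tfrac{1}{2} \big( Q_\theta(s,a) - y \big)^2 &= \big( Q_\theta(s,a) - y \big) \big( \nabla_\theta V_\theta(s) + \alpha \nabla_\theta \log\pi_\theta(a|s) \big)
\end{align*}<p>So every Q-regression step already contains one piece that moves the value estimate and one piece that moves the log-probability of the policy, which is where the critic-like and actor-like terms come from.</p>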
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-and-the-Natural-Policy-Gradient">Soft Q-Learning and the Natural Policy Gradient<a class="anchor-link" href="#Soft-Q-Learning-and-the-Natural-Policy-Gradient">¶</a></h1><p>The next section of the paper details another connection between Soft Q-learning and policy gradient methods, specifically that damped Q-learning updates are exactly equivalent to natural policy gradient updates.</p>
<p>The natural policy gradient weights the policy gradient with the Fisher information matrix $\mathbb{E}_{s,a} \Big[ \big( \nabla_\theta \log\pi_\theta(a|s) \big)^T \big( \nabla_\theta \log\pi_\theta(a|s) \big) \Big]$. The paper shows that the natural policy gradient in the max-entropy setting is equivalent not to soft Q-learning by itself, but instead to a damped version. In this damped version, we calculate a backed-up Q value and then interpolate between it and the current Q value estimate (basically using Polyak averaging instead of running gradient descent on a mean squared error term).</p>
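<p>As a rough sketch of what "damped" means here, for a tabular set of Q-values (<code>step_size</code> is my own name for the interpolation weight, not the paper's notation, and the numbers are made up):</p>

```python
def damped_update(q_current, q_backup, step_size):
    """Interpolate between the current estimates and their backed-up values."""
    return [
        (1 - step_size) * q + step_size * b for q, b in zip(q_current, q_backup)
    ]


q_current = [0.0, 2.0]  # hypothetical current Q estimates
q_backup = [1.0, 4.0]   # hypothetical backed-up values for the same entries

halfway = damped_update(q_current, q_backup, step_size=0.5)
```

<p>A <code>step_size</code> of 1 recovers the undamped update $Q \gets \mathcal{T}Q$, and a <code>step_size</code> of 0 leaves the estimates untouched.</p>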
<p>Although not nearly as direct, this connection highlights how higher-order connections between soft Q-learning and policy gradient methods exist. Higher-order equalities between functions point to functions that are increasingly similar, so this connection really drives the point home that soft Q-learning is deceptively like the policy gradient methods we've been using all this time.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Experimental-Results">Experimental Results<a class="anchor-link" href="#Experimental-Results">¶</a></h1><p>The paper authors decided to be nice to us and actually test the theory they derived on some Atari games.</p>
<p>They started out with testing whether or not the usual way of adding entropy bonuses to policy gradient methods is actually worse than the theoretical claims they had just made. As it turns out, using future entropy bonuses $\Big( \text{i.e. } \big( \sum r + \mathcal{H} \big) \Big)$ instead of the simpler, immediate entropy bonus $\Big( \text{i.e. } \big( \sum r \big) + \mathcal{H} \Big)$ results in either similar or superior performance. The below graphs show the results from the experiments, with the future entropy version in blue and the immediate entropy version in red.</p>
<p><img src="https://computable.ai/images/proper_entropy.png" alt="image.png"></p>
<p>They then tested how soft Q-learning compared to normal Q-learning. To make traditional DQN into soft Q-learning, they just modified the regression targets for the Q function. They used the normal target, a target with a KL divergence penalty, and a target with just an entropy bonus. They found that just the entropy bonus resulted in the most improvement, although both soft methods outperformed the "hard" DQN.</p>
<p><img src="https://computable.ai/images/q_hard_soft.png" alt="image.png"></p>
<p>To round things out, they tested soft Q-learning and the policy gradient on the same Atari environments to see if they were equivalent in practice. After all, the math shows that their expectations are equivalent, but the variance of those expectations could be different. The experiments they ran make it seem like the two methods are pretty close to each other, with no method seeming largely superior.</p>
<p><img src="https://computable.ai/images/pg_ql.png" alt="image.png"></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Conclusion-and-Future-Work">Conclusion and Future Work<a class="anchor-link" href="#Conclusion-and-Future-Work">¶</a></h1><p>Hopefully this made you reconsider what's really going on under the hood with Q-learning. Personally, it blew my mind that two seemingly disparate learning methods could boil down to the same expected update. The theoretical possibilities that this connection could lead to are also incredibly exciting.</p>
<p>Of course, this paper focuses its empirical testing just on environments with discrete action spaces. Since the Boltzmann policy is intractable to sample from in continuous action spaces, more advanced soft Q-learning algorithms (such as Soft Actor-Critic) are currently being pioneered to get accurate results in those more complicated settings as well.</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<p>Distributional Deep Q-Learning, by Braden Hoagland, 2019-07-26 (/articles/2019/Jul/26/distributional-deep-q-learning.html)</p>
<p>Expanding DQN to produce estimates of return distributions, and an exploration into why this helps learning</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overview">Overview<a class="anchor-link" href="#Overview">¶</a></h1><p>I recently stumbled upon the world of distributional Q-learning, and I hope to share some of the insights I've gained from reading the following papers:</p>
<ul>
<li>A Distributional Perspective on Reinforcement Learning: <a href="https://arxiv.org/abs/1707.06887">https://arxiv.org/abs/1707.06887</a></li>
<li>Implicit Quantile Networks for Distributional Reinforcement Learning: <a href="https://arxiv.org/abs/1806.06923">https://arxiv.org/abs/1806.06923</a></li>
</ul>
<p>This article will loosely work through the two papers in order, as they build on each other, but hopefully I can trim off most of the extraneous information and present you with a nice overview of distributional RL, how it works, and how to improve upon the most basic distributional algorithms to get to the current state-of-the-art.</p>
<p>First I'll introduce distributional Q-learning and try to provide some motivations for using it. Then I'll highlight the strategies used in the development of C51, one of the first highly successful distributional Q-learning algorithms (paper #1). Then I'll introduce implicit quantile networks (IQNs) and explain their improvements to C51 (paper #2).</p>
<p><em>Quick disclaimer: I'm assuming you're familiar with how Q-learning works. That includes V and Q functions, Bellman backups, and the various learning stability tricks like target networks and replay buffers that are commonly used.</em></p>
<p><em>Another important note is that these algorithms are only for <strong>discrete</strong> action spaces.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Motivations-for-Distributional-Deep-Q-Learning">Motivations for Distributional Deep Q-Learning<a class="anchor-link" href="#Motivations-for-Distributional-Deep-Q-Learning">¶</a></h1><p>In standard Q-Learning, we attempt to learn a function $Q(s, a): \mathcal{S \times A} \rightarrow \mathbb{R}$ that maps state-action pairs to the expected return from that state-action pair. This gives us a pretty accurate idea of how good specific actions are in specific states (if our $Q$ is accurate), but it's missing some information. There exist distributions of returns that we can receive from each state-action pair, and the expectations/means of these distributions are what $Q$ attempts to learn. But why only learn the expectation? Why not try to learn the whole distribution?</p>
<p>Before diving into the algorithms that have been developed for this specific purpose, it's helpful to think about why this is beneficial in the first place. After all, learning a distribution is a lot more complicated than learning a single number, and we don't want to waste precious computational resources on doing something that doesn't help much.</p>
<h3 id="Stabilized-Learning">Stabilized Learning<a class="anchor-link" href="#Stabilized-Learning">¶</a></h3><p>The first possibility I'll throw out there is that learning distributions could stabilize learning. This may seem unintuitive at first, seeing as we're trying to learn something much more complicated than an ordinary $Q$ function. But let's think about what happens when stochasticity in our environment results in our agent receiving a highly unusual return. I'll use the example of driving a car through an intersection.</p>
<p>Let's say you're waiting at a red light that turns green. You begin to drive forward, expecting to simply cruise through the intersection and be on your way. Your internal model of your driving is probably saying "there's no way anything bad will happen if you go straight right now", and there's no reason to think otherwise. But now let's say another driver on the road perpendicular to yours runs straight through their red light and crashes into you. You would be right to be incredibly surprised by this turn of events (and hopefully not dead, either), but how surprised should you be?</p>
<p>If your internal driving model was based only on expected returns, then you wouldn't predict that this accident would occur at all. And since it just <em>did</em> happen, you may be tempted to drastically change your internal model and, as a result, be scared of intersections for quite a bit until you're convinced that they're safe again; however, what if your driving model was based on a <em>distribution</em> over all possible returns? If you mentally assigned a probability of 0.00001 to this accident occurring, and if you've driven through 100,000 intersections before throughout your lifetime, then this accident isn't really that surprising. It still totally sucks and your car is probably totaled, but you shouldn't be irrationally scared of intersections now. After all, you just proved that your model was right!</p>
<p>So yeah that's kinda dark, but I think it highlights how learning a distribution instead of an expectation can reduce the effects of environment stochasticity.<sup>1</sup></p>
<h3 id="Risk-Sensitive-Policies">Risk Sensitive Policies<a class="anchor-link" href="#Risk-Sensitive-Policies">¶</a></h3><p>Using distributions over returns also allows us to create brand new classes of policies that take risk into account when deciding which actions to take. I'll use another example that doesn't involve driving but is equally deadly :) Let's say you need to cross a gorge in the shortest amount of time possible (I'm not sure why, but you do. This is a poorly formulated example). You have two options: using a sketchy bridge that looks like it may fall apart at any moment, or you could walk down a set of stairs on one side of the gorge and then up a set of stairs on the other side. The latter option is incredibly safe. It'll still take significantly longer than using the bridge, though, so is it worth it?</p>
<p>For the purposes of this example, let's give dying a reward of $-1000$ and give every non-deadly interaction with the environment a reward of $-1$. Let's also say that taking the bridge gets you across the gorge in $10$ seconds with probability $0.5$ of making it across safely. Taking the stairs gets you across the gorge $100\%$ of the time, but it takes $100$ seconds instead.</p>
<p>Given this information, we can quickly calculate expected returns for each of the two actions</p>
\begin{align*}
\mathbb{E}[\text{return}_\text{bridge}] &= (-1000 \times 0.5) + (-10 \times 0.5) = -505 \\
\mathbb{E}[\text{return}_\text{stairs}] &= -100
\end{align*}<p>If you made decisions like a standard Q-learning agent, you would never take the bridge. The expected return is much worse than that of taking the stairs, so there's no reason to choose it. But if you made decisions like a distributional Q-learning agent, your decision can be much better informed. You can be aware of the probability of dying vs. getting across the gorge more quickly by using the bridge. If the risk of falling to your death is worth it in your particular situation (let's say you're being chased by a wild animal who can run much faster than you), then taking the bridge instead of the stairs could end up being what you want.</p>
<p>Although this example was pretty contrived, it highlights how using return distributions allows us to choose policies that before would have been impossible to formulate. Want a policy that takes as little risk as possible? We can do that now. Want a policy that takes as much risk as possible? Go right ahead, but please don't fall into any gorges.</p>
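<p>The gorge example is small enough to write out in full. The sketch below stores each option as a list of (return, probability) pairs and asks three different questions of it. The "cross within 50 seconds" threshold and the function names are things I made up to illustrate a risk-seeking criterion.</p>

```python
# Each option is a list of (return, probability) pairs from the example above.
bridge = [(-1000.0, 0.5), (-10.0, 0.5)]
stairs = [(-100.0, 1.0)]


def expected_return(dist):
    """What a standard Q-learning agent would look at."""
    return sum(ret * prob for ret, prob in dist)


def prob_at_least(dist, threshold):
    """Probability of ending up with a return of at least `threshold`."""
    return sum(prob for ret, prob in dist if ret >= threshold)


def worst_case(dist):
    """The lowest return that can actually happen."""
    return min(ret for ret, prob in dist if prob > 0)


# Expected return prefers the stairs.
mean_bridge, mean_stairs = expected_return(bridge), expected_return(stairs)

# If you are being chased and must cross within 50 seconds (return >= -50),
# only the bridge gives you any chance at all.
fast_bridge, fast_stairs = prob_at_least(bridge, -50.0), prob_at_least(stairs, -50.0)

# A maximally risk-averse agent compares worst cases instead.
worst_bridge, worst_stairs = worst_case(bridge), worst_case(stairs)
```

<p>All three criteria read from the same return distribution. An agent that only stores the mean can answer the first question but not the other two.</p>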
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-Distributional-Q-Learning-Framework">The Distributional Q-Learning Framework<a class="anchor-link" href="#The-Distributional-Q-Learning-Framework">¶</a></h1><p>So now we have a few reasons why using distributions over returns instead of just expected return can be useful, but we need to formulate a few things first so that we can use Q-learning strategies in this new setting.</p>
<p>We'll define $Z(s, a)$ to be the distribution of returns at a given state-action pair, where $Q(s, a)$ is the expected value of $Z(s, a)$.</p>
<p>The usual Bellman equation for $Q$ is defined</p>
$$
Q(s, a) = \mathbb{E}[r(s, a)] + \gamma \mathbb{E}[Q(s', a')]
$$<p>Now we'll change this to be defined in terms of entire distributions instead of just expectations by using $Z$ instead of $Q$. We'll denote the distribution of rewards for a single state-action pair $R(s,a)$.</p>
$$
Z(s, a) = R(s, a) + \gamma Z(s', a')
$$<p>All we need now is a way of iteratively enforcing this Bellman constraint on our $Z$ function. With standard Q-learning, we can do that quite simply by minimizing mean squared error between the outputs of a neural network (which approximates $Q$) and the values $\mathbb{E}[r(s, a)] + \gamma \mathbb{E}[Q(s', a')]$ computed using a target Q-network and transitions sampled from a replay buffer.</p>
<p>Such a straightforward solution doesn't exist in the distributional case because the output from our Z-network is so much more complex than from a Q-network. First we have to decide what kind of distribution to output. Can we approximate return distributions with a simple Gaussian? A mixture of Gaussians? Is there a way to output a distribution of arbitrary complexity? Even if we can output really complex distributions, can we sample from that in a tractable way? And once we've decided on how we'll represent the output distribution, we'll then have to choose a new metric to optimize other than mean squared error since we're no longer working with just scalar outputs. Many ways of measuring the difference between probability distributions exist, but we'll have to choose one to use.</p>
<p>These two problems are what the C51 and IQN papers deal with. They both take different approaches to approximating arbitrarily complex return distributions, and they optimize them differently as well. Let's start off with C51: the algorithm itself is a bit complex, but its foundational ideas are rather simple. I won't dive into the math behind C51, and I'll instead save that for IQN since that's the better algorithm.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="C51">C51<a class="anchor-link" href="#C51">¶</a></h1><p>The main idea behind C51 is to approximate the return distribution using a set of discrete bars which the paper authors call 'atoms'. This is like using a histogram to plot out a distribution. It's not the most accurate, but it gives us a good sense of what the distribution looks like in general. This strategy also leads to an optimization strategy that isn't too computationally expensive, which is what we want.</p>
<p>Our network can simply output $N$ probabilities, where all $N$ probabilities sum to $1$. Each of these probabilities represents one of the bars in our distribution approximation. The paper recommends using 51 atoms (network outputs) based on empirical tests, but the algorithm is defined for any number of atoms, so $N$ is just a hyperparameter you can tune.</p>
<p>To minimize the difference between our current distribution outputs and their target values, the paper recommends minimizing the KL divergence of the two distributions. They accomplish this indirectly by minimizing the cross entropy between the distributions instead.</p>
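<p>The reason cross entropy works as a stand-in is that it differs from the KL divergence only by the entropy of the target, which doesn't depend on the network's output. Here is a quick numerical check with two made-up distributions over three atoms (a sketch, not code from the paper):</p>

```python
import math


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def entropy(dist):
    """H(dist) = -sum_i dist_i * log(dist_i)."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def kl_divergence(target, predicted):
    """KL(target || predicted) = sum_i target_i * log(target_i / predicted_i)."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


target = [0.1, 0.6, 0.3]     # hypothetical target atom probabilities
predicted = [0.2, 0.5, 0.3]  # hypothetical network output

kl = kl_divergence(target, predicted)
ce = cross_entropy(target, predicted)
h = entropy(target)
```

<p>Here <code>kl</code> equals <code>ce - h</code> up to floating point error. Since <code>h</code> is a constant as far as the network's parameters are concerned, both losses have the same gradient.</p>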
<p>The idea behind this is simple enough, but the math gets a bit funky. Since the distribution that our network outputs is split into discrete units, the theoretical Bellman update has to be projected into that discrete space and the probabilities of each atom distributed to neighboring atoms to keep the distribution relatively smooth.</p>
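<p>To give a feel for that projection, here is a sketch of it as it is commonly implemented for a single transition: each atom is shifted by the reward and shrunk by $\gamma$, clipped to the support, and its probability is split between the two nearest atoms. The function and argument names are mine, and the five-atom support is only there to keep the example readable.</p>

```python
import math


def project_target(next_probs, reward, gamma, v_min, v_max):
    """Project the shifted atoms r + gamma * z_j back onto the fixed support."""
    n_atoms = len(next_probs)
    delta_z = (v_max - v_min) / (n_atoms - 1)
    projected = [0.0] * n_atoms
    for j, prob in enumerate(next_probs):
        z_j = v_min + j * delta_z
        # Apply the Bellman update to this atom, then clip it to the support.
        tz = min(v_max, max(v_min, reward + gamma * z_j))
        # Fractional position of the updated atom on the support.
        b = min(float(n_atoms - 1), max(0.0, (tz - v_min) / delta_z))
        lower, upper = int(math.floor(b)), int(math.ceil(b))
        if lower == upper:
            # The updated atom lands exactly on an existing atom.
            projected[lower] += prob
        else:
            # Split the mass in proportion to closeness to each neighbor.
            projected[lower] += prob * (upper - b)
            projected[upper] += prob * (b - lower)
    return projected


# Five atoms at 0, 1, 2, 3, 4 with all of the mass on the atom at 2.
next_probs = [0.0, 0.0, 1.0, 0.0, 0.0]
target = project_target(next_probs, reward=0.5, gamma=1.0, v_min=0.0, v_max=4.0)
```

<p>The updated atom sits at 2.5, halfway between the atoms at 2 and 3, so each of those receives half of the probability mass and the mean of the distribution stays at 2.5.</p>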
<p>To actually use the discretized distribution to make action choices, the paper authors just use the weighted mean of the atoms. This weighted mean is effectively just an approximation of the standard Q-value.</p>
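<p>In code, that action choice might look like the following sketch. The three-atom support and the atom probabilities are made up, and the helper names are hypothetical.</p>

```python
def atom_support(v_min, v_max, n_atoms):
    """Evenly spaced return values that the atoms sit on."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)]


def q_value(atom_probs, support):
    """Mean of the discretized return distribution."""
    return sum(p * z for p, z in zip(atom_probs, support))


def greedy_action(probs_per_action, support):
    """Pick the action whose return distribution has the highest mean."""
    q_values = [q_value(probs, support) for probs in probs_per_action]
    return max(range(len(q_values)), key=q_values.__getitem__)


support = atom_support(-1.0, 1.0, 3)  # atoms at -1, 0, 1
probs_per_action = [
    [0.1, 0.8, 0.1],  # action 0: mean 0.0
    [0.0, 0.5, 0.5],  # action 1: mean 0.5
]
best = greedy_action(probs_per_action, support)
```

<p>Everything about the shape of the two distributions is thrown away at this point. Only the means are compared.</p>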
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="IQN">IQN<a class="anchor-link" href="#IQN">¶</a></h1><p>C51 works well, but it has some pretty obvious flaws. First off, its distribution approximations aren't going to be very precise. We can use a massive neural network during training, but all those neurons' information gets funneled into just $N$ output atoms at the end of the day. This is the bottleneck on how accurate our network can get, but increasing the number of atoms will increase the amount of computation our algorithm requires.</p>
<p>A second issue with C51 is that it doesn't take full advantage of knowing return distributions. When deciding which actions to take, it just uses the mean of its approximate return distribution. Under optimality, this is really no different than standard Q-learning.</p>
<p>Implicit quantile networks address both of these issues: they allow us to approximate much more complex distributions without additional computation requirements, and they also allow us to easily decide how risky our agent will be when acting.</p>
<h3 id="Implicit-Networks">Implicit Networks<a class="anchor-link" href="#Implicit-Networks">¶</a></h3><p>The first issue with C51 is addressed by not explicitly representing a return distribution with our neural networks. If we do this, then our chosen representation of the distribution acts as a major bottleneck in terms of how accurate our approximations can be. Additionally, sampling from arbitrarily complex distributions is intractable if we want to represent them explicitly. IQN's solution: don't train a network to explicitly represent a distribution, train a network to provide samples from the distribution instead.</p>
<p>Since we aren't explicitly representing any distributions, that means our accuracy bottleneck rests entirely in the size of our neural network. This means we can easily make our distribution approximations more accurate without adding on much to the amount of required computation.</p>
<p>Additionally, since our network is being trained to provide us samples from some unknown distribution, the intractable sampling problem goes away.</p>
<p>The second issue with C51 (not using risk-sensitive policies) is also addressed by using implicit networks. We haven't gone over how we'll actually <em>implement</em> such networks, but trust me when I say that we'll be able to easily manipulate the input to them to induce risky or risk-averse action decisions.</p>
<h3 id="Quantile-Functions">Quantile Functions<a class="anchor-link" href="#Quantile-Functions">¶</a></h3><p>Before we go through the implementation of these mysterious implicit networks, we have to go over a few other things about probability distributions that we'll use when deriving the IQN algorithm.</p>
<p>First off, every probability distribution has what's called a cumulative distribution function (CDF). If the probability of getting the value $35$ out of a probability distribution $P(X)$ is denoted $P(X = 35)$, then the <em>cumulative</em> probability of getting $35$ from that distribution is $P(X \leq 35)$, the probability of getting any value up to and including $35$.</p>
<p>The CDF of a distribution does exactly that, except it defines a cumulative probability for all possible outputs of the distribution. You can think of the CDF as really just an integral from the beginning of a distribution up to a given point on it. A nice property of CDFs is that their outputs are bounded between 0 and 1. This should be pretty intuitive, since the integral over a probability distribution has to be equal to 1. An example of a CDF for a unit Gaussian distribution is shown below.</p>
<p><img alt="CDF" src="https://computable.ai/images/cdf.svg" /></p>
<p>Quantile functions are closely related to CDFs. In fact, they're just the inverse. CDFs take in an $x$ and return a probability, but quantile functions take in a probability and return an $x$. The quantile function for a unit Gaussian (same as with the previous example CDF) is shown below.</p>
<p><img alt="Quantile Function" src="https://computable.ai/images/quantile.svg" /></p>
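<p>To make the inverse relationship concrete, here is a minimal sketch using the unit Gaussian from the plots. It relies only on Python's standard-library <code>statistics.NormalDist</code>; the helper names <code>cdf</code> and <code>quantile</code> are just illustrative wrappers, not anything from the IQN paper.</p>

```python
from statistics import NormalDist

# Unit Gaussian: mean 0, standard deviation 1.
unit_gaussian = NormalDist(mu=0.0, sigma=1.0)


def cdf(x: float) -> float:
    """Takes in an x and returns the cumulative probability P(X <= x)."""
    return unit_gaussian.cdf(x)


def quantile(tau: float) -> float:
    """Takes in a probability tau in (0, 1) and returns the matching x."""
    return unit_gaussian.inv_cdf(tau)


# Passing the CDF's output to the quantile function recovers the original x.
for x in (-1.0, 0.0, 1.5):
    tau = cdf(x)
    print(f"x={x:+.2f}  CDF(x)={tau:.4f}  quantile(CDF(x))={quantile(tau):+.4f}")
```

<p>Note that <code>inv_cdf</code> only accepts probabilities strictly between 0 and 1, since the Gaussian quantile function diverges at the endpoints.</p>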
<h3 id="Representing-an-Implicit-Distribution">Representing an Implicit Distribution<a class="anchor-link" href="#Representing-an-Implicit-Distribution">¶</a></h3><p>Now we can finally get to the fun stuff: figuring out how to represent an arbitrarily complex distribution implicitly. Seeing as I just went on a bit of a detour to talk about quantile functions, you probably already know that that's what we're gonna use. But how and why will that work for us?</p>
<p>First off, quantile functions all have the same input domain, regardless of whatever distribution they're for. Your distribution could be uniform, Gaussian, energy-based, whatever really, and its quantile function would only accept input values between 0 and 1. Since we want to represent any arbitrary distribution, this definitely seems like a property that we want to take advantage of.</p>
<p>Additionally, using quantile functions allows us to sample directly from our distribution without ever having an explicit representation of the distribution. Sampling from the uniform distribution $U([0, 1])$ and passing that as input to our quantile function is equivalent to sampling directly from $Z(s, a)$. Since we can implement this entirely within a neural network, this means there's no major accuracy bottleneck either.</p>
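<p>The sampling trick described above can be sketched in a few lines. In IQN the quantile function would be the learned network; here I'm assuming a closed-form stand-in (the quantile function of an exponential distribution with a made-up <code>rate</code>) purely so the result can be checked against a known mean.</p>

```python
import math
import random


def exponential_quantile(tau: float, rate: float = 2.0) -> float:
    """Quantile function (inverse CDF) of an exponential distribution.

    Stand-in for a learned quantile network; `rate` is an illustrative choice.
    """
    return -math.log(1.0 - tau) / rate


def sample(rng: random.Random, rate: float = 2.0) -> float:
    """Draw tau ~ U([0, 1)) and push it through the quantile function."""
    tau = rng.random()
    return exponential_quantile(tau, rate)


rng = random.Random(0)
draws = [sample(rng) for _ in range(100_000)]
empirical_mean = sum(draws) / len(draws)

# The true mean of an exponential distribution is 1 / rate = 0.5.
print(f"empirical mean: {empirical_mean:.3f}")
```

<p>We never wrote down the density of the distribution anywhere, yet the samples behave as if they were drawn from it directly.</p>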
<p>We can also add in another feature to our implicit network to give us the ability to make risk-sensitive policy decisions. We can quite simply distort the input to our quantile network. If we want to make the tails of our distribution less important, for example, then we can map input values closer to 0.5 before passing them to our quantile function.</p>
<h3 id="Formalization">Formalization<a class="anchor-link" href="#Formalization">¶</a></h3><p>We've gone over a lot, so let's take a step back and formalize it a bit. The usual convention for denoting a quantile function over random variable $Z$ (our return) would be $F^{-1}_{Z}(\tau)$, where $\tau \in [0, 1]$. For simplicity's sake, though, we'll define</p>
$$
Z_\tau \doteq F^{-1}_{Z}(\tau)
$$<p>We can also define sampling from $Z(s, a)$ with the following</p>
$$
Z_\tau(s, a), \\
\tau \sim U([0, 1])
$$<p>To distort our $\tau$ values, we'll define a mapping</p>
$$
\beta : [0, 1] \rightarrow [0, 1]
$$<p>Putting these definitions together, we can define a new distorted Q-value</p>
$$
Q_{\beta}(s, a) \doteq \mathbb{E}_{\tau \sim U([0, 1])} [Z_{\beta(\tau)}(s, a)]
$$<p>To define our policy, we can just take whichever action maximizes this distorted Q-value</p>
$$
\pi_{\beta}(s) = \arg\max\limits_{a \in \mathcal{A}} Q_{\beta}(s, a)
$$<h3 id="Optimization">Optimization<a class="anchor-link" href="#Optimization">¶</a></h3><p>Now to figure out a way to iteratively update our distribution approximations... We'll use Huber quantile loss, a nice metric that extends Huber loss to work with quantiles instead of just scalar outputs</p>
$$
\rho^\kappa_\tau(\delta_{ij}) = | \tau - \mathbb{I}\{ \delta_{ij} < 0 \} | \frac{\mathcal{L}_\kappa(\delta_{ij})}{\kappa}, \text{with} \\
\mathcal{L}_\kappa(\delta_{ij}) = \begin{cases}
\frac{1}{2} \delta^2_{ij} &\text{if } | \delta_{ij} | \leq \kappa \\
\kappa (| \delta_{ij} | - \frac{1}{2} \kappa) &\text{otherwise}
\end{cases}
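$$<p>This is a messy loss term, but the idea behind it is simple. The Huber term $\mathcal{L}_\kappa$ tries to minimize TD error, while the weight $| \tau - \mathbb{I}\{ \delta_{ij} < 0 \} |$ penalizes positive and negative errors differently depending on $\tau$. That asymmetry is what pushes the network's output for a given $\tau$ toward the corresponding quantile of the target distribution instead of toward its mean.</p>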
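<p>Written out as plain Python, the loss for a single TD error might look like the sketch below. It assumes the quadratic branch of the Huber loss applies when $| \delta | \leq \kappa$; the function names are my own and the default $\kappa = 1$ is just a convenient value for the example.</p>

```python
def huber(delta: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic for small errors, linear for large ones."""
    if abs(delta) <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs(delta) - 0.5 * kappa)


def quantile_huber(delta: float, tau: float, kappa: float = 1.0) -> float:
    """Huber quantile loss for one TD error `delta` at quantile level `tau`."""
    # The indicator is 1 when the TD error is negative, 0 otherwise.
    indicator = 1.0 if delta < 0 else 0.0
    weight = abs(tau - indicator)
    return weight * huber(delta, kappa) / kappa


# At tau = 0.9, a positive error costs nine times as much as a negative
# error of the same size.
print(quantile_huber(0.5, tau=0.9))
print(quantile_huber(-0.5, tau=0.9))
```

<p>For $\tau = 0.5$ the two weights are equal and the loss reduces to half of the ordinary Huber loss (divided by $\kappa$).</p>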
<p>This loss metric is based on the TD error $\delta_{ij}$, which we can define just like normal TD error</p>
$$
\delta_{ij} = r + \gamma Z_{\tau_j}(s', \pi_\beta(s')) - Z_{\tau_i}(s, a)
$$<p>Notice how in this definition, $i$ and $j$ index two separate $\tau$ samples from the $U([0, 1])$ distribution: $\tau_i$ for the current estimate and $\tau_j$ for the target. We use two separate $\tau$ samples to keep the terms in the TD error definition decorrelated. To get a more accurate estimation of the loss, we'll draw $N$ samples $\tau_i$ and $N'$ samples $\tau_j$ and sum over every pair in the following fashion</p>
$$
\mathcal{L} = \frac{1}{N'} \sum_{i=1}^N \sum_{j=1}^{N'} \rho^\kappa_{\tau_i}(\delta_{ij})
$$<p>where every $\tau_i$ and $\tau_j$ is sampled independently from $U([0, 1])$.</p>
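<p>Here is a rough sketch of that double sum. I'm assuming the convention where $\tau_i$ indexes the current estimate $Z_{\tau_i}(s, a)$ and $\tau_j$ indexes the target $Z_{\tau_j}(s', \pi_\beta(s'))$. The two Gaussian quantile functions <code>z_current</code> and <code>z_next</code> are hypothetical stand-ins for network outputs, and the values of <code>reward</code>, <code>gamma</code>, <code>n</code>, and <code>n_prime</code> are arbitrary.</p>

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def quantile_huber(delta: float, tau: float, kappa: float = 1.0) -> float:
    """Huber quantile loss for a single TD error."""
    if abs(delta) <= kappa:
        huber = 0.5 * delta * delta
    else:
        huber = kappa * (abs(delta) - 0.5 * kappa)
    return abs(tau - (1.0 if delta < 0 else 0.0)) * huber / kappa


def z_current(tau: float) -> float:
    """Hypothetical stand-in for the network's Z_tau(s, a)."""
    return 0.0 + 1.0 * STD_NORMAL.inv_cdf(tau)


def z_next(tau: float) -> float:
    """Hypothetical stand-in for the target's Z_tau(s', pi_beta(s'))."""
    return 1.0 + 1.0 * STD_NORMAL.inv_cdf(tau)


def sample_tau(rng: random.Random) -> float:
    """tau ~ U([0, 1)), nudged away from 0 where the Gaussian quantile diverges."""
    return max(rng.random(), 1e-12)


def iqn_loss(reward: float, gamma: float, n: int, n_prime: int,
             rng: random.Random, kappa: float = 1.0) -> float:
    taus = [sample_tau(rng) for _ in range(n)]              # current estimate
    taus_prime = [sample_tau(rng) for _ in range(n_prime)]  # target
    total = 0.0
    for tau_i in taus:
        for tau_j in taus_prime:
            delta = reward + gamma * z_next(tau_j) - z_current(tau_i)
            total += quantile_huber(delta, tau_i, kappa)
    return total / n_prime


loss = iqn_loss(reward=0.5, gamma=0.99, n=8, n_prime=8, rng=random.Random(0))
print(f"sampled loss: {loss:.4f}")
```

<p>In a real implementation the target term would be treated as a constant (no gradient flows through it), and the loops would be replaced by a batched pairwise difference.</p>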
<p>Finally, we'll approximate $\pi_\beta$, which we defined earlier, using a similar sampling technique</p>
$$
\tilde{\pi}_\beta(s) = \arg\max\limits_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^K Z_{\beta(\tau_k)}(s, a)
$$<p>where $\tau_k$ is newly sampled every time as well.</p>
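<p>As a small illustration of how the choice of $\beta$ changes behavior, the sketch below estimates the distorted Q-value with $K$ samples for two made-up actions. The Gaussian return distributions in <code>RETURNS</code> are invented for the example, and the risk-averse distortion $\beta(\tau) = \eta \tau$ (which only looks at the lower part of the distribution) is one possible choice I'm assuming here, with $\eta = 0.25$ picked arbitrarily.</p>

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Hypothetical return distributions for a single state: action -> (mean, std).
RETURNS = {"safe": (1.0, 0.1), "risky": (2.0, 2.0)}


def z_tau(action: str, tau: float) -> float:
    """Stand-in for the network output Z_tau(s, a): a Gaussian quantile function."""
    mean, std = RETURNS[action]
    return mean + std * STD_NORMAL.inv_cdf(tau)


def identity(tau: float) -> float:
    """Risk-neutral: leave tau untouched."""
    return tau


def lower_tail(tau: float, eta: float = 0.25) -> float:
    """Risk-averse distortion: squeeze tau into [0, eta]."""
    return eta * tau


def greedy_action(beta, k: int, rng: random.Random) -> str:
    """Pick the action with the highest K-sample estimate of the distorted Q-value."""
    def q_estimate(action: str) -> float:
        # Keep tau strictly positive so the Gaussian quantile stays finite.
        taus = [max(rng.random(), 1e-12) for _ in range(k)]
        return sum(z_tau(action, beta(t)) for t in taus) / k

    return max(RETURNS, key=q_estimate)


print(greedy_action(identity, k=2000, rng=random.Random(0)))
print(greedy_action(lower_tail, k=2000, rng=random.Random(0)))
```

<p>With the identity mapping the agent prefers the action with the higher mean return, while the lower-tail distortion makes it prefer the action whose bad outcomes are less bad.</p>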
<p>That was a lot, but it's all we need to make an IQN. We could spend time thinking about different choices of $\beta$, but that's really a choice that depends on your specific environment. And during implementation, you can just decide that $\beta$ will be the identity function and then change it later if you think you can get better performance with risk-aware action selection.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Review">Review<a class="anchor-link" href="#Review">¶</a></h1><p>We started off with the more obvious way of implementing distributional deep Q-learning, which was explicitly representing the return distribution. Although it worked well, using an explicit representation of the return distribution created an accuracy bottleneck that was hard to overcome. It was also difficult to inject risk-sensitivity into the algorithm.</p>
<p>Using an implicit distribution instead allowed us to get around those two problems, giving us much greater representational power and allowing us much greater control over how our agent handles risk.</p>
<p>Of course, there's always room for improvement. Small techniques like using prioritized experience replay and n-step returns for calculating TD error can be used to make the IQN algorithm more powerful. And since distributional RL is still a pretty new field, there will no doubt be major improvements coming down the academia pipeline to be on the lookout for.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Footnotes">Footnotes<a class="anchor-link" href="#Footnotes">¶</a></h1><p><sup>1</sup> see paper #1, section 6.1 for a short discussion of what the paper authors call 'chattering'</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Inaugural Post2019-02-16T00:00:00-05:002019-02-16T00:00:00-05:00Daniel Coxtag:computable.ai,2019-02-16:/articles/2019/Feb/16/inaugural-post.html<p>The purpose statement and introduction to Computable AI.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This post begins the Computable AI blog, a machine intelligence blog from a handful of DRL practitioners, intended to crystalize, internalize, share, and explain.</p>
<p>I found few beginner resources for DRL when I began, and since I have a passion for teaching, this seemed a likely area in which to make a dent.</p>
<p>I also serve as the "Director of Applied Sciences" for a startup software company, and the AI team must occasionally indoctrinate new members. This provides us with a convenient target audience, as well as an expanding pool of co-authors.</p>
<p>Finally, my own education in DRL is incomplete, so this will serve partly as a record of my own journey.</p>
<p>I hope it helps you.</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>