<p>Computable AI: A Machine Intelligence Blog (https://computable.ai/), updated 2019-08-18</p>
<p>arXiv highlights Aug 11-17 2019, by Daniel Cox, 2019-08-18 (tag:computable.ai,2019-08-18:/articles/2019/Aug/18/arxiv-highlights-aug-11-17-2019.html)</p>
<p>DRL may not be superhuman on Atari after all, and how to avoid making mistakes like that in the future.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Just a sketch this week, calling your attention to <a href="https://arxiv.org/abs/1908.04683v1">Is Deep Reinforcement Learning Really Superhuman on Atari?</a>, which concludes not only that DRL is worse than the best humans on most Atari games, but by a <em>wide</em> margin.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="DRL-isn't-superhuman-on-Atari-yet">DRL <em>isn't</em> superhuman on Atari yet<a class="anchor-link" href="#DRL-isn't-superhuman-on-Atari-yet">¶</a></h1><p>Wait, what? I was quite skeptical of this claim. Mnih et al. published the groundbreaking <a href="https://arxiv.org/abs/1312.5602">Playing Atari with Deep Reinforcement Learning</a> in <em>2013</em>, claiming superhuman performance. Surely someone would have noticed by now?</p>
<p>Apparently not. For the next six years, most DRL work compared against either the same human scores reported in that paper, or against human beginners. It's true that DQN significantly outperformed its authors' own human player, but that player was far from <em>the best in the world</em>. Other recent claims of superhuman performance have been tested against the best players in the world (the paper mentions AlphaGo against Lee Sedol, OpenAI Five against OG, and AlphaStar against MaNa), but the claims for the Atari benchmark have not.</p>
<p>The most striking detail to me in this paper involved the common "normalized human score", where 0% is the score of a random agent, and 100% is the score of the human baseline. <em>On this scale, the median score achieved by the world record holders across all Atari games is 4.4k%</em>. Clearly you can't claim superhuman performance if there are humans who beat your target by a factor of 44, unless your agent exceeds that score too.</p>
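<p>To make the scale concrete, here is a minimal sketch of the normalization. The function name and the scores are my own made-up illustration, not numbers from the paper.</p>

```python
def normalized_score(agent: float, random: float, human: float) -> float:
    """Return a score as a percentage of the human baseline.

    0% corresponds to the random agent and 100% to the human baseline.
    """
    if human == random:
        raise ValueError("human and random baselines must differ")
    return 100.0 * (agent - random) / (human - random)


# Hypothetical scores for a single game, chosen only for illustration.
random_score = 100.0
baseline_human = 1_100.0
world_record = 45_100.0

print(normalized_score(baseline_human, random_score, baseline_human))  # 100.0
print(normalized_score(world_record, random_score, baseline_human))    # 4500.0
```

<p>In this toy case the record holder sits at 4500% of the baseline human, so an agent at 200% would still be far from the best human play.</p>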
<p>For reference, the original Rainbow algorithm achieved a median of 200% over all Atari games, and other algorithms seem to do worse. If the normalized score is instead rescaled so that 100% equals the human world record for each game, and agents are run with different time limits, a tuned IQN variant of Rainbow receives a median score of less than 4%. (There were other problems with the way benchmarks were done, and correcting for them reduces performance even further.)</p>
<p>We have a long way to go, then. The paper has a useful analysis, drawing on both previous and original research, of <em>why</em> DRL algorithms are so bad at Atari. Some of the culprits, such as reward clipping, were explicitly chosen in previous research to improve performance. Reward clipping, to take that example, has been noted to make the agent prefer many small rewards over a single large one.</p>
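<p>The effect of reward clipping is easy to see in a toy example. The sketch below assumes clipping to the range [-1, 1] and uses two made-up episodes; the names and numbers are hypothetical.</p>

```python
def clip_reward(reward: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clip a raw game reward into [low, high]."""
    return max(low, min(high, reward))


# Two hypothetical episodes with made-up rewards.
many_small = [10.0] * 5   # five small pickups
one_large = [1000.0]      # one big bonus

raw_small, raw_large = sum(many_small), sum(one_large)
clipped_small = sum(clip_reward(r) for r in many_small)
clipped_large = sum(clip_reward(r) for r in one_large)

print(raw_small, raw_large)          # 50.0 1000.0
print(clipped_small, clipped_large)  # 5.0 1.0
```

<p>On raw score the single large bonus is worth twenty times more, but after clipping the episode with many small pickups looks five times better to the agent.</p>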
<p>I encourage anyone working with the Atari benchmark to read the paper for themselves.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I actually find it somewhat personally encouraging that there's room for improvement on Atari. It's easy to experiment, and I have some ideas myself.</li>
<li>That said, it is rather scary that we could overlook something like this for so long, as a community.</li>
<li>Anyway, <em>someone</em> will take this as a call to arms, and make progress. Peter Drucker said, "If you can't measure it, you can't improve it." Now that we have better measurements, I predict improvements.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<p>Deep RL Fundamentals #0: What is Deep RL and Why It's Worth Learning, by Andrew Farabow, 2019-08-14 (tag:computable.ai,2019-08-14:/articles/2019/Aug/14/deep-rl-fundamentals-0-what-is-deep-rl-and-why-its-worth-learning.html)</p>
<p>An introduction and statement of purpose for a series on the basics of deep reinforcement learning</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Intro">Intro<a class="anchor-link" href="#Intro">¶</a></h3><p>Deep reinforcement learning - the fusion of trial-and-error learning and function-approximating neural networks - is one of the hottest areas of machine learning research right now and the subject of much excitement. I believe this is largely because it resembles the endgame of AI research, artificial general intelligence, in a way that neither supervised nor unsupervised learning does. There is, however, a prevailing attitude that RL is not ready to be put to use in practical scenarios and instead belongs solely in the laboratories of universities and tech giants, conquering toy challenges and video games one at a time until it is ready to emerge. While the many present shortcomings of Deep RL (some of which I will discuss later in the series) provide good evidence for this viewpoint, I think Deep RL is ready to tackle many real-world challenges, and getting hobbyists and companies involved sooner rather than later would accelerate development. My immediate purposes for writing are to explain what reinforcement learning is and to kick off my post series about the major RL algorithms, but ultimately I want to encourage others to begin hacking away with DRL and try applying it to real-world problems.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="What-is-it?">What is it?<a class="anchor-link" href="#What-is-it?">¶</a></h3><p>Reinforcement learning algorithms attempt to attain a goal by taking actions in their environment, assessing their performance, and altering their behavior. The performance assessment comes in the form of a reward signal derived from the environment. This could be the score of a Pong game, the time since a humanoid robot last fell over, or a simple binary measure of whether a self-driving car has taken its passenger to the destination successfully or not. In order to maximize reward, any approach to reinforcement learning must have some structure to choose the correct action. In deep reinforcement learning, this is one or more neural networks. Neural networks are ideal because of their ability to generalize in complex, high-dimensional environments. Depending on the approach taken, they can take in the current state of the environment and output either an action to take or the desirability of a certain state.</p>
<p>In this post I use reinforcement learning (RL) and deep reinforcement learning (DRL) interchangeably, even though they refer to slightly different concepts. As Sutton and Barto put it:</p>
<blockquote><p>Reinforcement learning is like many topics with names ending in -ing, such as machine learning, planning, and mountaineering, in that it is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods.</p>
</blockquote>
<p>Deep reinforcement learning is one such type of solution method that utilizes neural networks. Other RL solutions exist, including dynamic programming and tabular reinforcement learning, which uses lookup tables instead of neural networks to record value estimates for the states (or state-action pairs) it has encountered.</p>
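<p>As a rough illustration of the tabular approach, here is a minimal Q-learning sketch on a toy five-state chain. The environment, the hyperparameters, and every name below are my own assumptions for illustration; deep RL replaces the table <code>q</code> with a neural network.</p>

```python
import random
from typing import List, Tuple

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy deterministic chain: reward 1.0 only when the last state is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes: int = 500, max_steps: int = 200, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    # The lookup table: one value estimate per (state, action) pair.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Q-learning is off-policy, so a uniformly random behaviour policy is enough here.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
# Greedy action for each non-terminal state, read straight out of the table.
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

<p>The agent observes a state, takes an action, receives a reward, and nudges one table entry toward a bootstrapped target. That loop is the same one a deep RL agent runs, only with far more states than a table could hold.</p>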
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Drawbacks">Drawbacks<a class="anchor-link" href="#Drawbacks">¶</a></h3><p>Modern reinforcement learning is not without its shortcomings. The foremost among these is sample efficiency - the number of times that an algorithm must observe a state, take an action, and improve is currently crippling for many use cases. Atari games that take humans minutes to pick up take state-of-the-art DRL algorithms millions of frames to master $^{1}$. In addition, most reinforcement learning algorithms assume that the environment is a Markov decision process: the current state alone, without any history, is enough to determine the optimal action. In practice this means assuming that a single observation captures the full state, which is not true of many real-life problems that people would want to solve with RL. While recurrent and convolutional neural networks can help, they come at the cost of even worse sample efficiency.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="What’s-next?">What’s next?<a class="anchor-link" href="#What’s-next?">¶</a></h3><p>While I have experience working on a deep reinforcement learning-powered product, there are many areas in which my knowledge is lacking. In writing this post series, I hope to fill in some of those gaps. In the next post, I plan on elaborating upon the standard formulation of reinforcement learning (as described in the beginning of nearly every RL paper) and covering the major traits that differentiate approaches to DRL. I will be using Sutton and Barto's Reinforcement Learning as my primary source and I recommend that anyone who is interested pick up a copy or <a href="http://incompleteideas.net/book/the-book-2nd.html">read the free online version</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>$^{1}$ <a href="https://arxiv.org/abs/1710.02298">Rainbow: Combining Improvements in Deep Reinforcement Learning</a></p>
</div>
</div>
</div>
<p>Equivalence between Policy Gradients and Soft Q-Learning, by Braden Hoagland, 2019-08-12 (tag:computable.ai,2019-08-12:/articles/2019/Aug/12/equivalence-between-policy-gradients-and-soft-q-learning.html)</p>
<p>Inspecting the gradients of entropy-augmented policy updates to show their equivalence</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">¶</a></h1><p>This article will dive into a lot of the math surrounding the gradients of different maximum entropy RL methods. In practice we usually work in the space of objective functions: with both policy gradients and Q-learning, we'll form an objective function and allow an autodiff library to calculate the gradients for us. We never have to see what's going on behind the scenes, which has its pros and cons. A benefit is that working with objective functions is much easier than calculating gradients by hand. On the other hand, it's easy to lose sight of what's really going on when we work at such an abstract level.</p>
<p>This abstraction issue is tackled in the paper <code>Equivalence Between Policy Gradients and Soft Q-Learning</code> (<a href="https://arxiv.org/abs/1704.06440">https://arxiv.org/abs/1704.06440</a>), and I think it provides some pretty eye-opening insights into what the most common RL algorithms are really doing. I'll be working off of version 4 of the paper from Oct. 2018, the most recent version of the paper at the time of writing.</p>
<p>First I'll walk through some of the basic definitions in the max-entropy RL setting, then I'll pick out the most important bits of math from the paper that show how entropy-augmented Q-learning is really just a policy gradient method.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Maximum-Entropy-RL-and-the-Boltzmann-Policy">Maximum Entropy RL and the Boltzmann Policy<a class="anchor-link" href="#Maximum-Entropy-RL-and-the-Boltzmann-Policy">¶</a></h1><p>In standard RL, we try to maximize expected cumulative reward $\mathbb{E}[\sum_t r_t]$. In the max-entropy setting, we augment this reward signal with an entropy bonus. The expected cumulative reward of a policy $\pi$ is commonly denoted as $\eta(\pi)$</p>
\begin{align*}
\eta(\pi) &= \mathbb{E} \Big[ \sum_t (r_t + \alpha \mathcal{H}(\pi)) \Big] \\
&= \mathbb{E} \Big[ \sum_t \big( r_t - \alpha \log\pi(a_t | s_t) \big) \Big]
\end{align*}<p>where $\pi$ is our current policy and $\alpha$ weights how important the entropy is in our reward definition. This intuitively makes the reward seem higher when our policy exhibits high entropy, allowing it to explore its environment more extensively. A key component of this augmented objective is that the entropy is <em>inside</em> the sum. Thus an optimal policy will not only try to act with high entropy <em>now</em>, but will act in such a way that it finds highly-entropic states in the <em>future</em>.</p>
<p>The paper uses slightly different notation, opting to use KL divergence (AKA "relative entropy") instead of just entropy. This uses a reference policy $\bar{\pi}$, which can be thought of as an old, worse policy that we wish to improve on</p>
\begin{align*}
\eta(\pi) &= \mathbb{E} \Big[ \sum_t \big( r_t - \alpha \log\pi(a_t|s_t) + \alpha \log\bar{\pi}(a_t|s_t) \big) \Big] \\
&= \mathbb{E} \Big[ \sum_t \big(r_t - \alpha D_{KL}(\pi \,\Vert\, \bar{\pi}) \big) \Big]
\end{align*}<p>In the max-entropy setting, optimal policies are stochastic and proportional to the exponential of the optimal Q-function. This can be expressed formally as</p>
$$ \pi^* \propto e^{Q^*(s,a)} $$<p>If this doesn't seem very intuitive, I would recommend a quick scan of the article <a href="https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/">https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/</a>. It offers a brief introduction to max-entropy RL (specifically for Q-learning) and some helpful intuitions as to why the above relationship is a good property for a policy to have.</p>
<p>To actually get a policy in this form, we'll change up the definition slightly</p>
$$
\pi = \frac{\bar{\pi} \, e^{Q(s,a) / \alpha}}{\mathbb{E}_{\bar{a}\sim\bar{\pi}} [e^{Q(s,\bar{a}) / \alpha}]}
$$<p>The numerator of this expression is simply stating that we want our new policy to be like our old policy, but slightly in the direction of $e^Q$. If $\alpha$ is higher (i.e. we want more entropy), we move less in the direction of $e^Q$. The denominator is a normalization constant that ensures that our entire expression is still a valid probability distribution (i.e. the sum over all possible actions comes out to 1).</p>
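<p>You may have noticed that the denominator of our policy plays the role of $e^{V(s)/\alpha}$: in the max-entropy setting the value function is a soft (log-sum-exp) average of $Q$ under $\bar{\pi}$ rather than the plain expectation $\mathbb{E}_{a}[Q]$. We'll use this definition to simplify our policy</p>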
\begin{align*}
V(s) &= \alpha \log \mathbb{E}_{a\sim\bar{\pi}} \big[ e^{Q(s,a)/\alpha} \big] \\
\pi &= \bar{\pi} \, e^{(Q(s,a) - V(s)) / \alpha}
\end{align*}<p>This new policy definition shows more directly that our policy is proportional to the exponential of the advantage. If our policy is proportional to $e^Q$, it should also be proportional to $e^A$, so this makes sense. From now on, we'll refer to this policy as the 'Boltzmann Policy' and denote it $\pi^B$.</p>
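<p>For a single state with a handful of discrete actions, the two definitions above can be computed directly. This is only a sketch: the function name and the example numbers are hypothetical, and subtracting the maximum before exponentiating is a standard numerical trick rather than something the paper prescribes.</p>

```python
import math
from typing import List, Tuple


def boltzmann_policy(q: List[float], ref: List[float], alpha: float) -> Tuple[List[float], float]:
    """Return (pi_B, V) for one state with discrete actions.

    q[a]   : Q(s, a)
    ref[a] : reference policy pi_bar(a | s)
    alpha  : weight of the entropy (KL) term
    """
    # Subtract the max before exponentiating for numerical stability.
    q_max = max(q)
    partition = sum(p * math.exp((qa - q_max) / alpha) for qa, p in zip(q, ref))
    # V(s) = alpha * log E_{a ~ pi_bar}[exp(Q(s, a) / alpha)]
    value = q_max + alpha * math.log(partition)
    # pi_B(a | s) = pi_bar(a | s) * exp((Q(s, a) - V(s)) / alpha)
    policy = [p * math.exp((qa - value) / alpha) for qa, p in zip(q, ref)]
    return policy, value


q_values = [1.0, 2.0, 3.0]          # made-up Q-values
uniform = [1.0 / 3.0] * 3           # made-up reference policy

for alpha in (0.5, 5.0):
    policy, value = boltzmann_policy(q_values, uniform, alpha)
    print(alpha, [round(p, 3) for p in policy], round(value, 3))
```

<p>With the smaller $\alpha$ the policy concentrates on the highest-valued action, while the larger $\alpha$ keeps it close to the uniform reference policy, matching the intuition above.</p>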
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-with-Boltzmann-Backups">Soft Q-Learning with Boltzmann Backups<a class="anchor-link" href="#Soft-Q-Learning-with-Boltzmann-Backups">¶</a></h1><p>From this point onward, there will inevitably be sections of math that seem to leave out non-trivial amounts of work. This is because I think this paper mainly benefits our intuitions about RL. The math proves these new intuitions, but by itself is hard to read. If you're curious and wish to go through all the derivations, I would highly recommend working through the full paper on your own. With that disclaimer out of the way, we can get started...</p>
<p>With normal Q-learning, we define our backup operator $\mathcal{T}$ as follows
$$
\mathcal{T}Q = \mathbb{E}_{r,s'} \big[ r + \gamma \mathbb{E}_{a'\sim\pi}[Q(s', a')] \big]
$$</p>
<p>In the max-entropy setting, we'll have to add in an entropy bonus to the reward signal and simplify accordingly</p>
\begin{align*}
\mathcal{T}Q &= \mathbb{E}_{r,s'} \Big[ r + \gamma \Big( \mathbb{E}_{a'\sim\pi}[Q(s', a')] - \alpha D_{KL} \big( \pi(\cdot|s') \;\Vert\; \bar{\pi}(\cdot|s') \big) \Big) \Big] \\
&= \mathbb{E}_{r,s'} \big[ r + \gamma \alpha \log \mathbb{E}_{a'\sim\bar{\pi}}[e^{Q(s',a')/\alpha}] \big]
\end{align*}<p>See equations 11 and 13 from the paper (which rely on equations 2-6) if you want to see exactly how that simplification works. To actually perform the optimization step $Q \gets \mathcal{T}Q$, we'll minimize the mean squared error between our current $Q$ and an estimate of $\mathcal{T}Q$. Our regression targets can be defined as</p>
\begin{align*}
y &= r + \gamma \alpha \log \mathbb{E}_{a'\sim\bar{\pi}} \big[ e^{Q(s', a') / \alpha} \big] \\
&= r + \gamma V(s')
\end{align*}<p>Using Boltzmann backups instead of the traditional Q-learning backups is what transforms normal Q-learning into what's conventionally called "soft" Q-learning. That's really all there is to it.</p>
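<p>Here is a small sketch of that regression target next to the usual "hard" target for comparison. The function names are hypothetical, and the handling of terminal transitions (<code>done</code>) is a common convention that I am assuming, not something taken from the paper.</p>

```python
import math
from typing import List


def soft_value(q_next: List[float], ref: List[float], alpha: float) -> float:
    """V(s') = alpha * log E_{a' ~ pi_bar}[exp(Q(s', a') / alpha)], computed stably."""
    q_max = max(q_next)
    return q_max + alpha * math.log(
        sum(p * math.exp((q - q_max) / alpha) for q, p in zip(q_next, ref))
    )


def soft_q_target(reward: float, q_next: List[float], ref: List[float],
                  alpha: float, gamma: float = 0.99, done: bool = False) -> float:
    """Boltzmann backup target y = r + gamma * V(s'); just r on a terminal step (assumed)."""
    return reward if done else reward + gamma * soft_value(q_next, ref, alpha)


def hard_q_target(reward: float, q_next: List[float],
                  gamma: float = 0.99, done: bool = False) -> float:
    """Standard Q-learning target with a max over next actions, for comparison."""
    return reward if done else reward + gamma * max(q_next)


q_next = [1.0, 2.0, 3.0]       # made-up Q(s', a') values
uniform = [1.0 / 3.0] * 3      # made-up reference policy

print(hard_q_target(1.0, q_next, gamma=0.9))
for alpha in (1.0, 0.1, 0.01):
    print(alpha, soft_q_target(1.0, q_next, uniform, alpha, gamma=0.9))
```

<p>As $\alpha$ shrinks, the soft target approaches the hard one from below, which is one way to see soft Q-learning as a smoothed version of the usual max backup.</p>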
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Policy-Gradients-and-Entropy">Policy Gradients and Entropy<a class="anchor-link" href="#Policy-Gradients-and-Entropy">¶</a></h1><p>I'm assuming you have a solid grasp of policy gradients if you're reading this article, so I'm gonna focus on how they usually aren't applied correctly in the max-entropy setting. PG methods are commonly augmented with an entropy term, like with the following example provided from the paper</p>
$$
\mathbb{E}_{t, s,a} \Big[ \nabla_\theta \log\pi_\theta(a|s) \sum_{t' \geq t} r_{t'} - \alpha \nabla_\theta D_{KL}\big (\pi_\theta(\cdot|s) \;\Vert\; \bar{\pi}(\cdot|s) \big) \Big]
$$<p>This example essentially tries to maximize reward-to-go plus an entropy bonus for only the <em>current</em> timestep. Maximizing this objective technically isn't what we want, even if it's common practice. What we really want is to maximize a sum over all rewards and entropies that our agent experiences from now into the future.</p>
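<p>The difference between the two choices shows up even on a three-step toy trajectory. The sketch below only compares the scalar signal each formulation credits to a timestep, not the full gradient estimators, and the rewards, entropies, and $\alpha$ are made up.</p>

```python
from typing import List


def reward_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each timestep to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards = [0.0, 0.0, 1.0]        # hypothetical rewards
entropies = [0.5, 0.25, 0.125]   # hypothetical per-step entropy bonuses
alpha = 0.5

# Common practice: reward-to-go, plus a bonus for the current timestep only.
immediate = [g + alpha * h for g, h in zip(reward_to_go(rewards), entropies)]

# Max-entropy objective: the bonus sits inside the sum, so future bonuses count too.
future = reward_to_go([r + alpha * h for r, h in zip(rewards, entropies)])

print(immediate)  # [1.25, 1.125, 1.0625]
print(future)     # [1.4375, 1.1875, 1.0625]
```

<p>The two agree at the last timestep and drift apart earlier in the trajectory, where the second version also rewards the agent for reaching states in which it can keep acting with high entropy.</p>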
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-=-Policy-Gradient">Soft Q-Learning = Policy Gradient<a class="anchor-link" href="#Soft-Q-Learning-=-Policy-Gradient">¶</a></h1><p>The first of two conclusions that this paper comes to is that Soft Q-Learning and the Policy Gradient have exact first-order equivalence. Using the value function and Boltzmann policy definitions from earlier, we can derive the gradient of $\mathbb{E}_{s,a} \big[ \frac{1}{2} \Vert Q_\theta(s,a) - y \Vert^2 \big]$. The paper is able to produce the following expression</p>
$$
\mathbb{E}_{s,a} \Big[ \color{red}{-\alpha \nabla_\theta \log\pi_\theta(a|s) \Delta_{TD} + \alpha^2 \nabla_\theta D_{KL}\big( \pi_\theta(\cdot|s) \;\Vert\; \bar{\pi}(\cdot|s) \big)} + \color{blue}{\nabla_\theta \frac{1}{2} \Vert V_\theta(s) - \hat{V} \Vert^2} \Big]
$$<p>where $\Delta_{TD}$ is the discounted n-step TD error and $\hat{V}$ is the value regression target formed by $\Delta_{TD}$.</p>
<p>That's kind of a lot, but we can break it down pretty easily. The terms in red represent 1) the usual policy gradient and 2) an additional KL divergence gradient term. The red terms overall represent the gradient you get if you use a policy gradient algorithm with a KL divergence term as your entropy bonus (the actor loss in an actor-critic formulation). The term in blue is quite simply the gradient used to minimize the mean squared error between our current value estimates and our value targets (the critic loss in an actor-critic formulation).</p>
<p>Don't forget that we never explicitly tried to calculate these terms. They came about naturally as an effect of minimizing the mean squared error between our Q function and a Boltzmann backup target.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Soft-Q-Learning-and-the-Natural-Policy-Gradient">Soft Q-Learning and the Natural Policy Gradient<a class="anchor-link" href="#Soft-Q-Learning-and-the-Natural-Policy-Gradient">¶</a></h1><p>The next section of the paper details another connection between Soft Q-learning and policy gradient methods, specifically that damped Q-learning updates are exactly equivalent to natural policy gradient updates.</p>
<p>The natural policy gradient preconditions the policy gradient with the inverse of the Fisher information matrix $\mathbb{E}_{s,a} \Big[ \big( \nabla_\theta \log\pi_\theta(a|s) \big)^T \big( \nabla_\theta \log\pi_\theta(a|s) \big) \Big]$. The paper shows that the natural policy gradient in the max-entropy setting is equivalent not to soft Q-learning by itself, but instead to a damped version. In this damped version, we calculate a backed-up Q value and then interpolate between it and the current Q value estimate (basically using Polyak averaging instead of running gradient descent on a mean squared error term).</p>
<p>Although not nearly as direct, this connection highlights how higher-order connections between soft Q-learning and policy gradient methods exist. Higher-order equalities between functions point to functions that are increasingly similar, so this connection really drives the point home that soft Q-learning is deceptively like the policy gradient methods we've been using all this time.</p>
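<p>The damping step itself is just an interpolation. A minimal sketch, assuming Q-values stored in a plain list and a hypothetical <code>step_size</code> parameter (how the backed-up values are computed is left out here):</p>

```python
from typing import List


def damped_update(q_current: List[float], q_backup: List[float], step_size: float) -> List[float]:
    """Move each Q estimate a fraction `step_size` of the way toward its backed-up value."""
    return [(1.0 - step_size) * q + step_size * b for q, b in zip(q_current, q_backup)]


q_current = [0.0, 4.0]   # made-up current estimates
q_backup = [8.0, 0.0]    # made-up backed-up values

print(damped_update(q_current, q_backup, 0.25))  # [2.0, 3.0]
print(damped_update(q_current, q_backup, 1.0))   # [8.0, 0.0]
```

<p>A step size of 1 recovers the undamped backup, and smaller values give the slower, averaged update described above.</p>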
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Experimental-Results">Experimental Results<a class="anchor-link" href="#Experimental-Results">¶</a></h1><p>The paper authors decided to be nice to us and actually test the theory they derived on some Atari games.</p>
<p>They started out by testing whether the usual way of adding entropy bonuses to policy gradient methods is actually worse than the version their theory recommends. As it turns out, using future entropy bonuses $\Big( \text{i.e. } \big( \sum r + \mathcal{H} \big) \Big)$ instead of the simpler, immediate entropy bonus $\Big( \text{i.e. } \big( \sum r \big) + \mathcal{H} \Big)$ results in either similar or superior performance. The graphs below show the results from the experiments, with the future entropy version in blue and the immediate entropy version in red.</p>
<p><img src="https://computable.ai/images/proper_entropy.png" alt="image.png"></p>
<p>They then tested how soft Q-learning compared to normal Q-learning. To make traditional DQN into soft Q-learning, they just modified the regression targets for the Q function. They used the normal target, a target with a KL divergence penalty, and a target with just an entropy bonus. They found that just the entropy bonus resulted in the most improvement, although both soft methods outperformed the "hard" DQN.</p>
<p><img src="https://computable.ai/images/q_hard_soft.png" alt="image.png"></p>
<p>To round things out, they tested soft Q-learning and the policy gradient on the same Atari environments to see if they were equivalent in practice. After all, the math shows that their expected updates are equivalent, but the variance of those updates could be different. The experiments they ran suggest that the two methods are pretty close to each other, with neither method clearly superior.</p>
<p><img src="https://computable.ai/images/pg_ql.png" alt="image.png"></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Conclusion-and-Future-Work">Conclusion and Future Work<a class="anchor-link" href="#Conclusion-and-Future-Work">¶</a></h1><p>Hopefully this made you reconsider what's really going on under the hood with Q-learning. Personally, it blew my mind that two seemingly disparate learning methods could boil down to the same expected update. The theoretical possibilities that this connection could lead to are also incredibly exciting.</p>
<p>Of course, this paper focuses its empirical testing just on environments with discrete action spaces. Since the Boltzmann policy is intractable to sample from in continuous action spaces, more advanced soft Q-learning algorithms (such as Soft Actor-Critic) are currently being pioneered to get accurate results in those more complicated settings as well.</p>
</div>
</div>
</div>
<p>arXiv highlights Aug 4-10 2019, by Daniel Cox, 2019-08-11 (tag:computable.ai,2019-08-11:/articles/2019/Aug/11/arxiv-highlights-aug-4-10-2019.html)</p>
<p>Traffic signal control comparing supervised learning, random search, and deep reinforcement learning</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's paper, <a href="https://arxiv.org/abs/1908.02673v1">Large-scale traffic signal control using machine learning: some traffic flow considerations</a>, caught my eye for several reasons. First, traffic signal control is relevant to my own group's work involving microservice and network traffic management. Second, the authors use cellular automaton rule 184 as their traffic model, which is the first time I've seen a cellular automaton used for something serious since <a href="https://www.wolframscience.com/nks/">A New Kind of Science</a>, despite that book's claim about the likely broad usefulness of simple programs for complex purposes. Lastly, the authors find that supervised learning and random search outperform deep reinforcement learning at high occupancies of the traffic flow network,</p>
<blockquote><p>For occupancies > 75% during training, DRL policies perform very poorly for all traffic conditions, which means that DRL methods cannot learn under highly congested conditions.</p>
</blockquote>
<p>and that they recommend practitioners <em>throw away</em> congested data!</p>
<blockquote><p>Our findings imply that it is advisable for current DRL methods in the literature to discard any congested data when training, and that doing this will improve their performance under all traffic conditions.</p>
</blockquote>
<p>I also have to admit that I've thought to myself, waiting at empty intersections for a light to turn green, that I could just <em>solve</em> this problem with DRL. If I'm wrong, that would be very interesting and surprising.</p>
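<p>For readers who haven't met rule 184 before, here is a minimal sketch of it on a single ring road. The periodic boundary and the starting configuration are my own simplifications for illustration; the paper's network of signalized intersections is more involved.</p>

```python
from typing import List

RULE = 184  # binary 10111000: one output bit per (left, centre, right) neighbourhood


def step(cells: List[int]) -> List[int]:
    """Advance an elementary cellular automaton one tick on a ring of cells."""
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighbourhood = 4 * left + 2 * centre + right
        # Read bit number `neighbourhood` of the rule number.
        out.append((RULE // 2 ** neighbourhood) % 2)
    return out


road = [1, 1, 0, 0, 1, 0, 0, 0]  # 1 = car, 0 = empty cell (made-up start)
for _ in range(4):
    print("".join("#" if c else "." for c in road))
    road = step(road)
```

<p>Read as traffic, each car moves one cell to the right whenever the cell ahead is empty and otherwise waits, so the number of cars never changes while jams form and dissolve.</p>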
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Considerations-in-a-nutshell">Considerations in a nutshell<a class="anchor-link" href="#Considerations-in-a-nutshell">¶</a></h1><p>The introduction and background are well summarized in their last paragraph:</p>
<blockquote><p>In summary, most recent studies focus on developing effective and robust multi-agent DRL algorithms to achieve coordination among intersections. The number of intersections in those studies are usually limited, thus their results might not apply to large open network. Although the signal control is indeed a continuing problem, it has been always modeled as an episodic process. From the perspective of traffic considerations, expert knowledge has only been incorporated in down-scaling the size of the control problem or designing novel reward functions for DRL algorithm. Few studies have tested their methods given different traffic demands, or shed lights on the learning performance under different traffic conditions, especially the congestion regimes. To fill the gap, our study will treat the large-scale traffic control as a continuing problem and extend classical RL algorithm to fit it. More importantly, noticing the lack of traffic considerations on learning performance, we will train DRL policies under different density levels and explore the results from a traffic flow perspective.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Set-up">Set up<a class="anchor-link" href="#Set-up">¶</a></h1><h2 id="Traffic">Traffic<a class="anchor-link" href="#Traffic">¶</a></h2><p><img src="http://atlas.wolfram.com/01/01/184/01_01_108_184.gif#right" alt="CA Rule 184"></p>
<p>This is elementary cellular automaton (CA) rule 184. Elementary cellular automata operate on a binary vector, producing a new binary vector in each step that's a function of the previous one. For each entry in the previous vector, the new value of the corresponding entry in the resulting vector depends on the previous entry and its neighbors to the left and right. There are 256 possible rules with this formulation, and this picture is of the 184th rule set when ordered in the natural way.</p>
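<p>To make the numbering concrete, here is a small sketch (my own illustration, not from the paper) that expands a rule number into its lookup table. It assumes the conventional ordering of neighborhoods from 111 down to 000; the helper name <code>rule_table</code> is hypothetical.</p>

```python
def rule_table(rule_number):
    """Return {(left, center, right): next_center} for an elementary CA rule."""
    bits = f"{rule_number:08b}"  # 184 -> "10111000"
    # Neighborhoods in the conventional order 111, 110, 101, ..., 000.
    neighborhoods = [(l, c, r) for l in (1, 0) for c in (1, 0) for r in (1, 0)]
    return {nb: int(bit) for nb, bit in zip(neighborhoods, bits)}


table_184 = rule_table(184)
for nb, nxt in table_184.items():
    print(nb, "->", nxt)
```

<p>Reading the table row by row, a cell ends up occupied either when a car moves in from the left into an empty cell, or when a car is blocked by the car to its right.</p>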
<p>Rule 184 can be thought of as a flow of cars along a lane of traffic. Cars move forward (right) by one cell each step only if there is an open space in front of them, otherwise they wait for one to open up. Here's an example:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [1]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">rule_184</span><span class="p">(</span><span class="n">lane</span><span class="p">):</span>
    <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="kc">False</span><span class="p">]</span> <span class="o">+</span> <span class="n">lane</span> <span class="o">+</span> <span class="p">[</span><span class="kc">False</span><span class="p">]</span> <span class="c1"># pad</span>
    <span class="k">return</span> <span class="p">[(</span><span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="ow">or</span> <span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="ow">and</span> <span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>

<span class="k">def</span> <span class="nf">show</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">lane</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="s1">'t</span><span class="si">{t}</span><span class="s1">:</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="s1">'π'</span> <span class="k">if</span> <span class="n">i</span> <span class="k">else</span> <span class="s1">'_'</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">lane</span><span class="p">])</span> <span class="p">)</span>

<span class="n">ti</span> <span class="o">=</span> <span class="p">[</span><span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">7</span><span class="p">):</span>
    <span class="n">show</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ti</span><span class="p">)</span>
    <span class="n">ti</span> <span class="o">=</span> <span class="n">rule_184</span><span class="p">(</span><span class="n">ti</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>t0: π π π π π _ _ π _ _ _ _ _ _ _
t1: π π π π _ π _ _ π _ _ _ _ _ _
t2: π π π _ π _ π _ _ π _ _ _ _ _
t3: π π _ π _ π _ π _ _ π _ _ _ _
t4: π _ π _ π _ π _ π _ _ π _ _ _
t5: _ π _ π _ π _ π _ π _ _ π _ _
t6: _ _ π _ π _ π _ π _ π _ _ π _
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The cellular automaton simulates a lane of traffic, and the authors wire two of these lanes up between each adjacent traffic light to create a grid network. The network is laid out on a torus, so there are no boundaries.</p>
<blockquote><p>The signalized network corresponds to a homogeneous grid network of bidirectional streets, with one lane per direction of length $n = 5$ cells between neighboring traffic lights.</p>
</blockquote>
<p><img src="https://computable.ai/images/signalized_network.png" alt="Signalized network"></p>
<blockquote><p>The connecting links to form the torus are shown as dashed directed links; we have omitted the cells on these links to avoid clutter. Each segment has n = 5 cells; an additional cell has been added downstream of each segment to indicate the traffic light color.</p>
</blockquote>
<p>Cars arriving at a green traffic light choose a random "direction" in which to continue. Green lights are on for a minimum of three steps.</p>
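<p>The "no boundaries" idea can be sketched on a single lane by wrapping it into a ring, so cars leaving the right end re-enter on the left. This is only a one-lane simplification of the paper's torus grid (no intersections or signals), and the <code>flow</code> helper is my own hypothetical definition: the fraction of cells holding a car that is free to advance.</p>

```python
def step_ring(lane):
    """One rule-184 step on a ring road (periodic boundary, so no cars are lost)."""
    n = len(lane)
    return [(lane[i - 1] and not lane[i]) or (lane[i] and lane[(i + 1) % n])
            for i in range(n)]


def flow(lane):
    """Fraction of cells whose car will advance on the next step."""
    n = len(lane)
    return sum(lane[i] and not lane[(i + 1) % n] for i in range(n)) / n


lane = [True] * 5 + [False] * 10  # 5 cars on a 15-cell ring
for _ in range(20):
    lane = step_ring(lane)
print(sum(lane), flow(lane))
```

<p>Because the ring is closed, the number of cars never changes, which is what lets density be treated as a fixed experimental parameter.</p>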
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Learning">Learning<a class="anchor-link" href="#Learning">¶</a></h2><p>Each traffic signal is managed by an agent, which has two actions it can take at any time step: turn the light red/green for the North-South approaches, or the opposite. The state observable by each agent is an $8\times n$ matrix of bits corresponding to the four incoming and four outgoing CA vectors, and the output is the probability of turning the light red for the North-South approaches. Only one neural net is actually trained, and used by all agents, since there's no reason for them to be different in this formulation. For the DRL agent, the reward is the <em>incremental</em> average flow per lane (not the average flow per lane), which the authors mention is lower-variance. The authors use a custom infinite-horizon variant of REINFORCE they call REINFORCE-TD.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Experiments">Experiments<a class="anchor-link" href="#Experiments">¶</a></h1><p>The authors use a longest-queue-first (LQF) greedy algorithm as their baseline for comparison, which services the lane with the longest queue length at all times.</p>
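<p>A minimal sketch of what such a greedy baseline might look like at one intersection. The details are my assumptions rather than the paper's: a queue is counted as the run of consecutive cars backed up at the stop line, ties go to North-South, and the minimum green time is ignored. The names <code>queue_length</code> and <code>lqf_phase</code> are hypothetical.</p>

```python
def queue_length(cells):
    """Count the cars backed up at the stop line (the last cell of the lane)."""
    length = 0
    for occupied in reversed(cells):
        if not occupied:
            break
        length += 1
    return length


def lqf_phase(incoming):
    """Pick the phase ('NS' or 'EW') whose approach has the longest queue."""
    ns = max(queue_length(incoming["N"]), queue_length(incoming["S"]))
    ew = max(queue_length(incoming["E"]), queue_length(incoming["W"]))
    return "NS" if ns >= ew else "EW"  # tie-break toward NS is arbitrary


incoming = {
    "N": [False, False, True, True, True],
    "S": [False, True, False, False, True],
    "E": [True, False, False, True, True],
    "W": [False, False, False, False, False],
}
print(lqf_phase(incoming))
```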
<h2 id="Random-policies">Random policies<a class="anchor-link" href="#Random-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure4.png" alt="Figure 4"></p>
<p>They begin by randomly reinitializing the parameters of the neural network, and discover that ~15% of random policies are competitive (that is, they can outperform LQF for some traffic densities). They also note a previously undiscovered pattern that "all policies, no matter how bad, are best when the density exceeds approximately 75%." How odd.</p>
<h2 id="Supervised-learning-policies">Supervised learning policies<a class="anchor-link" href="#Supervised-learning-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure5.png" alt="Figure 5"></p>
<p>They then train a policy with supervised learning, and surprisingly, with only the two obvious extreme examples, the resulting policy is near-optimal.</p>
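<p>To get a feel for how two examples could be enough, here is a toy sketch. It is not the paper's network: I assume a single-layer logistic model over the incoming cells only, and I assume the two extreme states are "only East-West cars" (North-South should be red) and "only North-South cars" (North-South should be green).</p>

```python
import math

N_CELLS = 5  # cells per lane, as in the paper's network


def features(ns_in, ew_in):
    """Flatten NS and EW incoming occupancy (lists of 0/1) into one vector."""
    return ns_in + ew_in


# Two assumed extreme states, labeled with the probability of NS red.
s1 = features([0] * 2 * N_CELLS, [1] * 2 * N_CELLS)  # only EW cars -> NS red
s2 = features([1] * 2 * N_CELLS, [0] * 2 * N_CELLS)  # only NS cars -> NS green
data = [(s1, 1.0), (s2, 0.0)]


def predict(w, b, x):
    """Logistic model: probability of turning the NS approaches red."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


w, b = [0.0] * len(s1), 0.0
for _ in range(200):  # plain gradient descent on the cross-entropy loss
    for x, y in data:
        err = predict(w, b, x) - y
        w = [wi - 0.1 * err * xi for wi, xi in zip(w, x)]
        b -= 0.1 * err

print(predict(w, b, s1), predict(w, b, s2))
```

<p>In this toy model every East-West cell ends up with a positive weight and every North-South cell with a negative one, so intermediate states are scored by roughly comparing how many cars wait on each axis.</p>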
<h2 id="DRL-policies">DRL policies<a class="anchor-link" href="#DRL-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure6.png" alt="Figure 6"></p>
<blockquote><p>Policies trained with constant demand and random initial parameters $\theta$. The label in each diagram gives the iteration number and the constant density value. First column: NS red probabilities of the extreme states, $\pi(s1)$ in dashed line and $\pi(s2)$ in solid line. The remaining columns show the flow-density diagrams obtained at different iterations, and the last column shows the iteration producing the highest flow at $k = 0.5$, if not reported on a earlier column.</p>
</blockquote>
<p>Finally, they run two experiments with DRL policies, as described above. These policies seem to do rather poorly in general compared to random search and supervised learning, and as density increases, they stop learning much of anything.</p>
<blockquote><p>We conjecture that this result is a consequence of a property of congested urban networks and has nothing to do with the algorithm to train the DRL policy.</p>
</blockquote>
<p>I'm skeptical. See my parting thoughts.</p>
<p>The other experiments the authors perform just confirm that average flow per lane does worse than incremental average flow per lane.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>In the end, I'm way more interested in the experimental setup of this paper than the conclusions. As usual, I learned a ton, and I may actually use rule 184 as a model for traffic flow on something.</li>
<li>Isn't it <em>obvious</em> given their problem formulation that the agents can't learn under conditions of congestion, since it means their input is essentially whited out? I would be more impressed with the conclusion if a neural net with complete visibility had trouble learning with congestion. It also seems to me <em>extremely</em> suggestive that a supervised policy can learn from only two examples, and I would very much like to see if the major conclusions of this paper explode with a more realistic network topology. Queueing theory contains all sorts of counterintuitive surprises, and it seems likely to me that their results are more indicative of one of those surprises, rather than some deep fact about DRL's ability to manage urban congestion.</li>
<li>It's interesting that they formulate the problem as a continuing one, against the prevailing trend in the traffic signal control literature. I agree with them, that even if you get to a state where there's no traffic, that's a function of the demand, not of the agent's choices. I bring this up because I too have found that it's <em>really quite important</em> to recognize an infinite-horizon problem when you have one, or else your agent learns to rack up debts until the end of the artificial episode when all is "forgiven".</li>
<li>It's fascinating that all random policies, no matter how bad, are best around 75% congestion. I have been admonished to avoid scheduling myself at more than 70% capacity to avoid the ringing effect. I wonder if this is an empirical vindication of that...</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
arXiv highlights July 28-August 3 20192019-08-04T00:00:00-04:002019-08-04T00:00:00-04:00Daniel Coxtag:computable.ai,2019-08-04:/articles/2019/Aug/04/arxiv-highlights-july-28-august-3-2019.html<p>Hierarchical RL for concurrent discovery of compound and composable policies.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Just a sketch this week, of <a href="https://arxiv.org/abs/1905.09668">Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies</a>.</p>
<p>I've been hearing hierarchical RL mentioned frequently lately, and while I understand it's a way to encode human expertise to achieve otherwise intractable goals, it has also seemed a bit like cheating. However, I have a day job, and this serves as a healthy dose of pragmatism. I also think that even when the goal is fundamental progress, it's often a good idea to achieve the goal <em>in any way possible</em>, and then follow up by working the cheats out of the system one by one. So when I read the abstract of this paper, I was feeling more receptive than previously.</p>
<p>Part of what made hierarchical RL seem not worth the cheating was how kludgy and inefficient the usual methods were, retraining a whole new policy from scratch for each subtask. That's why this week's paper caught my eye:</p>
<blockquote><p>... we propose an algorithm for learning both compound and composable policies <strong>within the same learning process</strong> by exploiting the off-policy data generated from the compound policy.</p>
</blockquote>
<p>Their resulting algorithm, "Hierarchical Intentional-Unintentional Soft Actor-Critic" (HIU-SAC), efficiently trains all sub-policies simultaneously, choosing actions to perform in the environment using a weighted average of the "votes" of all sub-policies, with weights given by a learned selector network (which is <em>also</em> simultaneously trained).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Composable-hierarchical-RL">Composable hierarchical RL<a class="anchor-link" href="#Composable-hierarchical-RL">¶</a></h1><h2 id="Architecture">Architecture<a class="anchor-link" href="#Architecture">¶</a></h2><p><img alt="Hierarchical policy diagram" src="https://computable.ai/images/policy_network.png#right" height="300px" width="300px" style="margin: 10px" /></p>
<p>The composite policy consists of the individual policy networks, each with its own reward function, trained to take observations $s$ in and output parameters of a conditional Gaussian. There is also a special activation vector selector network trained on the same states to produce weights corresponding to how much each constituent policy applies to the current state. All of these networks share early layers, since they all benefit from an accurate high-level state representation. Finally, some function $f$ takes all of these outputs and determines what action $a$ to <em>actually</em> take in the environment.</p>
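<p>The combining function $f$ is not spelled out above. One plausible choice, assumed here purely for illustration, is a weighted product of the sub-policies' Gaussians: precisions are scaled by the selector weights and added, and the mean is the precision-weighted average. The function name and the numbers are hypothetical.</p>

```python
def compose_gaussians(means, stds, weights):
    """Combine 1-D Gaussians by a weighted product: weighted precisions add."""
    precisions = [w / (s * s) for w, s in zip(weights, stds)]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    std = (1.0 / total) ** 0.5
    return mean, std


# Two hypothetical sub-policies voting on a single action dimension.
mean, std = compose_gaussians(means=[1.0, -1.0], stds=[0.5, 0.5], weights=[0.75, 0.25])
print(mean, std)
```

<p>When all sub-policies are equally confident, this reduces to a weighted average of their means, which matches the "votes" intuition.</p>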
<p><img alt="Q-value function diagram" src="https://computable.ai/images/q_fcn_network.png#left" height="250px" width="250px" style="margin: 10px" /></p>
<p>The Q function networks are similarly arranged, sharing early layers which take a state $s$ and an action $a$ to produce a Q function for each subtask, as well as a composite Q function.</p>
<div style="clear:both"> </div><h2 id="Simultaneous-learning">Simultaneous learning<a class="anchor-link" href="#Simultaneous-learning">¶</a></h2><blockquote><p>Most methods learn the composable tasks one at a time, and later, the compound task. This procedure is not scalable as all the experience collected for each learning process is only used for that specific process. Also, it is not possible to start learning more complex tasks unless all the composable policies have been successfully learned. The method proposed in this section is based on the idea that a single stream of experience can be used to improve not only the policy that is generating the behavior but also, indirectly, many other policies.</p>
</blockquote>
<p>The authors refer to the composite policy acting as the "intentional" policy (the "behavior" policy in an off-policy setting), and the composable sub-policies as the "unintentional" policies (each one a "target" policy in an off-policy setting). They use a variation on SAC to train the composite and composable policies simultaneously within the maximum entropy framework.</p>
<p>The objective function for the Q networks simply minimizes the sum of the expected mean-squared Bellman errors of all Q networks, with the expectation taken over the tuples in the replay buffer $\mathcal{D}$. The objective function for the policy is simply the sum of the objective functions for each intentional and unintentional policy. Each policy objective optimizes, over the states in $\mathcal{D}$, the expected difference between the Q value and the log-probability (scaled by temperature $\alpha$) of the actions that policy selects. HIU-SAC then alternates between policy evaluation and policy improvement steps following SAC.</p>
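<p>Written out in the usual SAC form, these objectives look roughly as follows. This is a sketch in standard SAC notation rather than the paper's own: the task index $i$ (covering the compound task and every composable task), the target value $V_i$, and the discount $\gamma$ are assumptions of this note. Both quantities are minimized.</p>
$$J_Q = \sum_i \mathbb{E}_{(s,a,r_i,s') \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_i(s,a) - \left(r_i + \gamma V_i(s')\right)\right)^2\right]$$
$$J_\pi = \sum_i \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_i(\cdot|s)}\left[\alpha \log \pi_i(a|s) - Q_i(s,a)\right]$$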
<h2 id="The-importance-of-maximizing-entropy-to-adequate-exploration">The importance of maximizing entropy to adequate exploration<a class="anchor-link" href="#The-importance-of-maximizing-entropy-to-adequate-exploration">¶</a></h2><p>It is interesting that the entropy-maximizing RL objective was <em>absolutely necessary</em> for exploring broadly enough to train all of these policies at once.</p>
<blockquote><p>Note that populating the replay memory buffer with rich experiences is essential for acquiring multiple skills in an off-policy manner. The composable policies learned unintentionally had similar performance than the policies obtained in single-task formulations only when the compound policy was able to efficiently explore the environment. For this reason, the algorithm was built on a maximum entropy RL framework to favor exploration during the learning process.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>In a way, the methods proposed here seem rather obvious, and I found this paper quite easy to understand given that it violated none of my expectations. I also haven't been paying enough attention to hierarchical RL to know off-hand why training the sub-policies in parallel off of the same recorded environment interactions hasn't been tried before (or whether it has been without my notice). Perhaps it was necessary for off-policy RL to reach a level of maturity sufficient for sub-policies to see enough relevant data to train? In any case, don't hear me faulting the authors for trying the obvious. It is a relief, and <em>non</em>-obvious, that a straightforward formulation works so well.</li>
<li>I'd love to see this work combined with imitation learning and inverse RL to figure out what sub-policies are necessary in the first place from demonstrations. That seems like a very practical framework for real-world learning.</li>
</ol>
</div>
</div>
</div>
arXiv highlights July 21-27 20192019-07-28T00:00:00-04:002019-07-28T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-28:/articles/2019/Jul/28/arxiv-highlights-july-21-27-2019.html<p>Efficient exploration with self-imitation learning.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Several papers caught my eye this week, but I'll be discussing only <a href="https://arxiv.org/abs/1907.10247">Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy</a> in more depth. I'm choosing this paper because, as happens sometimes, I had this idea myself a few weeks ago. It's especially exciting to see something you suspected might improve the world fleshed out and vindicated.</p>
<p>This is the basic form of my shower-thought idea:</p>
<blockquote><p>This paper investigates the imitation of diverse past trajectories and how that leads [to] further exploration and avoids getting stuck at a sub-optimal behavior. Specifically, we propose to use a buffer of the past trajectories to cover diverse possible directions. Then we learn a trajectory-conditioned policy to imitate any trajectory from the buffer, treating it as a demonstration. After completing the demonstration, the agent performs random exploration.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-problem">The problem<a class="anchor-link" href="#The-problem">¶</a></h1><p><img src="https://computable.ai/images/maze_icon_map.png#right" alt="Maze"></p>
<p>The main problem the authors want to solve is insufficient exploration leading to a sub-optimal policy. If you don't explore your environment enough, you will find local rewards, but miss globally optimal rewards. In this maze (their Figure 1), you can see that an agent that fails to explore will collect two apples in the next room, but may miss acquiring the key, unlocking the door, collecting an apple, and discovering the treasure.</p>
<p>In Montezuma's Revenge, an Atari game that is notoriously difficult for RL agents, it is similarly extremely unlikely that random exploration will suffice to explore the environment and achieve a high score. The authors report state-of-the-art performance without expert demonstrations on Montezuma's Revenge, netting 25k points.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="SOTA-without-demonstrations">SOTA without demonstrations<a class="anchor-link" href="#SOTA-without-demonstrations">¶</a></h1><p>So, more precisely, how did they achieve this, and why does it work?</p>
<blockquote><p>The main idea of our method is to maintain a buffer of diverse trajectories collected during training and to train a trajectory-conditioned policy by leveraging reinforcement learning and supervised learning to roughly follow demonstration trajectories sampled from the trajectory buffer. Therefore, the agent is encouraged to explore beyond various visited states in the environment and gradually push its exploration frontier further... We name our method as Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL).</p>
</blockquote>
<h2 id="The-trajectory-buffer">The trajectory buffer<a class="anchor-link" href="#The-trajectory-buffer">¶</a></h2><p>Their trajectory buffer $\mathcal{D}$ contains $N$ 3-tuples $\{\left(e^{(1)}, \tau^{(1)}, n^{(1)}\right), \left(e^{(2)}, \tau^{(2)}, n^{(2)}\right), \ldots \left(e^{(N)}, \tau^{(N)}, n^{(N)}\right) \}$ where $e^{(i)}$ is a high-level state representation, $\tau^{(i)}$ is the shortest trajectory achieving the highest reward and arriving at $e^{(i)}$, and $n^{(i)}$ is the number of times $e^{(i)}$ has been encountered. Whenever they roll out a new episode, they check each high-level state representation encountered against those in $\mathcal{D}$, increment $n$, and if $\tau$ is better they replace $\tau$ for that entry.</p>
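<p>The bookkeeping described above can be sketched with a dictionary keyed by the high-level embedding. This is my own illustration under simplifying assumptions: embeddings are hashable, a trajectory is stored as the list of embeddings visited so far, and "better" means higher cumulative reward, or equal reward with fewer steps. The name <code>update_buffer</code> is hypothetical.</p>

```python
def update_buffer(buffer, episode):
    """buffer: {embedding: {"traj": [...], "reward": float, "count": int}}.
    episode: list of (embedding, cumulative_reward) pairs in visit order."""
    for t, (e, reward) in enumerate(episode):
        prefix = [emb for emb, _ in episode[: t + 1]]  # trajectory that reached e
        entry = buffer.get(e)
        if entry is None:
            buffer[e] = {"traj": prefix, "reward": reward, "count": 1}
            continue
        entry["count"] += 1
        better = reward > entry["reward"] or (
            reward == entry["reward"] and len(prefix) < len(entry["traj"]))
        if better:
            entry["traj"], entry["reward"] = prefix, reward
    return buffer


buffer = {}
update_buffer(buffer, [("start", 0.0), ("hall", 0.0), ("key", 1.0)])
update_buffer(buffer, [("start", 0.0), ("key", 1.0)])
print(buffer["key"])
```

<p>After the second episode, the entry for <code>"key"</code> keeps the same reward but swaps in the shorter route and a visit count of 2.</p>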
<h2 id="Sampling">Sampling<a class="anchor-link" href="#Sampling">¶</a></h2><p>When training their trajectory-conditioned policy, they sample each 3-tuple with weight $\frac{1}{\sqrt{n^{(i)}}}$. Notice that this will cause them to sample <em>less</em> frequently-visited states more often, encouraging exploration.</p>
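<p>A quick sketch of that weighting, with made-up visit counts, shows how strongly it favors rarely visited entries. The helper name <code>sampling_probs</code> is hypothetical; <code>random.choices</code> is the standard-library weighted sampler.</p>

```python
import random


def sampling_probs(counts):
    """Normalize weights 1/sqrt(n) into sampling probabilities."""
    weights = [1.0 / n ** 0.5 for n in counts]
    total = sum(weights)
    return [w / total for w in weights]


counts = [1, 4, 100]  # hypothetical visit counts n for three buffer entries
probs = sampling_probs(counts)
index = random.choices(range(len(counts)), weights=probs, k=1)[0]
print(probs, index)
```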
<h2 id="Imitation-reward">Imitation reward<a class="anchor-link" href="#Imitation-reward">¶</a></h2><p>Given a trajectory $g$ sampled from the buffer, the agent receives a positive imitation reward during interaction with the environment whenever the embedding of its current state matches one of the next $\Delta t$ embeddings in $g$, counted from the point in $g$ it has reached so far. Otherwise the imitation reward is 0. Once it reaches the end of $g$, there is no further imitation reward, and it explores randomly. The imitation reward is one of two components of the $r^{DTSIL}_{t}$ RL reward, where the other is a simple monotonic function of the reward received at each timestep.</p>
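<p>Here is how I picture the imitation reward, as a sketch under my own assumptions: embeddings can be compared with <code>==</code>, a pointer tracks how far along $g$ the agent has matched, and the bonus value of 0.1 is a placeholder rather than the paper's constant.</p>

```python
def imitation_reward(e_t, g, pointer, delta_t, bonus=0.1):
    """Return (reward, new_pointer). `pointer` is the index in g reached so far
    (-1 before anything has been matched)."""
    window = range(pointer + 1, min(pointer + 1 + delta_t, len(g)))
    for u in window:
        if g[u] == e_t:
            return bonus, u  # matched: reward and advance along the demonstration
    return 0.0, pointer


g = ["a", "b", "c", "d"]
pointer, total = -1, 0.0
for e in ["a", "x", "c", "d", "d"]:
    r, pointer = imitation_reward(e, g, pointer, delta_t=2)
    total += r
print(pointer, total)
```

<p>Note that the agent is allowed to skip "b" because "c" still falls inside the window, which is what makes the imitation approximate rather than exact.</p>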
<h2 id="Policy-architecture">Policy architecture<a class="anchor-link" href="#Policy-architecture">¶</a></h2><p>The DTSIL policy architecture is recurrent and attentional, inspired by machine translation!</p>
<blockquote><p>Inspired by neural machine translation methods, the demonstration trajectory is the source sequence and the incomplete trajectory of the agent's state representations is the target sequence. We apply a recurrent neural network and an attention mechanism to the sequence data to predict actions that would make the agent to follow the demonstration trajectory.</p>
</blockquote>
<h2 id="RL-objective">RL objective<a class="anchor-link" href="#RL-objective">¶</a></h2><p>DTSIL is trained using a policy gradient algorithm (PPO, in their experiments), and RL loss</p>
$$\mathcal L^{RL} = {\mathbb{E}}_{\pi_\theta} [-\log \pi_\theta(a_t|e_{\leq t}, o_t, g) \widehat{A}_t]$$<p>where $$\widehat{A}_t=\sum^{n-1}_{d=0} \gamma^{d}r^\text{DTSIL}_{t+d} + \gamma^n V_\theta(e_{\leq t+n}, o_{t+n}, g) - V_\theta(e_{\leq t}, o_t, g)$$</p>
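<p>The advantage estimate is just an $n$-step return minus a baseline. A tiny numeric sketch, with made-up rewards and value estimates and a hypothetical helper name, makes the arithmetic explicit:</p>

```python
def n_step_advantage(rewards, v_t, v_t_plus_n, gamma):
    """rewards: [r_t, ..., r_{t+n-1}] (DTSIL rewards); returns the estimate A_t."""
    n = len(rewards)
    discounted = sum(gamma ** d * r for d, r in enumerate(rewards))
    return discounted + gamma ** n * v_t_plus_n - v_t


# Hypothetical numbers: three rewards of 1, V(t) = 2, V(t+3) = 4, gamma = 0.5.
adv = n_step_advantage([1.0, 1.0, 1.0], v_t=2.0, v_t_plus_n=4.0, gamma=0.5)
print(adv)
```

<p>Here the discounted rewards sum to 1.75, the bootstrapped tail contributes $0.5^3 \times 4 = 0.5$, and subtracting the baseline of 2 leaves an advantage of 0.25.</p>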
<h2 id="SL-objective">SL objective<a class="anchor-link" href="#SL-objective">¶</a></h2><p>In each parameter optimization step, they also include a supervised loss that maximizes the log probability of the actions that imitate the chosen demonstration exactly, to better leverage a past trajectory $g$.</p>
$$\mathcal L^\text{SL} = - \log \pi_\theta(a_t|e_{\leq t}, o_t, g) \text{, where } g = \{e_0, e_1, \cdots, e_{|g|}\}$$<h2 id="Optimization">Optimization<a class="anchor-link" href="#Optimization">¶</a></h2><p>The final parameter update is thus</p>
$$\theta \gets \theta - \eta \nabla_\theta (\mathcal{L}^\text{RL}+\beta \mathcal{L}^\text{SL})$$
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I <em>love</em> seeing methods developed for generative language models used in another context entirely, to generate another kind of sequence. I'm overjoyed that it worked well.</li>
<li>They need a high-level embedding for two reasons: first because storing entire trajectories exactly in memory is expensive, and second because it's quite difficult to re-execute a previously-encountered trajectory exactly, so in order for this method to work at all it's important that an <em>approximate</em> re-execution be possible.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Distributional Deep Q-Learning2019-07-26T00:00:00-04:002019-07-26T00:00:00-04:00Braden Hoaglandtag:computable.ai,2019-07-26:/articles/2019/Jul/26/distributional-deep-q-learning.html<p>Expanding DQN to produce estimates of return distributions, and an exploration into why this helps learning</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overview">Overview<a class="anchor-link" href="#Overview">¶</a></h1><p>I recently stumbled upon the world of distributional Q-learning, and I hope to share some of the insights I've gained from reading the following papers:</p>
<ul>
<li>A Distributional Perspective on Reinforcement Learning: <a href="https://arxiv.org/abs/1707.06887">https://arxiv.org/abs/1707.06887</a></li>
<li>Implicit Quantile Networks for Distributional Reinforcement Learning: <a href="https://arxiv.org/abs/1806.06923">https://arxiv.org/abs/1806.06923</a></li>
</ul>
<p>This article will loosely work through the two papers in order, as they build on each other, but hopefully I can trim off most of the extraneous information and present you with a nice overview of distributional RL, how it works, and how to improve upon the most basic distributional algorithms to get to the current state-of-the-art.</p>
<p>First I'll introduce distributional Q-learning and try to provide some motivations for using it. Then I'll highlight the strategies used in the development of C51, one of the first highly successful distributional Q-learning algorithms (paper #1). Then I'll introduce implicit quantile networks (IQNs) and explain their improvements to C51 (paper #2).</p>
<p><em>Quick disclaimer: I'm assuming you're familiar with how Q-learning works. That includes V and Q functions, Bellman backups, and the various learning stability tricks like target networks and replay buffers that are commonly used.</em></p>
<p><em>Another important note is that these algorithms are only for <strong>discrete</strong> action spaces.</em></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Motivations-for-Distributional-Deep-Q-Learning">Motivations for Distributional Deep Q-Learning<a class="anchor-link" href="#Motivations-for-Distributional-Deep-Q-Learning">¶</a></h1><p>In standard Q-Learning, we attempt to learn a function $Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that maps state-action pairs to the expected return from that state-action pair. This gives us a pretty accurate idea of how good specific actions are in specific states (if our $Q$ is accurate), but it's missing some information. There exist distributions of returns that we can receive from each state-action pair, and the expectations (means) of these distributions are what $Q$ attempts to learn. But why only learn the expectation? Why not try to learn the whole distribution?</p>
<p>Before diving into the algorithms that have been developed for this specific purpose, it's helpful to think about why this is beneficial in the first place. After all, learning a distribution is a lot more complicated than learning a single number, and we don't want to waste precious computational resources on doing something that doesn't help much.</p>
<h3 id="Stabilized-Learning">Stabilized Learning<a class="anchor-link" href="#Stabilized-Learning">¶</a></h3><p>The first possibility I'll throw out there is that learning distributions could stabilize learning. This may seem unintuitive at first, seeing as we're trying to learn something much more complicated than an ordinary $Q$ function. But let's think about what happens when stochasticity in our environment results in our agent receiving a highly unusual return. I'll use the example of driving a car through an intersection.</p>
<p>Let's say you're waiting at a red light that turns green. You begin to drive forward, expecting to simply cruise through the intersection and be on your way. Your internal model of your driving is probably saying "there's no way anything bad will happen if you go straight right now", and there's no reason to think otherwise. But now let's say another driver on the road perpendicular to yours runs straight through their red light and crashes into you. You would be right to be incredibly surprised by this turn of events (and hopefully not dead, either), but how surprised should you be?</p>
<p>If your internal driving model was based only on expected returns, then you wouldn't predict that this accident would occur at all. And since it just <em>did</em> happen, you may be tempted to drastically change your internal model and, as a result, be scared of intersections for quite a bit until you're convinced that they're safe again; however, what if your driving model was based on a <em>distribution</em> over all possible returns? If you mentally assigned a probability of 0.00001 to this accident occurring, and if you've driven through 100,000 intersections before throughout your lifetime, then this accident isn't really that surprising. It still totally sucks and your car is probably totaled, but you shouldn't be irrationally scared of intersections now. After all, you just proved that your model was right!</p>
<p>So yeah that's kinda dark, but I think it highlights how learning a distribution instead of an expectation can reduce the effects of environment stochasticity.<sup>1</sup></p>
<h3 id="Risk-Sensitive-Policies">Risk Sensitive Policies<a class="anchor-link" href="#Risk-Sensitive-Policies">¶</a></h3><p>Using distributions over returns also allows us to create brand new classes of policies that take risk into account when deciding which actions to take. I'll use another example that doesn't involve driving but is equally deadly :) Let's say you need to cross a gorge in the shortest amount of time possible (I'm not sure why, but you do. This is a poorly formulated example). You have two options: take a sketchy bridge that looks like it may fall apart at any moment, or walk down a set of stairs on one side of the gorge and then up a set of stairs on the other side. The latter option is incredibly safe. It'll still take significantly longer than using the bridge, though, so is it worth it?</p>
<p>For the purposes of this example, let's give dying a reward of $-1000$ and give every non-deadly interaction with the environment a reward of $-1$. Let's also say that taking the bridge gets you across the gorge in $10$ seconds with probability $0.5$ of making it across safely. Taking the stairs gets you across the gorge $100\%$ of the time, but it takes $100$ seconds instead.</p>
<p>Given this information, we can quickly calculate expected returns for each of the two actions</p>
$$
\mathbb{E}[\text{return}_\text{bridge}] = (-1000 * 0.5) + (-10 * 0.5) = -505 \\
\mathbb{E}[\text{return}_\text{stairs}] = -100
$$<p>If you made decisions like a standard Q-learning agent, you would never take the bridge. The expected return is much worse than that of taking the stairs, so there's no reason to choose it. But if you made decisions like a distributional Q-learning agent, your decision can be much more well informed. You can be aware of the probability of dying vs. getting across the gorge more quickly by using the bridge. If the risk of falling to your death is worth it in your particular situation (let's say you're being chased by a wild animal who can run much faster than you), then taking the bridge instead of the stairs could end up being what you want.</p>
<p>Although this example was pretty contrived, it highlights how using return distributions allows us to choose policies that before would have been impossible to formulate. Want a policy that takes as little risk as possible? We can do that now. Want a policy that takes as much risk as possible? Go right ahead, but please don't fall into any gorges.</p>
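<p>Here's a small sketch of the gorge example in code. The outcome lists use the numbers from above; the helper names and the 50-second deadline are my own additions for illustration.</p>

```python
# Each outcome is (probability, return), using the numbers from the example.
bridge = [(0.5, -1000.0), (0.5, -10.0)]
stairs = [(1.0, -100.0)]


def expected_return(dist):
    """What a standard Q-learning agent would look at."""
    return sum(p * g for p, g in dist)


def prob_at_least(dist, threshold):
    """Probability that the return is at least `threshold`."""
    return sum(p for p, g in dist if g >= threshold)


def worst_case(dist):
    """Lowest return that has nonzero probability."""
    return min(g for p, g in dist if p > 0)


# Hypothetical deadline: something catches you after 50 seconds,
# so only returns of -50 or better count as escaping.
escape_bridge = prob_at_least(bridge, -50.0)
escape_stairs = prob_at_least(stairs, -50.0)
```

<p>The expected returns favor the stairs, but once there's a deadline the bridge is the only action with any chance of success. An agent that only knows the means can't see that; an agent that knows the distributions can.</p>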
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-Distributional-Q-Learning-Framework">The Distributional Q-Learning Framework<a class="anchor-link" href="#The-Distributional-Q-Learning-Framework">¶</a></h1><p>So now we have a few reasons why using distributions over returns instead of just expected return can be useful, but we need to formulate a few things first so that we can use Q-learning strategies in this new setting.</p>
<p>We'll define $Z(s, a)$ to be the distribution of returns at a given state-action pair, where $Q(s, a)$ is the expected value of $Z(s, a)$.</p>
<p>The usual Bellman equation for $Q$ is defined</p>
$$
Q(s, a) = \mathbb{E}[r(s, a)] + \gamma \mathbb{E}[Q(s', a')]
$$<p>Now we'll change this to be defined in terms of entire distributions instead of just expectations by using $Z$ instead of $Q$. We'll denote the distribution of rewards for a single state-action pair $R(s,a)$.</p>
$$
Z(s, a) = R(s, a) + \gamma Z(s', a')
$$<p>All we need now is a way of iteratively enforcing this Bellman constraint on our $Z$ function. With standard Q-learning, we can do that quite simply by minimizing mean squared error between the outputs of a neural network (which approximates $Q$) and the values $\mathbb{E}[r(s, a)] + \gamma \mathbb{E}[Q(s', a')]$ computed using a target Q-network and transitions sampled from a replay buffer.</p>
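<p>As a quick sanity check on the distributional Bellman equation, here's a sample-based sketch. The reward and next-state return distributions are made up; the point is only that applying $r + \gamma z$ sample by sample gives a whole distribution whose mean satisfies the ordinary Bellman equation.</p>

```python
import random
from statistics import mean

rng = random.Random(0)
gamma = 0.9

# Hypothetical samples of the reward R(s, a) and the next-state return Z(s', a').
rewards = [rng.choice([1.0, 2.0]) for _ in range(10000)]
next_returns = [rng.choice([0.0, 10.0]) for _ in range(10000)]

# Distributional backup: apply r + gamma * z to each pair of samples.
z_samples = [r + gamma * z for r, z in zip(rewards, next_returns)]

# Taking the mean of Z recovers the usual Bellman equation for Q.
q_from_z = mean(z_samples)
q_bellman = mean(rewards) + gamma * mean(next_returns)

# The backed-up distribution keeps information the mean throws away.
distinct_returns = sorted(set(z_samples))
```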
<p>Such a straightforward solution doesn't exist in the distributional case because the output from our Z-network is so much more complex than from a Q-network. First we have to decide what kind of distribution to output. Can we approximate return distributions with a simple Gaussian? A mixture of Gaussians? Is there a way to output a distribution of arbitrary complexity? Even if we can output really complex distributions, can we sample from that in a tractable way? And once we've decided on how we'll represent the output distribution, we'll then have to choose a new metric to optimize other than mean squared error since we're no longer working with just scalar outputs. Many ways of measuring the difference between probability distributions exist, but we'll have to choose one to use.</p>
<p>These two problems are what the C51 and IQN papers deal with. They both take different approaches to approximating arbitrarily complex return distributions, and they optimize them differently as well. Let's start off with C51: the algorithm itself is a bit complex, but its foundational ideas are rather simple. I won't dive into the math behind C51, and I'll instead save that for IQN since that's the better algorithm.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="C51">C51<a class="anchor-link" href="#C51">¶</a></h1><p>The main idea behind C51 is to approximate the return distribution using a set of discrete bars which the paper authors call 'atoms'. This is like using a histogram to plot out a distribution. It's not the most accurate, but it gives us a good sense of what the distribution looks like in general. This strategy also leads to an optimization strategy that isn't too computationally expensive, which is what we want.</p>
<p>Our network can simply output $N$ probabilities, where all $N$ probabilities sum to $1$. Each of these probabilities represents one of the bars in our distribution approximation. The paper recommends using 51 atoms (network outputs) based on empirical tests, but the algorithm is defined so that you don't need to know the number of atoms beforehand.</p>
<p>To minimize the difference between our current distribution outputs and their target values, the paper recommends minimizing the KL divergence of the two distributions. They accomplish this indirectly by minimizing the cross entropy between the distributions instead.</p>
<p>The idea behind this is simple enough, but the math gets a bit funky. Since the distribution that our network outputs is split into discrete units, the theoretical Bellman update has to be projected into that discrete space and the probabilities of each atom distributed to neighboring atoms to keep the distribution relatively smooth.</p>
<p>To actually use the discretized distribution to make action choices, the paper authors just use the weighted mean of the atoms. This weighted mean is effectively just an approximation of the standard Q-value.</p>
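<p>Even though I'm skipping the math, a small sketch of the projection step may help. This is my reading of the categorical projection, written in plain Python for a single transition; the support bounds <code>v_min</code> and <code>v_max</code>, the function names, and the five-atom toy example are assumptions for illustration.</p>

```python
import math


def project_onto_atoms(probs, reward, gamma, v_min, v_max):
    """Project the distribution of r + gamma * z back onto the fixed atoms."""
    n = len(probs)
    delta_z = (v_max - v_min) / (n - 1)
    projected = [0.0] * n
    for j, p in enumerate(probs):
        z_j = v_min + j * delta_z
        # Bellman-updated atom location, clipped to the support.
        tz = min(v_max, max(v_min, reward + gamma * z_j))
        # Fractional index of tz on the atom grid (clamped against rounding).
        b = min(float(n - 1), max(0.0, (tz - v_min) / delta_z))
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            projected[lower] += p
        else:
            # Split the mass between the two neighboring atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


def q_value(probs, v_min, v_max):
    """Weighted mean of the atoms, used for action selection."""
    delta_z = (v_max - v_min) / (len(probs) - 1)
    return sum(p * (v_min + j * delta_z) for j, p in enumerate(probs))


# Toy example: atoms at 0, 1, 2, 3, 4 with all mass on the atom at 2.
projected = project_onto_atoms([0.0, 0.0, 1.0, 0.0, 0.0], reward=0.5, gamma=0.5,
                               v_min=0.0, v_max=4.0)
q = q_value(projected, v_min=0.0, v_max=4.0)
```

<p>The updated atom lands at $0.5 + 0.5 \cdot 2 = 1.5$, which isn't on the grid, so its probability is split evenly between the atoms at $1$ and $2$.</p>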
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="IQN">IQN<a class="anchor-link" href="#IQN">¶</a></h1><p>C51 works well, but it has some pretty obvious flaws. First off, its distribution approximations aren't going to be very precise. We can use a massive neural network during training, but all those neurons' information gets funneled into just $N$ output atoms at the end of the day. This is the bottleneck on how accurate our network can get, but increasing the number of atoms will increase the amount of computation our algorithm requires.</p>
<p>A second issue with C51 is that it doesn't take full advantage of knowing return distributions. When deciding which actions to take, it just uses the mean of its approximate return distribution. Under optimality, this is really no different than standard Q-learning.</p>
<p>Implicit quantile networks address both of these issues: they allow us to approximate much more complex distributions without additional computation requirements, and they also allow us to easily decide how risky our agent will be when acting.</p>
<h3 id="Implicit-Networks">Implicit Networks<a class="anchor-link" href="#Implicit-Networks">¶</a></h3><p>The first issue with C51 is addressed by not explicitly representing a return distribution with our neural networks. If we did represent it explicitly, our chosen representation of the distribution would act as a major bottleneck in terms of how accurate our approximations can be. Additionally, sampling from arbitrarily complex distributions is intractable if we want to represent them explicitly. IQN's solution: don't train a network to explicitly represent a distribution, train a network to provide samples from the distribution instead.</p>
<p>Since we aren't explicitly representing any distributions, that means our accuracy bottleneck rests entirely in the size of our neural network. This means we can easily make our distribution approximations more accurate without adding on much to the amount of required computation.</p>
<p>Additionally, since our network is being trained to provide us samples from some unknown distribution, the intractable sampling problem goes away.</p>
<p>The second issue with C51 (not using risk-sensitive policies) is also addressed by using implicit networks. We haven't gone over how we'll actually <em>implement</em> such networks, but trust me when I say that we'll be able to easily manipulate the input to them to induce risky or risk-averse action decisions.</p>
<h3 id="Quantile-Functions">Quantile Functions<a class="anchor-link" href="#Quantile-Functions">¶</a></h3><p>Before we go through the implementation of these mysterious implicit networks, we have to go over a few other things about probability distributions that we'll use when deriving the IQN algorithm.</p>
<p>First off, every probability distribution has what's called a cumulative distribution function (CDF). If the probability of getting the value $35$ out of a probability distribution $P(X)$ is denoted $P(X = 35)$, then the <em>cumulative</em> probability of getting a value of at most $35$ from that distribution is $P(X \leq 35)$.</p>
<p>The CDF of a distribution does exactly that, except it defines a cumulative probability for all possible outputs of the distribution. You can think of the CDF as really just an integral from the beginning of a distribution up to a given point on it. A nice property of CDFs is that their outputs are bounded between 0 and 1. This should be pretty intuitive, since the integral over a probability distribution has to be equal to 1. An example of a CDF for a unit Gaussian distribution is shown below.</p>
<p><img alt="CDF" src="https://computable.ai/images/cdf.svg" /></p>
<p>Quantile functions are closely related to CDFs. In fact, they're just the inverse. CDFs take in an $x$ and return a probability, but quantile functions take in a probability and return an $x$. The quantile function for a unit Gaussian (same as with the previous example CDF) is shown below.</p>
<p><img alt="Quantile Function" src="https://computable.ai/images/quantile.svg" /></p>
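<p>If you want to poke at this relationship yourself, Python's standard library has both functions for Gaussians. This is just a quick check that the quantile function undoes the CDF; the variable names are mine.</p>

```python
from statistics import NormalDist

unit_gaussian = NormalDist(mu=0.0, sigma=1.0)

# CDF: takes an x, returns a cumulative probability in [0, 1].
p = unit_gaussian.cdf(1.0)

# Quantile function (inverse CDF): takes a probability, returns an x.
x = unit_gaussian.inv_cdf(p)

# The median of a unit Gaussian is the 0.5 quantile.
median = unit_gaussian.inv_cdf(0.5)
```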
<h3 id="Representing-an-Implicit-Distribution">Representing an Implicit Distribution<a class="anchor-link" href="#Representing-an-Implicit-Distribution">¶</a></h3><p>Now we can finally get to the fun stuff: figuring out how to represent an arbitrarily complex distribution implicitly. Seeing as I just went on a bit of a detour to talk about quantile functions, you probably already know that that's what we're gonna use. But how and why will that work for us?</p>
<p>First off, quantile functions all have the same input domain, regardless of whatever distribution they're for. Your distribution could be uniform, Gaussian, energy-based, whatever really, and its quantile function would only accept input values between 0 and 1. Since we want to represent any arbitrary distribution, this definitely seems like a property that we want to take advantage of.</p>
<p>Additionally, using quantile functions allows us to sample directly from our distribution without ever having an explicit representation of the distribution. Sampling from the uniform distribution $U([0, 1])$ and passing that as input to our quantile function is equivalent to sampling directly from $Z(s, a)$. Since we can implement this entirely within a neural network, this means there's no major accuracy bottleneck either.</p>
<p>We can also add in another feature to our implicit network to give us the ability to make risk-sensitive policy decisions. We can quite simply distort the input to our quantile network. If we want to make the tails of our distribution less important, for example, then we can map input values closer to 0.5 before passing them to our quantile function.</p>
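<p>Both ideas can be sketched with a known quantile function standing in for the network. Here I use the unit Gaussian's inverse CDF as a stand-in for a learned quantile network, and <code>shrink_toward_median</code> is a hypothetical distortion I made up to pull inputs toward 0.5.</p>

```python
import random
from statistics import NormalDist, mean, pstdev

# Stand-in for a learned quantile network (assumption: unit Gaussian returns).
quantile_fn = NormalDist(0.0, 1.0).inv_cdf


def sample_tau(rng):
    """Uniform sample kept strictly inside (0, 1), where inv_cdf is defined."""
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)


def shrink_toward_median(tau, strength=0.5):
    """Hypothetical distortion: pull tau toward 0.5 so the tails matter less."""
    return 0.5 + (1.0 - strength) * (tau - 0.5)


rng = random.Random(0)
taus = [sample_tau(rng) for _ in range(20000)]

# Uniform samples pushed through the quantile function are samples of Z.
plain = [quantile_fn(t) for t in taus]

# Distorting tau first concentrates the samples around the median.
distorted = [quantile_fn(shrink_toward_median(t)) for t in taus]

plain_mean, plain_std = mean(plain), pstdev(plain)
distorted_std = pstdev(distorted)
```

<p>The undistorted samples look like a unit Gaussian (mean near 0, standard deviation near 1), while the distorted ones have a much smaller spread because the extreme quantiles are never queried.</p>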
<h3 id="Formalization">Formalization<a class="anchor-link" href="#Formalization">¶</a></h3><p>We've gone over a lot, so let's take a step back and formalize it a bit. The usual convention for denoting a quantile function over random variable $Z$ (our return) would be $F^{-1}_{Z}(\tau)$, where $\tau \in [0, 1]$. For simplicity's sake, though, we'll define</p>
$$
Z_\tau \doteq F^{-1}_{Z}(\tau)
$$<p>We can also define sampling from $Z(s, a)$ with the following</p>
$$
Z_\tau(s, a), \\
\tau \sim U([0, 1])
$$<p>To distort our $\tau$ values, we'll define a mapping</p>
$$
\beta : [0, 1] \rightarrow [0, 1]
$$<p>Putting these definitions together, we can define a new distorted Q-value</p>
$$
Q_{\beta}(s, a) \doteq \mathbb{E}_{\tau \sim U([0, 1])} [Z_{\beta(\tau)}(s, a)]
$$<p>To define our policy, we can just take whichever action maximizes this distorted Q-value</p>
$$
\pi_{\beta}(s) = \arg\max\limits_{a \in \mathcal{A}} Q_{\beta}(s, a)
$$<h3 id="Optimization">Optimization<a class="anchor-link" href="#Optimization">¶</a></h3><p>Now to figure out a way to iteratively update our distribution approximations... We'll use Huber quantile loss, a nice metric that extends Huber loss to work with quantiles instead of just scalar outputs</p>
$$
\rho^\kappa_\tau(\delta_{ij}) = | \tau - \mathbb{I}\{ \delta_{ij} < 0 \} | \frac{\mathcal{L}_\kappa(\delta_{ij})}{\kappa}, \text{with} \\
\mathcal{L}_\kappa(\delta_{ij}) = \begin{cases}
\frac{1}{2} \delta^2_{ij} &\text{if } | \delta_{ij} | \le \kappa \\
\kappa (| \delta_{ij} | - \frac{1}{2} \kappa) &\text{otherwise}
\end{cases}
$$<p>This is a messy loss term, but it essentially tries to minimize TD error while keeping the network's output close to what we expect the quantile function to look like (according to our current approximation).</p>
<p>This loss metric is based on the TD error $\delta_{ij}$, which we can define just like normal TD error</p>
$$
\delta_{ij} = r + \gamma Z_{\tau'_j}(s', \pi_\beta(s')) - Z_{\tau_i}(s, a)
$$<p>Notice how in this definition, $\tau_i$ and $\tau'_j$ are two separate samples from the $U([0, 1])$ distribution: $\tau_i$ indexes the current estimate and $\tau'_j$ indexes the target. We use two separate samples to keep the terms in the TD error definition decorrelated. To get a more accurate estimation of the loss, we'll sample it multiple times in the following fashion</p>
$$
\mathcal{L} = \frac{1}{N'} \sum_{i=1}^N \sum_{j=1}^{N'} \rho^\kappa_{\tau_i}(\delta_{ij})
$$<p>where the $N$ samples $\tau_i$ and the $N'$ samples $\tau'_j$ are drawn independently from $U([0, 1])$ for every update.</p>
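<p>Here's a rough sketch of the quantile Huber loss and the pairwise sum in plain Python. I'm assuming the standard Huber condition $|\delta| \le \kappa$, and that the $\tau$ weighting each term is the one fed to the current estimate $Z(s, a)$, as in the IQN paper. The callables <code>z_online</code> and <code>z_target</code> are hypothetical stand-ins for the online and target networks evaluated at fixed state-action pairs.</p>

```python
import random


def quantile_huber(delta, tau, kappa=1.0):
    """rho^kappa_tau(delta): Huber loss weighted asymmetrically by tau."""
    abs_d = abs(delta)
    if abs_d <= kappa:
        huber = 0.5 * delta * delta
    else:
        huber = kappa * (abs_d - 0.5 * kappa)
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * huber / kappa


def iqn_loss(z_online, z_target, reward, gamma, rng, n=8, n_prime=8):
    """Sum the loss over N x N' pairs of tau samples, divided by N'.

    z_online(tau) stands for Z_tau(s, a); z_target(tau) stands for
    Z_tau(s', pi(s')). Both are assumptions standing in for networks.
    """
    taus = [rng.random() for _ in range(n)]
    taus_prime = [rng.random() for _ in range(n_prime)]
    total = 0.0
    for tau in taus:
        for tau_p in taus_prime:
            delta = reward + gamma * z_target(tau_p) - z_online(tau)
            total += quantile_huber(delta, tau)
    return total / n_prime


def constant(value):
    """Toy quantile function that ignores tau."""
    return lambda tau: value


rng = random.Random(0)
# The online estimate already matches the target: every TD error is 0.
loss_matched = iqn_loss(constant(1.0), constant(0.0), 1.0, 0.9, rng)
# The online estimate is too low: every TD error is +1.
loss_too_low = iqn_loss(constant(0.0), constant(0.0), 1.0, 0.9, rng)
```

<p>Note the asymmetry: with $\tau = 0.9$, a TD error of $+2$ costs $0.9 \cdot 1.5 = 1.35$ while a TD error of $-2$ costs only $0.1 \cdot 1.5 = 0.15$, which is what pushes the output toward the $0.9$ quantile rather than the mean.</p>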
<p>Finally, we'll approximate $\pi_\beta$, which we defined earlier, using a similar sampling technique</p>
$$
\tilde{\pi}_\beta(s) = \arg\max\limits_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^K Z_{\beta(\tau_k)}(s, a)
$$<p>where $\tau_k$ is newly sampled every time as well.</p>
<p>That was a lot, but it's all we need to make an IQN. We could spend time thinking about different choices of $\beta$, but that's really a choice that depends on your specific environment. And during implementation, you can just decide that $\beta$ will be the identity function and then change it later if you think you can get better performance with risk-aware action selection.</p>
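<p>To see what a choice of $\beta$ does in practice, here's a toy sketch that reuses the gorge example from earlier. The quantile functions are written by hand instead of learned, the expectation is approximated on an evenly spaced grid of $\tau$ values rather than random samples, and <code>upper_half</code> is a risk-seeking distortion I invented for the illustration.</p>

```python
def z_bridge(tau):
    """Quantile function for the bridge: -1000 half the time, -10 otherwise."""
    return -1000.0 if tau < 0.5 else -10.0


def z_stairs(tau):
    """Quantile function for the stairs: always -100."""
    return -100.0


QUANTILE_FNS = {"bridge": z_bridge, "stairs": z_stairs}


def distorted_q(z, beta, k=1000):
    """Approximate Q_beta = E[Z_{beta(tau)}] on a midpoint grid over [0, 1]."""
    taus = [(i + 0.5) / k for i in range(k)]
    return sum(z(beta(t)) for t in taus) / k


def policy(beta):
    """Pick the action with the highest distorted Q-value."""
    return max(QUANTILE_FNS, key=lambda a: distorted_q(QUANTILE_FNS[a], beta))


def identity(tau):
    """No distortion: recovers the ordinary expected return."""
    return tau


def upper_half(tau):
    """Hypothetical risk-seeking distortion: only look at the best outcomes."""
    return 0.5 + 0.5 * tau


risk_neutral_action = policy(identity)
risk_seeking_action = policy(upper_half)
```

<p>With the identity the agent takes the stairs, exactly like a standard Q-learning agent would. With the risk-seeking distortion it ignores the bad half of the bridge's distribution and takes the bridge.</p>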
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Review">Review<a class="anchor-link" href="#Review">¶</a></h1><p>We started off with the more obvious way of implementing distributional deep Q-learning, which was explicitly representing the return distribution. Although it worked well, using an explicit representation of the return distribution created an accuracy bottleneck that was hard to overcome. It was also difficult to inject risk-sensitivity into the algorithm.</p>
<p>Using an implicit distribution instead allowed us to get around those two problems, giving us much greater representational power and allowing us much greater control over how our agent handles risk.</p>
<p>Of course, there's always room for improvement. Small techniques like using prioritized experience replay and n-step returns for calculating TD error can be used to make the IQN algorithm more powerful. And since distributional RL is still a pretty new field, there will no doubt be major improvements coming down the academia pipeline to be on the lookout for.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Footnotes">Footnotes<a class="anchor-link" href="#Footnotes">¶</a></h1><p><sup>1</sup> see paper #1, section 6.1 for a short discussion of what the paper authors call 'chattering'</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
arXiv highlights July 14-20 20192019-07-21T00:00:00-04:002019-07-21T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-21:/articles/2019/Jul/21/arxiv-highlights-july-14-20-2019.html<p>Better imitation learning with self-correcting policies by negative sampling.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's highlight is a paper on imitation learning: <a href="https://arxiv.org/abs/1907.05634">Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling</a>, chosen again for pragmatic reasons. The problem my team is currently working on has both reasons for wanting high sample efficiency: training would be prohibitively slow without something to kickstart it, and actions taken in the real world can get expensive.</p>
<p>I know I said I'd be experimenting with shorter, more bite-sized posts, but... next time. (If you want that, you can just stop reading after the "Key intuition" section.)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-problem">The problem<a class="anchor-link" href="#The-problem">¶</a></h1><p>Learning from demonstrations is more difficult than it may seem at first glance. The trouble mainly stems from covariate shift: the input distribution your agent will see in production is very likely to be different than that encountered during training. Many machine learning algorithms have this problem, reinforcement learning algorithms included, but imitation learning has it especially bad, for a simple reason: the expert demonstrations you are attempting to follow necessarily explore a very small subset of the state space. The whole <em>point</em> of them is to stay on good trajectories, meaning bad trajectories never get explored.</p>
<p>This causes two issues:</p>
<ol>
<li>The agent can't in general figure out how to get back into the subset of state space where the expert demonstrations apply, even if it gets only slightly off-course, and</li>
<li>Value functions for states and actions are affected by unseen states, making it very <em>likely</em> that the agent will wander off as soon as it's allowed.</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Key-intuition">Key intuition<a class="anchor-link" href="#Key-intuition">¶</a></h1><p>The authors solve this problem by pre-training with supervised learning using a loss function that drives down the value of all states outside of those explored in the expert demonstrations $U$, by an amount proportional to their Euclidean distance from the closest state in $U$. In their own words:</p>
<blockquote><p>Consider a state $s$ in the demonstration and its nearby state $\tilde{s}$ that is not in the demonstration. The key intuition is that $\tilde{s}$ should have a lower value than $s$, because otherwise $\tilde{s}$ likely should have been visited by the demonstrations in the first place. If a value function has this property for most of the pair $(s,\tilde{s})$ of this type, the corresponding policy will tend to correct its errors by driving back to the demonstration states because the demonstration states have locally higher values.</p>
</blockquote>
<p>And Figure 1 is a nice visual demonstration:</p>
<p><a href="https://computable.ai/images/VINS_Figure_1.jpeg"><img alt="VINS Figure 1" src="https://computable.ai/images/VINS_Figure_1.jpeg" /></a></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Value-Iteration-with-Negative-Sampling-(VINS)">Value Iteration with Negative Sampling (VINS)<a class="anchor-link" href="#Value-Iteration-with-Negative-Sampling-(VINS)">¶</a></h1><p>Into the weeds now.</p>
<h2 id="Self-correctable-policy">Self-correctable policy<a class="anchor-link" href="#Self-correctable-policy">¶</a></h2><p>The first bit of their algorithm is the definition of their self-correcting policy. It's essentially a formalization of what we said above about $s$ and $\tilde{s}$.</p>
<p>If $s \in U$ (if $s$ is in the expert demonstrations), then $$V(s) = V^{\pi_e}(s) \pm \delta_V$$ ("just what the value would be in the expert demonstrations, plus some error").</p>
<p>But if $s \not\in U$, $$V(s) = V^{\pi_e}(\Pi_U(s)) - \lambda \|s-\Pi_U(s)\| \pm \delta_V$$ (where $\Pi_U$ gives the closest $s \in U$, so $V(s)$ is "the value of the closest $s \in U$, <em>minus the distance to that</em> $s \in U$, plus some error")</p>
<p>Then the induced policy from this value function is $$\pi(s) \triangleq \underset{a: \|a-\pi_{BC}(s)\|\le \zeta}{\operatorname{argmax}} ~V(M(s, a))$$</p>
<p>Where $M(s,a)$ is a learned dynamical model of the environment that gives the next state given the current state and action. $\pi_{BC}(s)$ is the "behavioral clone" policy from the expert demonstrations.</p>
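<p>A one-dimensional toy sketch of the induced policy, to make the argmax concrete. Everything here is hypothetical: the demonstration states, the hand-written value function (zero on the demonstrations, decreasing with distance), the additive dynamics, and the do-nothing behavioral clone. The search over actions is a simple grid within $\zeta$ of $\pi_{BC}(s)$.</p>

```python
DEMO_STATES = [0.0, 1.0, 2.0]


def value_fn(s):
    """Toy V: 0 on demonstration states, minus the distance to the nearest one."""
    return -min(abs(s - u) for u in DEMO_STATES)


def model(s, a):
    """Toy dynamics M(s, a): the action is added to the state."""
    return s + a


def bc_policy(s):
    """Toy behavioral clone: always proposes doing nothing."""
    return 0.0


def induced_policy(state, zeta, num_candidates=21):
    """argmax of V(M(s, a)) over a grid of actions within zeta of pi_BC(s)."""
    a_bc = bc_policy(state)
    candidates = [a_bc - zeta + 2 * zeta * i / (num_candidates - 1)
                  for i in range(num_candidates)]
    return max(candidates, key=lambda a: value_fn(model(state, a)))


# The agent has drifted to s = 1.3, which is not a demonstration state.
action = induced_policy(1.3, zeta=0.5)
next_state = model(1.3, action)
```

<p>The behavioral clone alone would leave the agent stranded at $1.3$, but the value function's slope pulls it back to the demonstration state at $1.0$.</p>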
<h2 id="RL-algorithm">RL algorithm<a class="anchor-link" href="#RL-algorithm">¶</a></h2><p>To actually achieve $V(M(s,a))$ with the necessary properties, they select a state $s$ from the demonstrations, perturb it a bit to get $\tilde{s}$ nearby, and use the original state $s$ to approximate $\Pi_U(\tilde{s})$ in the following loss function.</p>
$$\mathcal{L}_{ns}(\phi)= \mathbf{E}_{s \sim \rho^{\pi_e}, \tilde{s} \sim perturb(s)} \left(V_{\bar \phi}(s) - \lambda \|s-\tilde{s}\|- V_\phi(\tilde{s}) \right)^2$$<p>Finally, here's the algorithm that uses this and the earlier policy definition:</p>
<p><img src="https://computable.ai/images/VINS_Algorithm_2.jpeg#center" alt="VINS Algorithm 2"></p>
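<p>The negative sampling loss itself is short enough to sketch directly. This is a plain-Python illustration, not the authors' code: the Gaussian perturbation, the two-dimensional demonstration states, and the hand-written value functions are all assumptions, and <code>target_value_fn</code> plays the role of the target network $V_{\bar\phi}$.</p>

```python
import math
import random

LAM = 2.0
DEMO_STATES = [(0.0, 0.0), (1.0, 0.0)]


def negative_sampling_loss(value_fn, target_value_fn, demo_states, lam,
                           noise_scale, rng, samples_per_state=4):
    """Mean of (V_target(s) - lam * ||s - s_tilde|| - V(s_tilde))^2."""
    total, count = 0.0, 0
    for s in demo_states:
        for _ in range(samples_per_state):
            # Perturb the demonstration state to get a nearby negative sample.
            s_tilde = tuple(x + rng.gauss(0.0, noise_scale) for x in s)
            target = target_value_fn(s) - lam * math.dist(s, s_tilde)
            total += (target - value_fn(s_tilde)) ** 2
            count += 1
    return total / count


def cone_value(s):
    """Toy V with the desired shape: falls off with distance to the demos."""
    return -LAM * min(math.dist(s, u) for u in DEMO_STATES)


def flat_value(s):
    """Toy V that ignores the state entirely."""
    return 0.0


loss_cone = negative_sampling_loss(cone_value, cone_value, DEMO_STATES, LAM,
                                   noise_scale=0.05, rng=random.Random(0))
loss_flat = negative_sampling_loss(flat_value, flat_value, DEMO_STATES, LAM,
                                   noise_scale=0.05, rng=random.Random(0))
```

<p>A value function that already slopes away from the demonstrations incurs essentially no loss, while a flat one is penalized for every perturbed state, which is exactly the pressure that carves out the self-correcting shape.</p>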
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I thought it was quite strange that they learned $V(s)$ and a dynamical model $M(s,a)$, and then used $V(M(s,a))$ in the algorithm. I thought, "Why not just learn $Q$?" The answer, given in their Appendix A, was quite interesting. I'm not sure it applies to our case, but it's important. TL;DR: $Q(s,a)$ learned from demonstrations <em>alone</em> is degenerate, because there's always a $Q$ that perfectly matches the demonstrations <em>and doesn't depend at all on</em> $a$.</li>
<li>One of my coworkers (and upcoming Computable author!) wondered to me if the induced policy could be made explicit, by training a policy network to bring the agent back into safe territory. It could be trained with gradient descent, because $V$ and $M$ are both just networks, so $V(M(s,a))$ is differentiable with respect to $a$, and the usual technique for training deterministic policies just follows the gradient of the $Q$ function. I wonder too.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Look at This: Where We See Shapes, AI Sees Textures2019-07-16T00:00:00-04:002019-07-16T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-16:/articles/2019/Jul/16/look-at-this-where-we-see-shapes-ai-sees-textures.html<p>CNNs trained in "the usual way" tend to learn something different than you might expect. They learn to recognize textures (local structure) rather than shapes (global structure).</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="New-Series">New Series<a class="anchor-link" href="#New-Series">¶</a></h1><p><img src="http://weknowmemes.com/wp-content/uploads/2011/12/look-at-this-duck.jpg#right" style="margin-left:15px" width="350" height="268" /></p>
<p>We're starting a simple new series called Look at This, where we briefly plug an article that taught us something.</p>
<p>Our first highlight will be a Quanta article about what CNNs learn when trained in "the usual way":</p>
<p><a href="https://www.quantamagazine.org/where-we-see-shapes-ai-sees-textures-20190701/">Where We See Shapes, AI Sees Textures</a></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Textures,-not-shapes">Textures, not shapes<a class="anchor-link" href="#Textures,-not-shapes">¶</a></h1><p>Training a CNN for object recognition typically involves only showing the algorithm many examples of images that contain or don't contain a target object. Humans also need to see many examples of various objects to get the basic idea. Humans, however, seem to have a bias towards recognition by <em>shape</em> which is missing from CNNs in general.</p>
<blockquote><p>Geirhos, Bethge and their colleagues created images that included two conflicting cues, with a shape taken from one object and a texture from another: the silhouette of a cat colored in with the cracked gray texture of elephant skin, for instance, or a bear made up of aluminum cans, or the outline of an airplane filled with overlapping clock faces. Presented with hundreds of these images, humans labeled them based on their shape – cat, bear, airplane – almost every time, as expected. Four different classification algorithms, however, leaned the other way, spitting out labels that reflected the textures of the objects: elephant, can, clock.</p>
</blockquote>
<p>This is a problem worth solving, since the addition of even a small amount of noise can throw off CNN-based classifiers, where humans aren't fooled. "Adversarial examples" even do this maliciously, adding exactly the right amount of noise to cause misclassification. So how to fix this?</p>
<blockquote><p>Geirhos wanted to see what would happen when the team forced their models to ignore texture. The team took images traditionally used to train classification algorithms and "painted" them in different styles, essentially stripping them of useful texture information. When they retrained each of the deep learning models on the new images, the systems began relying on larger, more global patterns and exhibited a shape bias much more like that of humans.</p>
</blockquote>
<p><img src="https://d2r55xnwy6nx47.cloudfront.net/uploads/2019/07/AI_Textures_2880x1220_LHPA.jpg" alt="Images painted with alien textures"></p>
<p>There were many other insights in this relatively short article, and I commend it to you. It enriched my understanding of what's going on in neural networks, and how far we still need to go to reach parity with humans.</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
arXiv highlights July 7-13 20192019-07-14T00:00:00-04:002019-07-14T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-14:/articles/2019/Jul/14/arxiv-highlights-july-7-13-2019.html<p>Way Off-Policy Batch DRL using a generative model of pre-recorded trajectories, and bias correction.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Only one paper this week, <em>not</em> because <a href="https://arxiv.org/abs/1905.04819">others</a> failed to catch my eye, but for brevity. Let me know in the comments if you agree that shorter or more focused articles are more attractive. So this week I'll be examining just one paper: <a href="https://arxiv.org/abs/1907.00456">Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog</a>. As with last week's papers, this week's is interesting to me professionally. Batch DRL is a way to solve the sample efficiency problem, from a certain perspective. It's mostly the online learning that costs too much when sample efficiency is low, so solving the problems that come with attempting to train offline might allow us to do many of the same things we could do if we had high online sample efficiency.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="RL-for-open-domain-dialog-generation">RL for open-domain dialog generation<a class="anchor-link" href="#RL-for-open-domain-dialog-generation">¶</a></h1><p>The authors' domain is dialog generation. They want to build a better chat bot, and they have quite a few recorded conversations. RL is good at refining these processes, but has a cold-start problem, plus they would certainly prefer to make use of the data they have on hand. For this, they need to be able to make use of offline data, hence "<em>Way</em> Off-Policy". This data is so off-policy it wasn't even <em>generated</em> by a policy.</p>
<p>So they want to train DRL from samples acquired from some other control of the system (in their case, human interaction data), much like <a href="https://arxiv.org/abs/1704.03732">Deep Q-learning from Demonstrations</a>. There are a couple of reasons this is important for others such as myself:</p>
<blockquote><p>First, since collecting real-world interaction data can be expensive and time-consuming, algorithms must be able to leverage off-policy data - collected from vastly different systems, far into the past - in order to learn.</p>
<p>Second, it is often necessary to carefully test a policy before deploying it to the real world; for example, to ensure its behavior is safe and appropriate for humans. Thus the algorithm must be able to learn offline first, from a static batch of data, without the ability to explore</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="A-generative-model-+-Q-learning">A generative model + Q learning<a class="anchor-link" href="#A-generative-model-+-Q-learning">¶</a></h1><p>The authors first pre-train a generative model on the distribution of collected trajectories, and initialize the Q networks from this model. They then sample a fixed number of actions from it, and output the one with the highest Q-value as their policy's decision. In later reinforcement learning, they penalize their model for KL-divergence from this distribution.</p>
<blockquote><p>To perform batch Q-learning, we first pre-train a generative model of $p(a|s)$ using a set of known environment trajectories. In our case, this model is then used to generate the batch data via human interaction. The weights of the Q-network and target Q-network are initialized from the pre-trained model, which helps reduce variance in the Q-estimates and works to combat overestimation bias. To train $Q_{\theta_\pi}$ we sample $\langle s_t, a_t, r_t, s_{t+1} \rangle$ tuples from the batch, and update the weights of the Q-network to approximate Eq. 1. This forms our baseline model, which we call Batch Q</p>
</blockquote>
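<p>The action-selection step described above (sample a fixed number of actions from the pre-trained model, then act on the one with the highest Q-value) can be sketched as below. This is an illustration under assumptions, not the paper's code: actions are treated as a small discrete set, and <code>prior_probs_fn</code> and <code>q_fn</code> are hypothetical callables standing in for the generative model $p(a|s)$ and the Q-network.</p>

```python
import numpy as np


def select_action(state, prior_probs_fn, q_fn, n_samples, rng):
    """Sample candidate actions from the prior p(a|s) and return the one with highest Q."""
    probs = np.asarray(prior_probs_fn(state), dtype=float)  # p(a|s) over discrete actions
    candidates = rng.choice(len(probs), size=n_samples, p=probs)
    q_values = np.array([q_fn(state, a) for a in candidates])
    return int(candidates[int(np.argmax(q_values))])
```

<p>Note the consequence: an action the prior never proposes can never be selected, no matter how high its Q-value is, which keeps the policy close to behavior seen in the batch.</p>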
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overestimation-bias">Overestimation bias<a class="anchor-link" href="#Overestimation-bias">¶</a></h1><blockquote><p>Most deep RL algorithms fail to learn from data that is not heavily correlated with the current policy. Even models based on off-policy algorithms like Q-learning fail to learn when the model is not able to explore during training. This is due to the fact that such algorithms are inherently optimistic in the face of uncertainty.</p>
</blockquote>
<p>If you're taking the <code>max</code> of something (as in Bellman-equation-based algorithms), then the higher the variance of the estimates, the higher the expected <code>max</code> value. This causes an over-estimation bias. We may have seen a really high value for some state once, so now we over-value that state, despite it being atypical. It may not be immediately obvious why this is a <em>problem</em>, but which states are we likely to overvalue? Precisely the states we haven't visited often. Why is <em>that</em> a problem? This sounds good for exploration, right? But if we're trying to train our agent with canned data, it's important that the live agent stick pretty close to the states where the canned data does well, and it's counter-productive to have it believe that everywhere <em>but</em> the pre-explored state space is worth exploring.</p>
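<p>A quick simulation makes the bias visible. In this toy setup (entirely an assumption for illustration), ten actions all have a true value of zero, and we only get to see noisy estimates of them. The true best value is zero, yet the average of the <code>max</code> over the noisy estimates is positive and grows with the noise level.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)  # every action is truly worth exactly 0


def mean_max_estimate(noise_std, n_trials=10_000):
    """Average, over many trials, of the max over noisy estimates of true_q."""
    noisy = true_q + noise_std * rng.normal(size=(n_trials, true_q.size))
    return float(noisy.max(axis=1).mean())


low_noise = mean_max_estimate(0.1)   # well-visited states: small estimation error
high_noise = mean_max_estimate(1.0)  # rarely-visited states: large estimation error
print(low_noise, high_noise)         # both positive; the second is much larger
```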
<p>A popular solution to the overestimation problem in Q-learning algorithms is to train <em>two</em> Q networks on the same data, put the input through both, and take the minimum value. This helps with the bias because the two networks will likely disagree unless we can be really <em>certain</em> of the value of the input, and when they disagree we go with the more pessimistic estimate. The authors of the current paper take a different tack, training a single neural net with dropout, and using the disagreement between different dropout masks as an estimate of uncertainty.</p>
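<p>Both ideas fit in a few lines. The sketch below is illustrative only: the "network" is a single linear layer, and taking the elementwise minimum across dropout masks is one assumed way to turn mask disagreement into a pessimistic estimate, not necessarily the exact rule used in the paper.</p>

```python
import numpy as np


def clipped_double_q(q1, q2):
    """Two-network approach: keep the smaller of the two Q estimates per action."""
    return np.minimum(q1, q2)


def mc_dropout_q(state_features, weights, drop_prob, n_masks, rng):
    """Q estimates from one linear model under n_masks random dropout masks.

    Returns an array of shape (n_masks, n_actions).
    """
    keep = rng.random(size=(n_masks, state_features.size)) >= drop_prob
    masked = keep * state_features / (1.0 - drop_prob)  # inverted-dropout scaling
    return masked @ weights


def pessimistic_q(samples):
    """Lower bound across dropout masks (an assumed aggregation rule)."""
    return samples.min(axis=0)
```

<p>The spread of <code>mc_dropout_q</code> across masks plays the same role as the disagreement between two separately trained networks: where it is wide, the pessimistic estimate drops well below the mean.</p>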
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li><p>I didn't talk much about their model architecture, which is "Variational Hierarchical Recurrent Encoder Decoder (VHRED)", largely because I think if I ever tried to make use of this directly I would employ transformers instead. They do mention that transformer architectures are a "powerful alternative", but they chose to work with hierarchical architectures so they could extend their work to hierarchical control in the future. That's interesting. In my own work at the moment, the important thing is the "way off-policy" part, not so much the chat bot part.</p>
</li>
<li><p>It's very interesting to me that both of the methods for correcting overestimation bias make use of uncertainty estimators that I've seen mentioned elsewhere:</p>
<ul>
<li><a href="https://arxiv.org/abs/1905.09638">Estimating Risk and Uncertainty in Deep Reinforcement Learning</a></li>
</ul>
<blockquote><p>...we show that the disagreement between only two neural networks is sufficient to produce a low-variance estimate of the epistemic uncertainty on the return distribution, thus providing a simple and computationally cheap uncertainty metric.</p>
</blockquote>
<ul>
<li><a href="https://arxiv.org/abs/1506.02142">Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</a></li>
</ul>
<blockquote><p>...we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs</p>
</blockquote>
</li>
<li><p>This article wasn't really shorter than if I had done multiple papers, less deeply. I'll have to practice at that, not least because it's time-consuming, but information is valuable. How does Adrian Colyer do this every <em>day</em>?</p>
</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
arXiv highlights July 1-6 20192019-07-07T00:00:00-04:002019-07-07T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-07:/articles/2019/Jul/07/arxiv-highlights-july-1-6-2019.html<p>Beginning a new series highlighting a few interesting RL papers on the arXiv each week. This week: Simple curriculum learning, learning to interact with humans, and warm starting RL with propositional logic.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="New-series">New series<a class="anchor-link" href="#New-series">¶</a></h1><p>This post begins a weekly series highlighting one or more RL papers in the previous week's cs.AI arXiv stream that caught my eye (making no guarantees about the correlation between what catches my eye and what ultimately turns out to be useful, important, etc). I'll be prioritizing sustainability over most other factors, but I do hope to show you some code from time to time.</p>
<p>I read these papers to differing degrees as I have time, so there will likely be some variability in descriptive volume. However, I do pledge to make only justified statements about them so far as I know, and I welcome errata in the comments. I'm still experimenting with the format and voice, so please leave me feedback early and often to influence the series.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>All of this week's papers piqued my interest because of the sample-efficiency problem in modern DRL. Reinforcement learning algorithms need to interact with the environment quite a bit before they become good at a task, and anything that can shorten this time is of interest. My group is currently working on a learning task with a very low sample rate, so we are actively on the hunt for anything that improves sample efficiency.</p>
<ul>
<li><a href="https://arxiv.org/abs/1906.12266">Growing Action Spaces</a>, by Farquhar et al. at Oxford and Facebook AI Research.</li>
<li><a href="https://arxiv.org/abs/1906.10187v2">Learning to Interactively Learn and Assist</a>, by Woodward et al. at Google Brain.</li>
<li><a href="https://arxiv.org/abs/1902.06007v2">ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning</a>, by Silva et al. at Georgia Institute of Technology.</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Growing-Action-Spaces">Growing Action Spaces<a class="anchor-link" href="#Growing-Action-Spaces">¶</a></h2><p>Growing Action Spaces proposes a form of "curriculum learning", where a more complex task is broken down into a sequence of simpler tasks, sometimes by humans, sometimes automatically. In this case, the authors improved the learning speed of their agent by initially giving it fewer actions to work with, training for a while, and then alternating between giving it more actions to work with and training.</p>
<p>Interestingly, they were working in StarCraft, which is a real-time strategy (RTS) game, where you have to control multiple units simultaneously in a coordinated fashion to achieve some goal. Thus, in their domain, the size of the action space didn't just come from continuity or a really large discrete action space, but from the fact that the actions they were capable of taking were <em>combinatorial</em>. That is, they had to train an agent to take actions from a space including any combination of primitive actions, as well as any combinations of units; a daunting task.</p>
<p>Their solution is brilliant, and highly general: The authors broke the action space up into a hierarchy of action spaces by grouping units, and requiring that the same action be taken by all units within the same group. Then as training progressed, more groups were allowed to act independently. This resulted in a tractable problem at each stage of training, and overall high-performance policies that would have been prohibitively complex with conventional DRL algorithms.</p>
<p>If you or I want to apply this method to our own problems, the key requirement is to come up with a suitable way of breaking large action spaces into hierarchies of progressively smaller ones.</p>
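<p>As a toy illustration of why grouping helps (the unit names, the three primitive actions, and the three-stage curriculum are all made up for this sketch), here is how the joint action space grows as groups are split up. Every unit in a group is forced to take the same primitive action, so the number of joint actions is the number of primitive actions raised to the number of groups.</p>

```python
from itertools import product


def joint_actions(groups, primitive_actions):
    """Yield joint actions in which all units of a group share one primitive action."""
    for choice in product(primitive_actions, repeat=len(groups)):
        yield {unit: act for group, act in zip(groups, choice) for unit in group}


primitives = ["move", "attack", "hold"]
curriculum = [
    [("u1", "u2", "u3", "u4")],            # stage 1: all units act as one
    [("u1", "u2"), ("u3", "u4")],          # stage 2: two independent groups
    [("u1",), ("u2",), ("u3",), ("u4",)],  # stage 3: every unit independent
]
sizes = [sum(1 for _ in joint_actions(groups, primitives)) for groups in curriculum]
print(sizes)  # [3, 9, 81]
```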
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Learning-to-Interactively-Learn-and-Assist">Learning to Interactively Learn and Assist<a class="anchor-link" href="#Learning-to-Interactively-Learn-and-Assist">¶</a></h2><p>Reinforcement learning typically depends on a sparse reward signal and random exploration, both of which contribute to poor sample efficiency in modern algorithms. One method of improving sample efficiency and solving the exploration problem is imitation learning, where the agent is pre-trained to mimic expert behavior. However, expert demonstrations are expensive, and it's often difficult to know how much and of what kind will suffice. These are the problems Learning to Interactively Learn and Assist attempts to solve by proposing a different paradigm entirely: one with neither explicit demonstrations nor an explicit reward function.</p>
<p>The goal is for an agent and a "principal" (say, a human) to learn to work together to accomplish the principal's purpose. The agent takes its cues from the principal's behavior, and acts helpfully. This requires prior understanding, both of the environment and of what constitutes communication from the principal.</p>
<p>To get to this point, the authors trained an agent jointly with a "human surrogate" principal on a variety of tasks in the same environment. Each time, the principal knows the task (as part of its observation input), and the agent does not. They receive a joint reward at the end of the episode.</p>
<blockquote><p>By informing the principal of the current task and withholding rewards and gradient updates until the end of each task, the agents are encouraged to emerge interactive learning behaviors in order to inform the assistant of the task and allow them to contribute to the joint reward.</p>
</blockquote>
<p>Prior domain knowledge required to jointly accomplish a given task is trained into the agent ahead of time this way, along with the methods of communication. Actions and observations are restricted to the environment, so that later the principal may be replaced with a human.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="ProLoNets:-Neural-encoding-Human-Experts'-Domain-Knowledge-to-Warm-Start-Reinforcement-Learning">ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning<a class="anchor-link" href="#ProLoNets:-Neural-encoding-Human-Experts'-Domain-Knowledge-to-Warm-Start-Reinforcement-Learning">¶</a></h2><p>ProLoNets stands for "Propositional Logic Nets", which are a neural network architecture and method of initialization that allows a domain expert to encode initial behavior for a DRL agent in the form of propositional logic.</p>
<p>To give you the flavor:</p>
<blockquote><p>To illustrate this more practically, we consider the simplest case of a cart pole ProLoNet with a single decision node. Assume we have solicited the following from a domain expert: "If the cart's $x$ position is right of center, move left; otherwise, move right," and that they indicate <code>x_position</code> is the first input feature and that the center is at 0. We therefore initialize our primary node $D_0$ with $w_0=[1,0,0,0]$ and $c_0=0$. We then specify $l_0$ to be a new leaf with a prior of $[1,0]$. Finally, we set the path to $l_0$ to be $D_0$ and the path $l_1$ to be $(1-D_0)$. Consequently for each state, the probability distribution over the agent's two actions is a softmax over $(D_0*l_0+(1-D_0)*l_1)$</p>
</blockquote>
<p>I've barely skimmed this paper so I don't know what each of the components means, but I gather that a human-authored decision tree can be translated directly into a correctly-initialized neural network architecture, and an actor-critic algorithm takes over from there to improve beyond the human expert's baseline.</p>
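<p>Here is my reading of the quoted cart pole example as a NumPy sketch. Several pieces are assumptions rather than anything stated in the quote: the decision node is taken to be a sigmoid of $\alpha(w_0 \cdot x - c_0)$, the second leaf $l_1$ is taken to be $[0,1]$, and action index 0 is taken to mean "move left".</p>

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def prolonet_cartpole(state, w0, c0, l0, l1, alpha=1.0):
    """Action probabilities for a single-decision-node tree."""
    # Soft version of the test "is x_position right of center?" (sigmoid is assumed).
    d0 = 1.0 / (1.0 + np.exp(-alpha * (w0 @ state - c0)))
    # Leaves weighted by their path probabilities, then a softmax over actions.
    return softmax(d0 * l0 + (1.0 - d0) * l1)


w0 = np.array([1.0, 0.0, 0.0, 0.0])  # only x_position matters
c0 = 0.0                             # center is at 0
l0 = np.array([1.0, 0.0])            # leaf prior from the quote: favors action 0 ("left")
l1 = np.array([0.0, 1.0])            # assumed mirror-image leaf: favors action 1 ("right")

probs_right_of_center = prolonet_cartpole(np.array([2.0, 0.0, 0.0, 0.0]), w0, c0, l0, l1)
probs_left_of_center = prolonet_cartpole(np.array([-2.0, 0.0, 0.0, 0.0]), w0, c0, l0, l1)
```

<p>With the cart right of center the policy leans towards "left", and vice versa, but only softly. Since every piece is differentiable, an actor-critic update can then adjust $w_0$, $c_0$, and the leaves from this starting point.</p>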
<p>Something else that caught my eye:</p>
<blockquote><p>While our initialized ProLoNets are able to follow expert strategies immediately, they may lack expressive capacity to learn more optimal policies once they are deployed into a domain. ... To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic deepening procedure.</p>
<p>Upon initialization, a ProLoNet agent maintains two copies of its actor: the shallower, unaltered initialized version and a deeper version, in which each leaf is transformed into a randomly initialized node with two new randomly initialized leaves. As the agent interacts with its environment, it relies on the shallower networks to generate actions and value predictions and to gather experience. After each episode, our off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves to the entropy of the deeper actor's leaves and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor. We find that this dynamic deepening improves stability and ameliorates policy degradation.</p>
</blockquote>
<p>This strikes me as the beginning of the future, where neural network architecture is learned and adjusted dynamically alongside the network parameters.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I'm extremely pleased to have finally gotten this off the ground. Please comment on anything and everything, and we'll drive this thing together.</li>
<li>Growing Action Spaces is immediately relevant to my group, since in the medium-term, we intend to increase our action spaces combinatorially, and will inherit all of the trouble this brings. More on this another time.</li>
<li>I wonder how often in complex real environments the "Learning to Interactively Learn and Assist" agents will learn to communicate in a way that humans find unintuitive. Since the quickest way to communicate involves some compression, would we need to add some term representing human understandability? How best to do this?</li>
<li>"Learning to Interactively Learn and Assist" seems like a relevant paper for AI safety, though as far as I could tell in my quick read, it wasn't billed that way. If we train agents that don't have goals of their own necessarily, but take their cues from us in real time, are we safer than if we attempted to craft the perfect reward function, or demonstrated our desires in a one-and-done fashion?</li>
<li>I've gotta actually read the ProLoNets paper. There was even more to it than I highlighted, and they included an ablation study which will likely tell me if I can incorporate their concepts piecemeal into my own work.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Boltzmann Machines: Differentiation Work2019-03-10T00:00:00-05:002019-03-10T00:00:00-05:00Daniel Coxtag:computable.ai,2019-03-10:/articles/2019/Mar/10/boltzmann-machines-differentiation-work.html<p>My differentiation work while reading Ilya Sutskever on the biological plausibility of Boltzmann machines.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I recently read <a href="https://theneural.wordpress.com/2011/07/08/the-miracle-of-the-boltzmann-machine/">The Miracle of the Boltzmann Machine</a>, and it's so compelling that I've been thinking about it ever since. I intend to write much more on Boltzmann Machines in the future, but here I'm just going to show my work differentiating the objective function.</p>
<h3 id="Given">Given<a class="anchor-link" href="#Given">¶</a></h3><ol>
<li>Objective function $$L(W) := \mathbb{E}_{D(V)} [\log P(V)]$$</li>
<li>and probability of a given BM state $X=(V,H)$ $$P(X) := P(V,H) := {e^{X^TWX/2}\over {\sum_{X'} e^{X'^TWX'/2}}}$$
$$P(V) := \sum_H P(V,H) = \frac{\sum_H e^{X^TWX/2}}{\sum_{X'} e^{X'^TWX'/2}}$$ where $W$ is the BM weight matrix, assuming $w_{ij}=w_{ji}$</li>
</ol>
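<p>For a machine small enough to enumerate, these definitions can be evaluated directly, which is handy for sanity-checking the algebra that follows. The sketch below assumes binary units in $\{0,1\}$ and that the first <code>n_visible</code> components of $X$ are the visible units $V$; both are choices made for illustration, as are the function names.</p>

```python
import itertools

import numpy as np


def boltzmann_distribution(W):
    """Return every state X and its probability P(X) for a small symmetric W."""
    n = W.shape[0]
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    exponents = np.einsum("si,ij,sj->s", states, W, states) / 2.0  # X^T W X / 2
    unnormalized = np.exp(exponents)
    return states, unnormalized / unnormalized.sum()


def visible_marginal(states, probs, n_visible):
    """P(V): sum P(V, H) over all hidden configurations H."""
    marginal = {}
    for x, p in zip(states, probs):
        v = tuple(int(b) for b in x[:n_visible])
        marginal[v] = marginal.get(v, 0.0) + float(p)
    return marginal


def objective(W, data_dist, n_visible):
    """L(W) = sum over V of D(V) * log P(V), with D given as a dict {v: probability}."""
    states, probs = boltzmann_distribution(W)
    p_v = visible_marginal(states, probs, n_visible)
    return sum(d * np.log(p_v[v]) for v, d in data_dist.items())
```

<p>With $W = 0$ every state is equally likely, so for two visible units and one hidden unit each $P(V)$ is $1/4$ and $L(W) = \log(1/4)$ for any data distribution.</p>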
<h3 id="Want-to-show">Want to show<a class="anchor-link" href="#Want-to-show">¶</a></h3>$$\frac{\partial L}{\partial w_{ij}} = \mathbb{E}_{D(V)P(H|V)}[x_ix_j]-\mathbb{E}_{P(V,H)}[x_ix_j]$$<h3 id="Proof">Proof<a class="anchor-link" href="#Proof">¶</a></h3><ol>
<li>Definition of expected value $$L(W)=\mathbb{E}_{D(V)} [\log P(V)] = \sum_V D(V)\log P(V)$$</li>
<li>Let $f = \log P(V)$ $$\frac{\partial L}{\partial w_{ij}} = \sum_V D(V)\frac{\partial f}{\partial w_{ij}}$$</li>
<li>Chain rule $$\frac{\partial f}{\partial w_{ij}} = {\frac{\partial P(V)}{\partial w_{ij}} \over P(V)}$$</li>
<li>Expand $P(V)$ $$\frac{\partial P(V)}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left[\sum_H P(V,H)\right] = \frac{\partial}{\partial w_{ij}}\left[\sum_H {e^{X^TWX/2}\over {\sum_{X'} e^{X'^TWX'/2}}}\right] = \sum_H \frac{\partial}{\partial w_{ij}}\left[{e^{X^TWX/2}\over {\sum_{X'} e^{X'^TWX'/2}}}\right]$$</li>
<li>Quotient rule $$\frac{\partial P(V)}{\partial w_{ij}} =\sum_H \frac{\frac{\partial}{\partial w_{ij}}\left[e^{X^TWX/2}\right]{\sum_{X'} e^{X'^TWX'/2}}-e^{X^TWX/2} \frac{\partial}{\partial w_{ij}}\left[{\sum_{X'} e^{X'^TWX'/2}}\right]}{\left({\sum_{X'} e^{X'^TWX'/2}}\right)^2}$$</li>
<li>Chain rule, and notice $\frac{\partial}{\partial w_{ij}}\left[W\right]$ is $0$ everywhere except at $w_{ij}$ and its symmetric partner $w_{ji}$ (which together cancel the factor of $1/2$), so $$\frac{\partial}{\partial w_{ij}}\left[e^{X^TWX/2}\right] = \frac{\partial}{\partial w_{ij}}\left[X^TWX/2\right] e^{X^TWX/2} = x_ix_je^{X^TWX/2}$$</li>
<li>So #5 becomes $$\frac{\partial P(V)}{\partial w_{ij}} = \sum_H \frac{x_ix_je^{X^TWX/2}{\sum_{X'} e^{X'^TWX'/2}}-e^{X^TWX/2} \sum_{X'}x'_ix'_je^{X'^TWX'/2}}{\left({\sum_{X'} e^{X'^TWX'/2}}\right)^2}$$</li>
<li>Separating terms $$\frac{\partial P(V)}{\partial w_{ij}} = \sum_H\left[\frac{x_ix_je^{X^TWX/2}{\sum_{X'} e^{X'^TWX'/2}}}{\left({\sum_{X'} e^{X'^TWX'/2}}\right)^2}\right]-\sum_H\left[\frac{e^{X^TWX/2} \sum_{X'}x'_ix'_je^{X'^TWX'/2}}{\left({\sum_{X'} e^{X'^TWX'/2}}\right)^2}\right]$$</li>
<li>Cancelling and moving factors outside sums $$\frac{\partial P(V)}{\partial w_{ij}} = \sum_H\left[\frac{x_ix_je^{X^TWX/2}}{{\sum_{X'} e^{X'^TWX'/2}}}\right]-\frac{\sum_H\left[e^{X^TWX/2}\right] \sum_{X'}x'_ix'_je^{X'^TWX'/2}}{\left({\sum_{X'} e^{X'^TWX'/2}}\right)^2}$$</li>
<li>Definition of $P(V,H)$ and $P(V)$ $$\frac{\partial P(V)}{\partial w_{ij}} = \sum_H\left[x_ix_jP(V,H)\right]-P(V) \sum_{X'}\left[x'_ix'_jP(V',H')\right]$$</li>
<li>Substituting #10 into #3 and #3 into #2 we have $$\frac{\partial L}{\partial w_{ij}} = \sum_VD(V)\left[\frac{\sum_H\left[x_ix_jP(V,H)\right]-P(V) \sum_{X'}\left[x'_ix'_jP(V',H')\right]}{P(V)}\right]$$</li>
<li>Separating into two terms and cancelling $P(V)$ in the second $$\frac{\partial L}{\partial w_{ij}} = \sum_V\left[D(V)\sum_H\left[\frac{x_ix_jP(V,H)}{P(V)}\right]\right]-\sum_V\left[D(V)\sum_{X'}\left[x'_ix'_jP(V',H')\right]\right]$$</li>
<li>Definition of conditional probability $$\frac{\partial L}{\partial w_{ij}} = \sum_V\sum_H\left[x_ix_jD(V)P(H|V)\right]-\sum_VD(V)\sum_{X'}\left[x'_ix'_jP(V',H')\right]$$</li>
<li>$\sum_VD(V)=1$, combining sums, and $X=(V,H)$ $$\frac{\partial L}{\partial w_{ij}} =\sum_{(V,H)}\left[x_ix_jD(V)P(H|V)\right]-\sum_{(V',H')}\left[x'_ix'_jP(V',H')\right]$$</li>
<li>Definition of expected value $$\frac{\partial L}{\partial w_{ij}} = \mathbb{E}_{D(V)P(H|V)}[x_ix_j]-\mathbb{E}_{P(V,H)}[x_ix_j]$$ $\square$</li>
</ol>
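<p>As a sanity check on the result, the sketch below compares the closed-form gradient with a central finite difference of $L(W)$ on a tiny machine, using exact enumeration. It rests on assumptions not stated in the proof: units take values in $\{0,1\}$, $W$ has a zero diagonal, $i \neq j$, and $w_{ij}$ and $w_{ji}$ are perturbed together to respect the symmetry. All function names are hypothetical.</p>

```python
import itertools
import numpy as np

def states(n):
    """Every binary vector of length n, one per row (assumes units in {0, 1})."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def joint(W):
    """Return (X, P) with P[k] = exp(x_k^T W x_k / 2) / Z."""
    X = states(W.shape[0])
    unnormalized = np.exp(0.5 * np.sum((X @ W) * X, axis=1))
    return X, unnormalized / unnormalized.sum()

def visible_index(X, n_visible):
    """Integer code of each state's visible part (the first n_visible units)."""
    powers = 2 ** np.arange(n_visible)[::-1]
    return (X[:, :n_visible] @ powers).astype(int)

def objective(W, D, n_visible):
    """L(W) = sum_V D(V) log P(V), with D indexed by the visible code."""
    X, P = joint(W)
    P_V = np.bincount(visible_index(X, n_visible), weights=P, minlength=len(D))
    return float(np.sum(D * np.log(P_V)))

def analytic_gradient(W, D, n_visible, i, j):
    """E_{D(V)P(H|V)}[x_i x_j] - E_{P(V,H)}[x_i x_j], valid for i != j."""
    X, P = joint(W)
    idx = visible_index(X, n_visible)
    P_V = np.bincount(idx, weights=P, minlength=len(D))
    xixj = X[:, i] * X[:, j]
    clamped = np.sum(D[idx] * (P / P_V[idx]) * xixj)  # data-driven expectation
    free = np.sum(P * xixj)                            # model expectation
    return float(clamped - free)

def numeric_gradient(W, D, n_visible, i, j, eps=1e-6):
    """Central difference, moving w_ij and w_ji together to keep W symmetric."""
    bump = np.zeros_like(W)
    bump[i, j] = bump[j, i] = eps
    upper = objective(W + bump, D, n_visible)
    lower = objective(W - bump, D, n_visible)
    return (upper - lower) / (2 * eps)

rng = np.random.default_rng(0)
n, n_visible = 4, 2
A = rng.normal(size=(n, n))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
D = rng.dirichlet(np.ones(2 ** n_visible))  # arbitrary data distribution over V
g_analytic = analytic_gradient(W, D, n_visible, 0, 3)
g_numeric = numeric_gradient(W, D, n_visible, 0, 3)
```

<p>The two values should agree to within finite-difference error for any pair with $i \neq j$, whether the units are visible or hidden.</p>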
</div>
</div>
</div>
Inaugural Post2019-02-16T00:00:00-05:002019-02-16T00:00:00-05:00Daniel Coxtag:computable.ai,2019-02-16:/articles/2019/Feb/16/inaugural-post.html<p>The purpose statement and introduction to Computable AI.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This post begins the Computable AI blog, a machine intelligence blog from a handful of DRL practitioners, intended to crystallize, internalize, share, and explain what we learn.</p>
<p>I found few beginner resources for DRL when I began, and since I have a passion for teaching, this seemed a likely area in which to make a dent.</p>
<p>I also serve as the "Director of Applied Sciences" for a startup software company, and the AI team must occasionally indoctrinate new members. This provides us with a convenient target audience, as well as an expanding pool of co-authors.</p>
<p>Finally, my own education in DRL is incomplete, so this will serve partly as a record of my own journey.</p>
<p>I hope it helps you.</p>
</div>
</div>
</div>