Computable AI - arXiv highlightshttps://computable.ai/2019-11-03T00:00:00-04:00A Machine Intelligence BlogCox's Theorem: Establishing Probability Theory2019-11-03T00:00:00-04:002019-11-03T00:00:00-04:00Daniel Coxtag:computable.ai,2019-11-03:/articles/2019/Nov/03/coxs-theorem-establishing-probability-theory.html<p>Cox's theorem is the strongest argument for the use of standard probability theory. Here we examine the axioms to establish a firm foundation for the interpretation of probability theory as the unique extension of true-false logic to degrees of belief.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Ranging-farther-afield">Ranging farther afield<a class="anchor-link" href="#Ranging-farther-afield">¶</a></h1><p>Today I'll be taking advantage of my stated intention to pull back from the stream of <em>recent</em> papers, and look at some papers for their impact or fundamental importance as I see it. So today I'm doing something unusual, highlighting a paper not from last week, but from <em>four years</em> ago, and not directly from AI, but from the field of probability theory: <a href="https://arxiv.org/abs/1507.06597">Cox's Theorem and the Jaynesian Interpretation of Probability</a>.</p>
<p>I've been reading a book by E. T. Jaynes, called <a href="https://www.amazon.com/Probability-Theory-Science-T-Jaynes/dp/0521592712">Probability Theory: The Logic of Science</a>, a brilliant and practical exposition of the Bayesian view of probability theory, partially on <a href="https://www.lesswrong.com/posts/kXSETKZ3X9oidMozA/the-level-above-mine">the recommendation of another AI researcher</a>. The thoughts of an ideal reasoner would have Bayesian structure, so I am both personally and professionally interested in mastering the concepts.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overview">Overview<a class="anchor-link" href="#Overview">¶</a></h1><p>Cox's theorem is an attempt to derive probability theory from a small, common-sense set of uncontroversial desiderata, and to demonstrate its uniqueness as an extension of two-valued (true/false) logic to degrees of belief. That's a big deal. As today's paper mentions, Peter Cheeseman <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.1988.tb00091.x">has called</a> Cox's theorem the "strongest argument for the use of standard (Bayesian) probability theory". But Cox's theorem is non-rigorous as originally formulated, and many people have patched up the holes for use in their various fields. Today, when someone refers to "Cox's theorem", they usually mean one of the fixed-up versions.</p>
<p>Jaynes' version unfortunately contains a mistake, and today's paper fixes it by replacing some of the axioms with the simple requirement that probability theory remain consistent with respect to repeated events.</p>
<p>It may be difficult without reading the book to see why this paper is important to AI, so perhaps in the near future I'll discuss that at greater length. For today, however, I'll simply be explaining each of the axioms, and setting you up to read the paper more easily. It is certainly worth a close reading, to ground your confidence in the interpretation of probability theory as a <em>logical system</em> that extends true-false logic to handle uncertainty, so you can reap the associated benefits.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Abstract">Abstract<a class="anchor-link" href="#Abstract">¶</a></h1><blockquote><p>There are multiple proposed interpretations of probability theory: one such interpretation is true-false logic under uncertainty. Cox's Theorem is a representation theorem that states, under a certain set of axioms describing the meaning of uncertainty, that every true-false logic under uncertainty is isomorphic to conditional probability theory. This result was used by Jaynes to develop a philosophical framework in which statistical inference under uncertainty should be conducted through the use of probability, via Bayes' Rule. Unfortunately, most existing correct proofs of Cox's Theorem require restrictive assumptions: for instance, many do not apply even to the simple example of rolling a pair of fair dice. We offer a new axiomatization by replacing various technical conditions with an axiom stating that our theory must be consistent with respect to repeated events. We discuss the implications of our results, both for the philosophy of probability and for the philosophy of statistics.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Axioms-$\newcommand{\P}{\mathbb{P}}-\newcommand{\F}{\mathscr{F}}$">Axioms $\newcommand{\P}{\mathbb{P}} \newcommand{\F}{\mathscr{F}}$<a class="anchor-link" href="#Axioms-$\newcommand{\P}{\mathbb{P}}-\newcommand{\F}{\mathscr{F}}$">¶</a></h1><p>This paper proposes a new axiomatization of probability theory, with five axioms. As a variant of Cox's theorem, these axioms are supposed to represent a set of "common sense" desiderata for a logical system under uncertainty. That is, each of these axioms is something we naturally want to be true of any logical system under uncertainty. Cox's original axioms were more intuitively essential to me, however, so I'll also try to give justifications for demanding each of the following axioms, as well as explaining them technically.</p>
<p>Remember the ultimate goal is to <em>build</em> probability theory up from a minimal set of absolute requirements for <em>any</em> logical system. The punchline is that probability theory as described historically by greats like Kolmogorov turns out to be the <em>unique</em> extension of true-false logic under uncertainty, and we can derive it from "common sense".</p>
<p>To emphasize the point that while we're writing these axioms we haven't yet got <em>probability</em>, following Jaynes I'll refer to our measure of certainty/uncertainty as "plausibility".</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="1.-Plausibility-must-be-representable-by-a-real-number">1. Plausibility must be representable by a real number<a class="anchor-link" href="#1.-Plausibility-must-be-representable-by-a-real-number">¶</a></h2><blockquote><p>Let $\Omega$ be a set and $\mathscr{F}$ be a $\sigma$-algebra on $\Omega$.</p>
<p>Let $\P: \F \times (\F \setminus \emptyset) \rightarrow R \subseteq \mathbb{R}$ be a function, written using notation $\P(A|B)$.</p>
</blockquote>
<p>It makes intuitive sense that we should be able to measure our uncertainty on a smooth, finite scale, so it makes sense to demand that our plausibility scale be chosen from some definite subset of the reals.</p>
<p>$\F$ being "<a href="https://en.wikipedia.org/wiki/Sigma-algebra">a $\sigma$-algebra on $\Omega$</a>" means that it is a collection of subsets of $\Omega$ that contains $\Omega$ itself (and therefore $\emptyset$), is closed under complement, and is closed under countable unions. It does not have to contain <em>every</em> subset of $\Omega$. (Being "closed under" some operation means that applying that operation to elements of the set yields an element that is also in the set.) The idea is that $\Omega$ comprises all primitive events, and $\F$ contains the logical combinations of these primitive events that we want to assign plausibilities to, in a way that makes it equivalent to a Boolean algebra.</p>
<p>I found it clarifying that $\P(\Omega)=1$. That's what made it click for me that a set in $\F$ represents a disjunction of primitive events, and $\Omega$ contains <em>all</em> primitive events, so $\P(\Omega)$ is the probability that <em>anything</em> happens.</p>
<p>$\P(A|B)$ is a function of two arguments $A,B \in \mathscr{F}$, and $B$ cannot be empty. The interpretation is, "The probability of some event $A$, given that event $B$ is true." Jaynes often describes the second argument as "the background information", including everything else known (such as the rules of probability themselves, and the number of penguins in Antarctica).</p>
<p>The arguments of $\P$ are sets, but as the paper mentions, "by <a href="https://www.jstor.org/stable/1989664">Stone's Representation Theorem</a>, every Boolean algebra is isomorphic to an algebra of sets".</p>
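<p>The closure properties are easy to check mechanically when $\Omega$ is finite. The following is a minimal Python sketch, not taken from the paper; the helper names <code>power_set</code> and <code>is_sigma_algebra</code> and the example families are assumptions made for illustration. On a finite $\Omega$, closure under pairwise unions is enough to give closure under countable unions.</p>

```python
from itertools import chain, combinations

def power_set(omega):
    """Every subset of omega, as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(family, omega):
    """Check the defining properties of a sigma-algebra on a finite omega."""
    omega = frozenset(omega)
    if omega not in family:
        return False
    closed_under_complement = all(omega - a in family for a in family)
    closed_under_union = all(a | b in family for a in family for b in family)
    return closed_under_complement and closed_under_union

omega = {1, 2, 3, 4}
full = power_set(omega)  # all 16 subsets
coarse = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(omega)}
broken = {frozenset(), frozenset({1}), frozenset(omega)}  # lacks the complement {2, 3, 4}

print(is_sigma_algebra(full, omega))    # True
print(is_sigma_algebra(coarse, omega))  # True
print(is_sigma_algebra(broken, omega))  # False
```

<p>Both the full power set and the coarser four-element family pass, while the family that is missing a complement fails.</p>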
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="2.-Sequential-continuity">2. Sequential continuity<a class="anchor-link" href="#2.-Sequential-continuity">¶</a></h2><blockquote><p>We have that
$$A_1 \subseteq A_2 \subseteq A_3 \subseteq\ldots \text{ such that } A_i \nearrow A \text{ implies } \P (A_i | B)\nearrow \P(A | B )$$
for all $A, A_i, B$.</p>
</blockquote>
<p>Another intuitive requirement for a system of logical inference is that our plausibility measure return arbitrarily small differences in plausibility for arbitrarily small changes in truth value. This concept is also known as "continuity".</p>
<p>If you can arrange a sequence of events (sets) so that earlier events (e.g., $A_1$) are included in later events (e.g., $A_3$), the sequence is increasing. In the notation of the paper, $A_i \nearrow A$ means that such an increasing sequence converges to $A$, the union of all the $A_i$.</p>
<p>What this axiom is saying is that whenever a sequence of logical propositions increases to a limit in this way, their plausibilities also increase to the plausibility of the limit. This formalizes our requirement for continuity. Also notice that if $\P (A_i | B)\nearrow \mathbb{P}(A | B )$ then $\P (A_i | B) \leq \mathbb{P}(A | B )$ for every $i$, because a non-decreasing sequence never exceeds its limit. This will be useful when reading the proof.</p>
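<p>A small numerical sketch may help here. It reads $A_i \nearrow A$ as an increasing sequence of sets whose union is $A$, and it uses ordinary probability in place of an abstract plausibility. The measure $\P(\{k\}) = 2^{-k}$ on the positive integers, the choice $A_i = \{1, \ldots, i\}$, and the trivial background $B = \Omega$ are all assumptions made for illustration.</p>

```python
from fractions import Fraction

def plausibility_of_first(i):
    """P(A_i) for A_i = {1, ..., i}, under the assumed measure P({k}) = 2**-k."""
    return sum(Fraction(1, 2**k) for k in range(1, i + 1))

values = [plausibility_of_first(i) for i in range(1, 11)]
limit = Fraction(1)  # P(A), where A is the union of all A_i (every positive integer)

nondecreasing = values == sorted(values)  # P(A_i) never goes down as i grows
bounded = limit >= max(values)            # and never exceeds P(A)
gaps = [limit - v for v in values]        # the gap to the limit is 2**-i

print(nondecreasing, bounded, gaps[-1])
```

<p>The gap between $\P(A_i)$ and $\P(A)$ halves at every step, so the plausibilities approach the limit from below as the sets grow toward $A$.</p>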
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="3.-Decomposability">3. Decomposability<a class="anchor-link" href="#3.-Decomposability">¶</a></h2><blockquote><p>$\P(AB | C )$ can be written as
$$\P(A | C ) \circ \P(B | AC)$$
for some function $\circ : (R \times R) \rightarrow R$.</p>
</blockquote>
<p>This is the first axiom that I had trouble seeing as intuitive, and in fact I thought it was a bit question-begging at first because it looks like the product rule. It represents the demand that plausibilities of compound propositions be decomposable into plausibilities of their constituents, and that the decomposition has a particular form. It's the demand that it follow a particular form that seemed somewhat arbitrary to me at first. Of course we would want to be able to decompose compound uncertainty into more fundamental elements, or else probability theory wouldn't be very useful. But why should it take the form described of $\circ$?</p>
<p>The answer is that this form is <em>minimal</em> for decomposability. That is, it's the weakest statement that could be made about the details of decomposition. In English: "The plausibility of A <em>and</em> B is a function of the plausibility of one of those (say, $A$), and the plausibility of the other ($B$) once we can assume $A$ is true."</p>
<p>Note that logical conjunctions are commutative ($AB = BA$), so by this axiom $\P(AB | C )$ can <em>also</em> be written as $\P(B | C ) \circ \P(A | BC)$. They prove later also that $\circ$ is commutative, but that is not assumed in the axioms.</p>
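<p>To see a concrete instance, the sketch below takes $\circ$ to be ordinary multiplication, which is what standard probability theory ends up using, and checks both decompositions by counting outcomes of two fair dice. The events $A$, $B$, $C$ and the helper <code>P</code> are assumptions made for illustration, not examples from the paper.</p>

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # all 36 outcomes of two dice

def P(a, b):
    """P(a | b) by counting equally likely outcomes; b must be non-empty."""
    return Fraction(len(a & b), len(b))

A = {w for w in omega if w[0] % 2 == 0}     # first die is even
B = {w for w in omega if w[0] + w[1] >= 9}  # total is at least 9
C = {w for w in omega if w[1] >= 3}         # background: second die shows 3 or more

lhs = P(A & B, C)               # P(AB | C)
rhs_ab = P(A, C) * P(B, A & C)  # P(A | C) o P(B | AC), with o as multiplication
rhs_ba = P(B, C) * P(A, B & C)  # the commuted form P(B | C) o P(A | BC)

print(lhs, rhs_ab, rhs_ba)
```

<p>All three values agree, even though the individual factors differ between the two decompositions.</p>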
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="4.-Negation">4. Negation<a class="anchor-link" href="#4.-Negation">¶</a></h2><blockquote><p>There exists a function $N : R \rightarrow R$ such that
$$
\P(A^c | B)= N[ \P(A | B)]
$$
for all $A,B$.</p>
</blockquote>
<p>This axiom also seemed a bit question-begging to me, because it looks like the sum rule of probability theory, and because it seemed arbitrary that you would want uniquely determined probabilities for the negations of propositions.</p>
<p>Upon further reflection, however, this seems like a reasonable demand to be consistent with two-valued logic. Every proposition $A$ in true-false logic has a unique proposition $A^c$ representing its negation, so it makes sense that an extension of true-false logic to uncertainty would also include a method of determining the plausibility of the opposite. (This superscript complement notation emphasizes the representation of propositions as sets, but is equivalent to $\bar A$, $\neg A$, etc.)</p>
<p>In actual fact, this <em>may</em> be the most controversial axiom, since there are logics other than true-false logic that don't require the "law of the excluded middle" (they allow "maybe"). But if you are willing to accept that all well-formed propositions are either true or false, and our system of plausibility represents levels of certainty about their truth or falsehood, then this axiom represents a reasonable and necessary demand.</p>
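<p>In standard probability theory the negation function turns out to be $N(x) = 1 - x$. The sketch below assumes that form and checks $\P(A^c | B) = N[\P(A | B)]$ exhaustively for one fair die, over every event $A$ and every non-empty $B$. The helpers <code>subsets</code>, <code>P</code>, and <code>N</code> are illustrative names, not the paper's.</p>

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset(range(1, 7))  # outcomes of one fair die

def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    combos = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in combos]

def P(a, b):
    """P(a | b) by counting equally likely outcomes; b must be non-empty."""
    return Fraction(len(a & b), len(b))

def N(x):
    """Assumed negation function: the one standard probability theory uses."""
    return 1 - x

events = subsets(omega)
negation_holds = all(
    P(omega - a, b) == N(P(a, b))
    for a in events
    for b in events
    if b  # the conditioning event cannot be empty
)
print(negation_holds)
```

<p>Note that this $N$ is its own inverse, $N(N(x)) = x$, which matches the fact that negating a proposition twice returns the original proposition.</p>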
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="5.-Consistency-under-extension">5. Consistency under extension<a class="anchor-link" href="#5.-Consistency-under-extension">¶</a></h2><blockquote><p>If $(\Omega, \mathscr{F}, \P)$ satisfies the axioms above, then $(\Omega \times \Omega, \mathscr{F} \otimes \mathscr{F}, \P \operatorname{\circ} \P)$ must as well, i.e., the definition $\P(A \times B | C \times D) = \P(A | C) \circ \P(B | D)$ is consistent.</p>
</blockquote>
<p>This axiom represents the core of the authors' contribution. Although there were many correct variants of Cox's theorem, and many ways to axiomatize probability theory, they all had either disappointingly narrow scope, or had lost their intuitive nature in the formalization. The authors of our paper replace several technical axioms from other axiomatizations with this one demand <em>that their rules be consistent under extension to repeated events</em>.</p>
<p>In English, this axiom is, "If the rules apply to a single trial (e.g., a single coinflip), then they also apply to a system of two independent trials (e.g., two coinflips)." To me, that's obviously intuitive, so it's delightful to find that it covers so much ground.</p>
<p>Examining their formal expression, with the coinflips example, with $A$ meaning "heads on the first coinflip" and $B$ meaning "tails on the second coinflip":</p>
<p>$\P(A \times B | C \times D)$ means "the plausibility of heads-then-tails given two piles of background information $C$ and $D$". The axiom states this must equal $\P(A | C) \circ \P(B | D)$, meaning that the plausibility of a pair of coinflips coming up heads-tails is equal to the plausibility of a single coinflip coming up heads (given background information $C$), composed (using $\circ$) with another coinflip coming up tails (given background information $D$).</p>
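<p>The coinflip reading can be checked by brute force. The sketch below again assumes that $\circ$ is multiplication and that each flip is fair, builds the product events as sets of ordered pairs, and compares both sides of the axiom for every choice of $A$, $B$ and non-empty $C$, $D$. The helper names <code>P</code> and <code>times</code> are hypothetical.</p>

```python
from fractions import Fraction
from itertools import product

coin = frozenset({"H", "T"})
events = [frozenset(), frozenset({"H"}), frozenset({"T"}), coin]

def P(a, b):
    """P(a | b) by counting equally likely outcomes; b must be non-empty."""
    return Fraction(len(a & b), len(b))

def times(a, b):
    """Cartesian product of two events, as a set of ordered pairs."""
    return frozenset(product(a, b))

A, B = frozenset({"H"}), frozenset({"T"})
heads_then_tails = P(times(A, B), times(coin, coin))  # uninformative C and D

consistent = all(
    P(times(a, b), times(c, d)) == P(a, c) * P(b, d)
    for a in events for b in events
    for c in events if c
    for d in events if d
)
print(heads_then_tails, consistent)
```

<p>The check passes because $(A \times B) \cap (C \times D) = (A \cap C) \times (B \cap D)$, so the counts on the product space factor into the counts for each flip.</p>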
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I hope this exposition of the axioms helps you read the paper yourself, though I realize I may not have provided sufficient motivation to do so yet. That would make it a bit like <a href="https://computable.ai/articles/2019/Mar/10/boltzmann-machines-differentiation-work.html">my post deriving something surprising about Boltzmann machines</a> without first explaining what Boltzmann machines <em>are</em>. I intend to rectify this in the future for both posts.</li>
<li>I could make this a lot clearer for people with less set theory, group theory, or probability theory background. If that would be helpful to you, please leave me a comment on what specifically didn't make sense so I can get a feel for my audience.</li>
<li>To memorize these and make reading the proof easier, I labeled each of the five axioms with some relevant symbol, and combined them into a mnemonic. In case that helps you too, here it is: $\mathbb{R}$ $\nearrow$ $\circ$ $N$ $\times$.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Comments on Eight Abstracts2019-10-06T00:00:00-04:002019-10-06T00:00:00-04:00Daniel Coxtag:computable.ai,2019-10-06:/articles/2019/Oct/06/comments-on-eight-abstracts.html<p>An unfocused sweep of eight abstracts from a very busy week in AI research: Emergent tool use, why hierarchical learning can work so well, brain-inspired hardware for artificial neural networks, pretraining and transfer learning for RL, chromatic network compression, semi-supervised reward shaping, WGAN model imitation for model-based RL, and navigation in turbulent flows!</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-(last)-week">This (last) week<a class="anchor-link" href="#This-(last)-week">¶</a></h1><p>Alas, I bit off more than I could chew last week. You'll see what I mean in a moment. However, I've decided to define the problem away, as part of an effort to more effectively juggle all of my life responsibilities:</p>
<p><strong>ArXiv Highlights will be bi-weekly</strong> from here on out. I'm also going to be a <em>little</em> less strict about when I sample papers from, so that I don't feel so constrained to do "last week's" arXiv announcements. The attentive reader may have noticed that I've already occasionally sampled from outside of the week's announcements, and I'd actually prefer to do that more often so that I can hit <em>key</em> papers instead of just <em>new</em> papers.</p>
<hr>
<p>I couldn't just pick one paper last week, since so many seemed relevant and interesting. Therefore I'm experimenting with yet another format for arXiv highlights: posting all the abstracts, and commenting a bit on each one. The goal is to work each of these concepts into my memory (and yours) so that they'll spring to mind when we need them.</p>
<p>In arXiv announcement order:</p>
<ol>
<li><a href="https://arxiv.org/abs/1909.07528v1">Emergent Tool Use From Multi-Agent Autocurricula</a></li>
<li><a href="https://arxiv.org/abs/1909.10618v1">Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?</a></li>
<li><a href="https://arxiv.org/abs/1909.11145v1">Brain-Inspired Hardware for Artificial Intelligence: Accelerated Learning in a Physical-Model Spiking Neural Network</a></li>
<li><a href="https://arxiv.org/abs/1909.11373v1">Pre-training as Batch Meta Reinforcement Learning with tiMe</a></li>
<li><a href="https://arxiv.org/abs/1907.06511v2">Reinforcement Learning with Chromatic Networks</a></li>
<li><a href="https://arxiv.org/abs/1907.08225v2">Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery</a></li>
<li><a href="https://arxiv.org/abs/1909.11821v1">Model Imitation for Model-Based Reinforcement Learning</a></li>
<li><a href="https://arxiv.org/abs/1907.08591v2">Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning</a></li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="1.-Emergent-Tool-Use-From-Multi-Agent-Autocurricula">1. Emergent Tool Use From Multi-Agent Autocurricula<a class="anchor-link" href="#1.-Emergent-Tool-Use-From-Multi-Agent-Autocurricula">¶</a></h1><blockquote><p>Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1909.07528v1">https://arxiv.org/abs/1909.07528v1</a></p>
<p>I notice that OpenAI, DeepMind, and Google Brain are involved in a lot of the interesting work in reinforcement learning lately, and many of the papers that catch my eye have at least some authors from one of these organizations. I'm an aspiring Bayesian, so it wasn't <em>too</em> long before I started reading papers <em>because</em> they were authored by one of these organizations.</p>
<p>Anyway, the term "autocurriculum" seems to come from <a href="https://arxiv.org/abs/1903.00742">this DeepMind paper</a>:</p>
<blockquote><p>Here we explore the hypothesis that multi-agent systems sometimes display intrinsic dynamics arising from competition and cooperation that provide a naturally emergent curriculum, which we term an autocurriculum.</p>
</blockquote>
<p>This gives me a word for something I've observed about my young son: The activities he's naturally inclined to engage in at each stage of his development seem uncannily well-suited for teaching him the <em>next</em> thing he should learn. Wanting to put things in his mouth, plus a capacity for boredom, motivated him to develop reaching and grabbing, then crawling, then pathfinding, then complex navigation...</p>
<p>In the case of the OpenAI paper, putting multiple adversarial agents into complex environments and allowing them to learn causes them to learn new behaviors <em>in phases</em>.</p>
<blockquote><p>We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt</p>
</blockquote>
<p>Each time a team of agents learns a dominant strategy, the opposing team is pressured to develop a strategy capable of defeating it, which then pressures the first team to come up with <em>another</em> strategy to defeat <em>that</em> one, and on and on until a truly dominant strategy emerges.</p>
<p>My takeaway from this is that emergent autocurricula may be another good reason for me to study multi-agent systems.</p>
<p>This paper comes with a nice blog post and a cute video: <a href="https://openai.com/blog/emergent-tool-use/">https://openai.com/blog/emergent-tool-use/</a></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="2.-Why-Does-Hierarchy-(Sometimes)-Work-So-Well-in-Reinforcement-Learning?">2. Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?<a class="anchor-link" href="#2.-Why-Does-Hierarchy-(Sometimes)-Work-So-Well-in-Reinforcement-Learning?">¶</a></h1><blockquote><p>Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others. However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard "shallow" RL architectures. In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation. Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures. Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1909.10618v1">https://arxiv.org/abs/1909.10618v1</a></p>
<p>The big finding here is that "most of the observed benefits of hierarchy can be attributed to improved exploration".</p>
<p>This is not the first time I've heard that a complicated technique in RL has been studied and found to boil down to better exploration or more even coverage of the state space. <a href="https://arxiv.org/abs/1902.10250">Diagnosing Bottlenecks in Deep Q-learning Algorithms</a> contained a similar revelation about replay buffer sampling, for example, and I get the impression that the maximum entropy RL framework seems to be overtaking more ad hoc trust region methods such as PPO. Anyway, this is why theory is important even to mere industry practitioners. As theory catches up to practice, we learn <em>why</em> things work, and the answers are often surprising and useful.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="3.-Brain-Inspired-Hardware-for-Artificial-Intelligence:-Accelerated-Learning-in-a-Physical-Model-Spiking-Neural-Network">3. Brain-Inspired Hardware for Artificial Intelligence: Accelerated Learning in a Physical-Model Spiking Neural Network<a class="anchor-link" href="#3.-Brain-Inspired-Hardware-for-Artificial-Intelligence:-Accelerated-Learning-in-a-Physical-Model-Spiking-Neural-Network">¶</a></h1><blockquote><p>Future developments in artificial intelligence will profit from the existence of novel, non-traditional substrates for brain-inspired computing. Neuromorphic computers aim to provide such a substrate that reproduces the brain's capabilities in terms of adaptive, low-power information processing. We present results from a prototype chip of the BrainScaleS-2 mixed-signal neuromorphic system that adopts a physical-model approach with a 1000-fold acceleration of spiking neural network dynamics relative to biological real time. Using the embedded plasticity processor, we both simulate the Pong arcade video game and implement a local plasticity rule that enables reinforcement learning, allowing the on-chip neural network to learn to play the game. The experiment demonstrates key aspects of the employed approach, such as accelerated and flexible learning, high energy efficiency and resilience to noise.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1909.11145v1">https://arxiv.org/abs/1909.11145v1</a></p>
<p>This paper was presented at ICANN 2019, and published in Lecture Notes in Computer Science. In case it's unclear what's going on here: The authors built a small-scale prototype (32 neurons, 32 synapses each) of an apparently <em>analog</em> hardware simulation of a biological learning model of the brain (<a href="https://en.wikipedia.org/wiki/Spike-timing-dependent_plasticity">STDP</a>). They then used it to a) simulate a simplified Pong (on-chip), and b) successfully learn to play using reinforcement learning (again, on-chip). Emulating their own system on an Intel i7-4771 was an order of magnitude slower, so we're talking about a real improvement. This is an auspicious beginning, and they hint at scaled-up work to come.</p>
<p>I look forward to specialized neuronal hardware. I'm especially interested to hear that they simulated actual neurons to some degree, with spike-timing dependence, rather than the simplified model that I'm used to working with. I expect this means they intend to simulate actual brains at some point. Stay tuned.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="4.-Pre-training-as-Batch-Meta-Reinforcement-Learning-with-tiMe">4. Pre-training as Batch Meta Reinforcement Learning with tiMe<a class="anchor-link" href="#4.-Pre-training-as-Batch-Meta-Reinforcement-Learning-with-tiMe">¶</a></h1><blockquote><p>Pre-training is transformative in supervised learning: a large network trained with large and existing datasets can be used as an initialization when learning a new task. Such initialization speeds up convergence and leads to higher performance. In this paper, we seek to understand what the formalization for pre-training from only existing and observational data in Reinforcement Learning (RL) is and whether it is possible. We formulate the setting as Batch Meta Reinforcement Learning. We identify MDP mis-identification to be a central challenge and motivate it with theoretical analysis. Combining ideas from Batch RL and Meta RL, we propose tiMe, which learns distillation of multiple value functions and MDP embeddings from only existing data. In challenging control tasks and without fine-tuning on unseen MDPs, tiMe is competitive with state-of-the-art model-free RL method trained with hundreds of thousands of environment interactions.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1909.11373v1">https://arxiv.org/abs/1909.11373v1</a></p>
<p>This paper attempts to bring the benefits of pretraining (on some pre-recorded batch) to reinforcement learning. This is non-trivial, since Q-learning algorithms are known to be unstable on batches produced by "foreign policy" (my phrase).</p>
<blockquote><p>The value function diverges if Q fails to accurately estimate the value of $\pi(s')$</p>
</blockquote>
<p>This is mitigated in online Q-learning algorithms because the contents of the replay buffer, while produced by an off-policy algorithm, were at least produced through interaction with the environment, and so the distribution of the induced $\pi$ doesn't deviate too much from the distribution in the replay buffer. Even then, this phenomenon is still a source of instability for Q-learning.</p>
<p>In batch learning, the problem is worse. The recorded batch was <em>not</em> produced by our induced policy, and perhaps not even by a <em>single</em> policy. Further, the environment reflected in the batch may not even have been produced by a single Markov decision process.</p>
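<p>To make the problem concrete, here is a minimal tabular sketch. It is my own toy example, not anything from the paper: the two-state MDP, the function name <code>q_update</code>, and the optimistic initial value of 100 are all assumptions chosen for illustration. The recorded batch never contains action 1 in state <code>s1</code>, so its Q-value is never corrected, yet every bootstrapped target still reads it through the max.</p>

```python
# Toy illustration (hypothetical, not from the paper): Q-learning on a fixed
# batch bootstraps from max_a Q(s', a), including actions the batch never tried.

GAMMA = 0.9  # discount factor
ALPHA = 0.5  # learning rate


def q_update(q, transition):
    """Apply one tabular Q-learning update for a (s, a, r, s_next) transition."""
    s, a, r, s_next = transition
    target = r + GAMMA * max(q[s_next].values())
    q[s][a] += ALPHA * (target - q[s][a])


# Q-table with an arbitrary optimistic guess for the unseen pair ("s1", 1).
q = {
    "s0": {0: 0.0, 1: 0.0},
    "s1": {0: 0.0, 1: 100.0},  # never appears in the batch, so never corrected
}

# Recorded batch: the behavior policy only ever took action 0.
batch = [("s0", 0, 0.0, "s1"), ("s1", 0, 1.0, "s1")]

for _ in range(50):
    for transition in batch:
        q_update(q, transition)

unseen_estimate = q["s1"][1]  # still the initial guess
seen_estimate = q["s0"][0]    # pulled toward GAMMA * unseen_estimate
```

<p>Following the recorded behavior, the return from <code>s0</code> would be 9, but the estimate settles near 90 because the unvisited action's guess leaks into every target. Online interaction would eventually try that action and correct the guess; a fixed batch cannot.</p>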
<p>I'm interested in <em>this</em> paper because the authors apply meta RL to the problem, and claim to achieve good performance on unseen MDPs sampled from the same family as those represented by the training batch. If that's so, it has positive implications for my own work.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="5.-Reinforcement-Learning-with-Chromatic-Networks">5. Reinforcement Learning with Chromatic Networks<a class="anchor-link" href="#5.-Reinforcement-Learning-with-Chromatic-Networks">¶</a></h1><blockquote><p>We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as 17 weight parameters, providing >90% compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1907.06511v2">https://arxiv.org/abs/1907.06511v2</a></p>
<p>From the introduction:</p>
<blockquote><p>The main question we tackle in this paper is the following:</p>
<p>Are high dimensional architectures necessary for encoding efficient policies and if not, how compact can they be in practice?</p>
</blockquote>
<p>More compact architectures not only take less space, but also produce inferences more quickly and cheaply. This matters to me because my work is often done on cloud computing infrastructure, which incentivizes parsimony. I'm also professionally interested in neural architecture search for multi-task scaling purposes. More on this later, perhaps.</p>
<p>The authors find compact policies by jointly optimizing the RL objective and "the combinatorial nature of the network’s parameter sharing profile". Inspired by two <a href="https://arxiv.org/abs/1804.02395">other</a> <a href="https://arxiv.org/abs/1906.04358">papers</a>, they reduce the number of distinct weights by <em>sharing</em> a single weight between multiple neuronal connections. The first paper from which their inspiration for this arises used <a href="https://en.wikipedia.org/wiki/Toeplitz_matrix">Toeplitz matrices</a> to represent the neural network, and the second randomly assigns weights (Weight-Agnostic Neural Networks, or WANNs) and then learns the connection topology to maximize an RL goal.</p>
<blockquote><p>WANNs replace conceptually simple feedforward networks with general graph topologies using NEAT algorithm providing topological operators to build the network.</p>
<p>Our approach is a middle ground, where the topology is still a feedforward neural network, but the weights are partitioned into groups that are being learned in a combinatorial fashion using reinforcement learning. While <a href="https://arxiv.org/abs/1504.04788">10</a> shares weights randomly via hashing, we learn a good partitioning mechanisms for weight sharing.</p>
</blockquote>
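<p>The weight-sharing idea is easy to picture in code. What follows is a minimal sketch under my own assumptions, not the authors' implementation: the names <code>build_weight_matrix</code> and <code>linear_layer</code> are hypothetical, and the coloring here is fixed by hand, whereas the paper learns it. Each edge of a feedforward layer is assigned a class index (its "color"), and all edges with the same color read the same trainable scalar.</p>

```python
# Hypothetical sketch of weight sharing via an edge coloring (not the authors' code).


def build_weight_matrix(coloring, shared_weights):
    """Expand a coloring (matrix of class indices) into a full weight matrix."""
    return [[shared_weights[color] for color in row] for row in coloring]


def linear_layer(inputs, weight_matrix):
    """Compute outputs[j] = sum_i inputs[i] * weight_matrix[i][j]."""
    n_out = len(weight_matrix[0])
    return [
        sum(x * row[j] for x, row in zip(inputs, weight_matrix))
        for j in range(n_out)
    ]


# A 3-input, 2-output layer has 6 edges, but only 2 distinct weights here.
coloring = [
    [0, 1],
    [1, 0],
    [0, 0],
]
shared_weights = [0.5, -1.0]  # the only trainable parameters in this layer

weights = build_weight_matrix(coloring, shared_weights)
outputs = linear_layer([1.0, 2.0, 3.0], weights)
```

<p>In this picture, the search over colorings is the combinatorial part of the problem, and the handful of shared scalars is the part left to the RL optimizer.</p>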
<p>How do they do this?</p>
<blockquote><p>We leverage recent advances in the ENAS (Efficient Neural Architecture Search) literature and theory of pointer networks to optimize over the combinatorial component of this objective and state of the art evolution strategies (ES) methods to optimize over the RL objective.</p>
<p>Our key observation is that ENAS and ES can naturally be combined in a highly scalable but conceptually simple way.</p>
</blockquote>
<p>Ah. So... <em>how</em> do they do this?</p>
<p>We'll both just have to read the whole paper. In my light read, I notice this one is so full of interesting insights and pointers to important results from other research that it's worth our time. Basically though, they alternate between neural architecture search and RL optimization, using their own ENAS variant to optimize a pointer network capable of partitioning weights to be shared.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="6.-Dynamical-Distance-Learning-for-Semi-Supervised-and-Unsupervised-Skill-Discovery">6. Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery<a class="anchor-link" href="#6.-Dynamical-Distance-Learning-for-Semi-Supervised-and-Unsupervised-Skill-Discovery">¶</a></h1><blockquote><p>Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: <a href="https://sites.google.com/view/dynamical-distance-learning">https://sites.google.com/view/dynamical-distance-learning</a></p>
</blockquote>
<p><a href="https://arxiv.org/abs/1907.08225v2">https://arxiv.org/abs/1907.08225v2</a></p>
<p>I picked up this paper partly because reward shaping is currently of professional interest to me, but also because I'm watching Haarnoja for his work on distributional RL.</p>
<p>This paper is about making reward-shaping easier by learning a more direct distance measure for the purpose. In general, if you know your distance from a goal, there are many optimization methods available to you for reducing that distance and achieving your goal. The better this distance measure, the smoother the landscape, and the more quickly you arrive.</p>
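<p>As a rough illustration of why a distance makes a good shaped reward, here is a toy sketch. Everything in it is my own assumption: the paper <em>learns</em> its dynamical distances from experience, whereas I compute exact step counts with breadth-first search on a known six-state chain, and the names <code>chain_neighbors</code>, <code>steps_to_goal</code>, and <code>shaped_reward</code> are hypothetical.</p>

```python
from collections import deque

N_STATES = 6  # states 0..5 on a line; the agent can step left or right
GOAL = 5


def chain_neighbors(state):
    """States reachable in one step on the toy chain (moves are reversible)."""
    return [s for s in (state - 1, state + 1) if s in range(N_STATES)]


def steps_to_goal(goal):
    """Breadth-first search outward from the goal: fewest steps from each state."""
    distance = {goal: 0}
    frontier = deque([goal])
    while frontier:
        state = frontier.popleft()
        for prev in chain_neighbors(state):
            if prev not in distance:
                distance[prev] = distance[state] + 1
                frontier.append(prev)
    return distance


distance = steps_to_goal(GOAL)


def shaped_reward(state):
    """Negative distance: the reward rises smoothly as the agent nears the goal."""
    return -distance[state]


# A sparse reward would be zero everywhere except the goal; this one gives
# a useful signal at every state.
rewards = [shaped_reward(s) for s in range(N_STATES)]
```

<p>Every step toward the goal improves the reward by one, so even a greedy agent makes progress from any starting state.</p>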
<p>The semi-supervised way in which they approach this problem also strikes me as relevant to AI safety, a topic of personal interest. With the unsupervised training of their dynamical distance measure, they add "a small amount of preference supervision" to set the task goal, and this results in its achievement. Manually specified reward functions can be very dangerous, and I'm interested in any novel methods that avoid their direct use (or, more to the point, I'm interested in methods of motivating AIs which more directly align holistic human flourishing with the AI's objectives).</p>
<p>For a quick overview, don't miss the link they posted at the end of the abstract.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="7.-Model-Imitation-for-Model-Based-Reinforcement-Learning">7. Model Imitation for Model-Based Reinforcement Learning<a class="anchor-link" href="#7.-Model-Imitation-for-Model-Based-Reinforcement-Learning">¶</a></h1><blockquote><p>Model-based reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the one-step transition models, which has inherent difficulty ensuring the matching of distributions from multi-step rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multi-step rollouts sampled from the synthesized model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the state-of-the-art in terms of sample complexity and average return.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1909.11821v1">https://arxiv.org/abs/1909.11821v1</a></p>
<p>AGI seems likely to be model-based, rather than model-free. I think this because I (an AGI) personally reuse my own models all the time, frequently attempting near-transfer to solve some novel problem. So anything that claims progress on model-based learning is at least worth a look to me.</p>
<p>Earlier I blogged about <a href="https://computable.ai/articles/2019/Jul/28/efficient-exploration-with-self-imitation-learning.html">Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy</a>, and looking at it now, I'm surprised I only <em>alluded</em> to their use of Transformers. In that paper, they learn to imitate a past trajectory by mimicking the trajectory distribution, conditioned on a past trajectory. <em>This</em> paper wants to create a model of the environment that similarly mimics its distribution, but they use Wasserstein GANs (WGANs) instead. GANs have been wildly successful in generative image models, and WGANs are an especially promising variant. I've been keeping an eye out for papers that use GANs in areas outside computer vision.</p>
<p>If you know what a WGAN is and you understand that the authors are trying to get a WGAN to mimic the environment's bounded trajectory segment transition distribution, then you can imagine what they're doing. They also provide a theoretical bound for the expected distributional error.</p>
<p>I'll mention in closing that in their experiments, they also end up doing better than most other methods, using 50% fewer samples. That it works at all suggests to me that the learned model is accurate enough to be worth noticing. GANs are interesting.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="8.-Zermelo's-problem:-Optimal-point-to-point-navigation-in-2D-turbulent-flows-using-Reinforcement-Learning">8. Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning<a class="anchor-link" href="#8.-Zermelo's-problem:-Optimal-point-to-point-navigation-in-2D-turbulent-flows-using-Reinforcement-Learning">¶</a></h1><blockquote><p>To find the path that minimizes the time to navigate between two given points in a fluid flow is known as Zermelo's problem. Here, we investigate it by using a Reinforcement Learning (RL) approach for the case of a vessel which has a slip velocity with fixed intensity, $V_s$, but variable direction and navigating in a 2D turbulent sea. We show that an Actor-Critic RL algorithm is able to find quasi-optimal solutions for both time-independent and chaotically evolving flow configurations. For the frozen case, we also compared the results with strategies obtained analytically from continuous Optimal Navigation (ON) protocols. We show that for our application, ON solutions are unstable for the typical duration of the navigation process, and are therefore not useful in practice. On the other hand, RL solutions are much more robust with respect to small changes in the initial conditions and to external noise, even when $V_s$ is much smaller than the maximum flow velocity. Furthermore, we show how the RL approach is able to take advantage of the flow properties in order to reach the target, especially when the steering speed is small.</p>
</blockquote>
<p><a href="https://arxiv.org/abs/1907.08591v2">https://arxiv.org/abs/1907.08591v2</a></p>
<p>This paper is a pure personal indulgence. I read James Gleick's <a href="https://www.amazon.com/gp/product/0143113453">Chaos</a> and got interested in dynamical systems theory. I don't know much, but I do know turbulent flows are a pain to predict, which I assume means they're a pain to navigate within. I've also heard that <a href="https://link.springer.com/article/10.1007/BF02312352">neural networks do surprisingly well at predicting chaotic dynamics</a>, so I'm interested to see it "applied". The paper brings up several other examples of successful neural navigation and prediction:</p>
<blockquote><p>Promising results have been obtained when applying RL algorithms to similar problems, such as the training of smart inertial particles or swimming particles navigating intense vortex regions [31], Taylor Green flows [32] and ABC flows [33]. RL has also been successfully implemented to reproduce schooling of fishes [34, 35], soaring of birds in a turbulent environments [36, 37] and in many other applications [38–40]. Similarly, in the recent years, artificial intelligence techniques are establishing themselves as new data driven models for fluid mechanics in general [41–46].</p>
</blockquote>
<p>I can't state their results better than they can, so here you go:</p>
<blockquote><p>In this paper, we show that for the case of vessels that have a slip velocity with fixed intensity but variable direction, RL can find a set of quasi-optimal paths to efficiently navigate the flow. Moreover, RL, unlike ON, can provide a set of highly stable solutions, which are insensitive to small disturbances in the initial condition and successful even when the slip velocity is much smaller than the guiding flow. We also show how the RL protocol is able to take advantage of different features of the underlying flow in order to achieve its task, indicating that the information it learns is non-trivial.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>Surprisingly, even this list of eight does not cover <em>all</em> of the papers that sounded interesting to me. It was <em>quite</em> a good week for announcements on the arXiv.</li>
<li>This is the format I originally had in mind for arXiv highlights, but since the abstracts tend to invite questions that I can't answer without at least skimming the paper, I ended up reading them more thoroughly. With this format, I can cover more ground, but less deeply.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Active Perception in Adversarial Scenarios2019-09-22T00:00:00-04:002019-09-22T00:00:00-04:00Daniel Coxtag:computable.ai,2019-09-22:/articles/2019/Sep/22/active-perception-in-adversarial-scenarios.html<p>Accumulating evidence about peers to discriminate potential threats.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's paper is <a href="https://arxiv.org/abs/1902.05644v1">Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning</a>. The idea is that an agent interacting with another agent can learn to assess the threat it may pose. It does this by actively testing the opponent agent's behavior, and does not assume the opponent's behavior remains stationary. It uses Bayesian filtering to update its belief about the disposition of the opponent, and that's why this paper caught my eye. I'm on a Bayesian kick lately.</p>
<blockquote><p>To summarize, the contribution here is the development of a scalable robust active perception method in scenarios where a potential adversary opponent could be actively hostile to the intent recognition activity, which extends and outperforms the POMDP methods.</p>
</blockquote>
<p>I'm a bit short on time this week, so I apologize for the amount of jargon and the unusually high level of confusion.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Problem-setup">Problem setup<a class="anchor-link" href="#Problem-setup">¶</a></h1><blockquote><p>We model the active perception problem as a planning problem, defined by the tuple $\langle S,A^a,A^o,T,O,R,b_0,\gamma \rangle$, where $S=\langle S^o,S^p \rangle$ is the state of the world, consisting of the set of observable states $S^o$ and the set of partially observable states $S^p$; $A^a$ is the set of actions of the autonomous agent; $A^o$ is the set of actions of the opponent; we further assume that regardless of the intention, the opponent has the same set of observable actions. Otherwise, an intention is easily identifiable once an action that is uniquely corresponding to that type of intention is observed. $T:S \times A^a \times A^o \rightarrow \Delta_S $ is the transition probability, where $\Delta_{\bullet}$ denotes the space of probability distribution over the space $\bullet$. $O: S \times A^a \rightarrow \Delta_{A^o}$ is the observation probability; $R: S \times A^a \times A^o \rightarrow \mathbb{R}$ is the reward function; $b_0$ is the prior probability of the opponent being an adversary; and $\gamma$ is the discount factor.</p>
</blockquote>
<p>Further, the opponent is assumed to be either neutral (merely self-interested, in a known way) or hostile (goal-directed, as defined by a known MDP), with bounded rationality (it may not be able to take the optimal action), and it is likely to behave deceptively.</p>
<p>Notice that the actual behavior of the opponent is known if its disposition is known, which to my mind may or may not be a reasonable assumption, depending on the setting. Since I've had AI safety on the brain lately, it strikes me as <em>unrealistic</em> in a situation where your opponent is smarter than you are. It may be more realistic in settings where everyone has the same goal and it's relatively clear how anyone would try to achieve it if they didn't have to deal with other agents.</p>
<p>The authors' adversarial model is interesting. ($\lambda$ is the parameter to $\pi^o$ that specifies whether the agent is neutral: $\lambda=0$, or adversarial: $\lambda=1$):</p>
<blockquote><p>We use the following equation to model an adversarial agent's policy $\pi^o$:
$$
\begin{align}
\pi^o(a^o_t|s_t,\lambda=1;\alpha,\beta)= & \text{argmin}_{\pi \in \Delta} \{\mathbb{KL}(\pi|\pi^{\text{MDP}}_{\alpha})\\
& +\beta \mathbb{KL}(\pi|\pi^o(\cdot|s_t,\lambda=0)) \}\\
\pi^{\text{MDP}}_{\alpha}(a_t^o|s_t,\lambda=1)= & e^{\alpha Q(s_t,a_t^o)}/Z(s_t)
\end{align}
$$</p>
</blockquote>
<p>The thing to take away from this is that both rationality and deception are tunable parameters. The rationality of the opponent is controlled by the temperature parameter $\alpha$, by adjusting how well the opponent makes use of the optimal Q function. The degree to which the opponent is <em>deceptive</em> is controlled by $\beta$, which adjusts how much the KL-divergence of the existing policy from the <em>neutral</em> policy affects the opponent's search for an optimal strategy.</p>
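<p>To see how the two knobs interact, here is a small sketch for a single state with a discrete action set. It is not code from the paper. The closed form used in <code>deceptive_policy</code> is my own derivation from the quoted objective (setting the gradient of the Lagrangian to zero gives a normalized geometric mixture of the two policies), and the function names and the example numbers are hypothetical.</p>

```python
import math


def softmax_policy(q_values, alpha):
    """Boltzmann policy: pi(a) proportional to exp(alpha * Q(a))."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(alpha * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def deceptive_policy(pi_mdp, pi_neutral, beta):
    """Minimizer of KL(pi || pi_mdp) + beta * KL(pi || pi_neutral) over the simplex.

    Assumed closed form (my derivation, not quoted from the paper):
    pi(a) is proportional to
    pi_mdp(a) ** (1 / (1 + beta)) * pi_neutral(a) ** (beta / (1 + beta)).
    """
    weights = [
        (p ** (1.0 / (1.0 + beta))) * (n ** (beta / (1.0 + beta)))
        for p, n in zip(pi_mdp, pi_neutral)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical numbers: the hostile goal favors action 0, the neutral
# opponent favors action 2.
pi_mdp = softmax_policy([2.0, 0.0, 0.0], alpha=1.5)
pi_neutral = [0.2, 0.3, 0.5]

honest = deceptive_policy(pi_mdp, pi_neutral, beta=0.0)     # pure goal pursuit
blended = deceptive_policy(pi_mdp, pi_neutral, beta=1.0)    # partly disguised
disguised = deceptive_policy(pi_mdp, pi_neutral, beta=1e6)  # nearly neutral
```

<p>With $\beta=0$ the adversary simply plays its Boltzmann-rational policy, and as $\beta$ grows its behavior becomes harder to tell apart from the neutral policy, which is exactly what makes intent recognition difficult.</p>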
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Bayesian-filtering">Bayesian filtering<a class="anchor-link" href="#Bayesian-filtering">¶</a></h1><blockquote><p>We maintain a belief $b_t(\lambda)$ over the hidden variable by Bayesian filtering.</p>
</blockquote>
<p>As I mentioned, I'm rather short on time today, so I must apologize again for not actually spending the time to explain this. For now, suffice it to say that the opponent is either neutral ($\lambda=0$) or hostile ($\lambda=1$), and how your agent reacts to it depends very much on which one of those it believes it is playing against. Bayesian filtering will allow it to make the most of the evidence available, so it can use its best guess as it trains.</p>
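<p>For a binary hidden variable, one step of Bayesian filtering is short enough to write out. This is a generic sketch of Bayes' rule, not the authors' code: it assumes $\lambda$ stays fixed during an episode and that the likelihood of the observed opponent action under each disposition is available and positive. The function name <code>update_belief</code> and the example likelihoods are hypothetical.</p>

```python
def update_belief(belief, likelihood_hostile, likelihood_neutral):
    """One Bayes update of b = P(lambda = 1) after observing an opponent action.

    likelihood_hostile is P(action | state, lambda = 1) and
    likelihood_neutral is P(action | state, lambda = 0).
    """
    numerator = belief * likelihood_hostile
    evidence = numerator + (1.0 - belief) * likelihood_neutral
    return numerator / evidence


# Start from an uninformed prior b_0 = 0.5 and observe three actions,
# each twice as likely under the hostile policy as under the neutral one.
belief = 0.5
history = [belief]
for _ in range(3):
    belief = update_belief(belief, likelihood_hostile=0.6, likelihood_neutral=0.3)
    history.append(belief)
```

<p>Each observation doubles the odds in favor of "hostile", so the belief climbs from 1/2 to 2/3, then 4/5, then 8/9.</p>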
<blockquote><p>We define a hybrid belief-state dependent reward to balance exploration and safety
\begin{equation}
\begin{aligned}
r(b_t,s_t,a^a_t)&=-H(b_t)+r(s_t,a^a_t)\\
&=b\log b+(1-b)\log(1-b)+r(s_t,a^a_t),
\end{aligned}
\label{eq6}
\end{equation}
where we use the shorthand $b$ to denote $b_t(\lambda=1)$, the belief that the opponent is an adversary; and $r(s_t,a^a_t)$ is the state dependent reward.</p>
<p>This reward balances exploration behavior and safety. The negative entropy reward $-H(b_t)$ can be interpreted as maximizing the expected logarithm of true positive rate (TPR) and true negative rate (TNR). The state-dependent reward $r(s_t,a^a_t)$ depends both on the observable state and the partially observable intent state $\lambda$, as well as the action of the autonomous agent. This reward
is used to ensure safety. For instance, some actions could be dangerous to the neutral [opponent], which are discouraged by a large negative reward.</p>
</blockquote>
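<p>The quoted reward is simple to compute once the belief is a single number. The sketch below follows the equation above directly; the function names are my own, and the zero-entropy convention at $b=0$ and $b=1$ is the usual limit rather than something stated in the paper.</p>

```python
import math


def belief_entropy(b):
    """Entropy (in nats) of a Bernoulli belief b = P(lambda = 1)."""
    if b in (0.0, 1.0):
        return 0.0  # limit of b * log(b) as b goes to 0
    return -(b * math.log(b) + (1.0 - b) * math.log(1.0 - b))


def hybrid_reward(b, state_reward):
    """r(b, s, a) = -H(b) + r(s, a), following the quoted equation."""
    return -belief_entropy(b) + state_reward


uncertain = hybrid_reward(0.5, state_reward=0.0)  # maximally unsure
confident = hybrid_reward(0.9, state_reward=0.0)  # mostly sure
certain = hybrid_reward(1.0, state_reward=0.0)    # no uncertainty left
```

<p>The entropy term is most negative at $b=0.5$ and vanishes when the agent is certain, so the agent is paid for taking actions that resolve its uncertainty, while the state-dependent term penalizes resolving it dangerously.</p>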
<p>Our agent is trained using <a href="https://arxiv.org/abs/1702.08165">Soft-Q Learning</a> while values of $\lambda$ are varied, with corresponding opponent behavior. Interestingly, in the case study section the authors mention that the actual adversary models were not always provided in the learning phase.</p>
<blockquote><p>The active perception agent has to identify the hidden intent while being robust to this model uncertainty, which is challenging.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I admit to being a bit confused by this paper. The authors claim to do Bayesian filtering, but it's not an explicit feature of the algorithm. In fact, they seem to be sampling $\lambda$ for use in training by using only $b_0$, their prior probability for their belief state. Perhaps it's a typo.</li>
<li>They also seem to claim that the two models of the opponent behavior must be known, but then they mention they're not available during the learning phase in their case study. Drop me a line if this makes sense to you.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Discovery of Useful Questions as Auxiliary Tasks2019-09-15T00:00:00-04:002019-09-15T00:00:00-04:00Daniel Coxtag:computable.ai,2019-09-15:/articles/2019/Sep/15/discovery-of-useful-questions-as-auxiliary-tasks.html<p>Learning more like a human, and more like a scientist, by actively seeking useful auxiliary questions during learning.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In case you're wondering what happened to your feed reader this week: We've decided to retitle all of the arXiv highlights posts to be more attractive. We promise not to do this often, but it seemed like a good time to do it while we're inconveniencing very few people.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's paper is <a href="https://arxiv.org/abs/1909.04607v1">Discovery of Useful Questions as Auxiliary Tasks</a> from the University of Michigan and DeepMind. It was accepted to NeurIPS 2019 (which I rather hope I'll be attending). The paper contains a very exciting concept that strikes at the heart of human learning: We learn not only by noticing statistical correlations and inferring concepts, but by actively seeking the answers to helpful questions that occur to us as we navigate the world. That's also much of what science is about: increasing your understanding of the world by choosing particularly good questions to ask.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Useful-questions-as-an-auxiliary-task">Useful questions as an auxiliary task<a class="anchor-link" href="#Useful-questions-as-an-auxiliary-task">¶</a></h1><p>The authors formulate the problem as a reinforcement learning problem with a main task you'd like to accomplish, augmented with auxiliary tasks generated by the system itself to aid in representation learning, and ultimately to accomplish the main task more efficiently. I've mentioned before that this is of professional interest to me.</p>
<p>In this paper the questions are represented as "general value functions" (GVFs), "a fairly rich form of knowledge representation", because</p>
<blockquote><p>GVF-based auxiliary tasks have been shown in previous work to improve the sampling efficiency of reinforcement learning agents engaged in learning some complex task....
It was then shown that by combining gradients from learning the auxiliary GVFs with the updates from the main task, it was possible to accelerate representation learning and improve performance. It fell, however, onto the algorithm designer to design questions that were useful for the specific task.</p>
</blockquote>
<p>The main insight in this paper is that the gradients induced while learning the main task contain information about what questions would aid in learning a helpful representation.</p>
<blockquote><p>The main idea is to use meta-gradient RL to discover the questions so that answering them maximises the usefulness of the induced representation on the main task.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Auxiliary-tasks">Auxiliary tasks<a class="anchor-link" href="#Auxiliary-tasks">¶</a></h1><p>Why should learning something other than the main task help? It teaches composable fundamentals relevant to the task so that the neural network doesn't have to learn everything from scratch all at once. The kinds of auxiliary tasks we're talking about here are things like controlling pixel intensities and feature activations. Other examples mentioned in the paper are auxiliary tasks where the agent needed to learn to measure depth, loop-closures (e.g., the letter "C" is not closed, but the letter "O" is), observation reconstruction (which, as an aside, can be used in the construction of intrinsically-motivated, "curious" agents), reward prediction, etc. When agents were required to learn each of these tasks simultaneously with learning their own main tasks, they learned more efficiently than when they were required to learn their main task alone.</p>
<p>But, as we just discussed, each of these examples (see the paper for more) was hand-crafted. The agents themselves did not attempt to add to their tasks, and careful hand-tuning was required to get the observed improvements.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Meta-learning">Meta-learning<a class="anchor-link" href="#Meta-learning">¶</a></h1><blockquote><p>A meta-learner progressively improves the learning process of a learner that is attempting to solve some task.</p>
</blockquote>
<p>I can hardly overstate how useful this is. In my own work, we aren't done as soon as we've trained a neural network to perform well on a single task. There is an entire host of related tasks on which we'll need to retrain it in the future. Our work involves training an agent to control the behavior of some software, which is not fixed. If our agent cannot be quickly retrained on other software (perhaps out of our direct control), then it becomes much more expensive and difficult to maintain.</p>
<p>This paper mentions previous work in learning better initializations for a given task, learning to explore, unsupervised learning to develop a good or compact representation, few-shot model adaptation, and learning to improve the optimizers.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-discovery-of-useful-questions">The discovery of useful questions<a class="anchor-link" href="#The-discovery-of-useful-questions">¶</a></h1><p>This is Figure 1 of the paper, depicting the architecture that discovers and uses useful questions. It consists of two neural networks, a main task & answer network parametrized by $\theta$, and a question network parametrized by $\eta$. The main task & answer network takes in the last $i$ observations $o_{t-i+1:t}$ and produces two categories of output: a) decisions from the policy $\pi_t$ and b) answers to the "useful questions" $y_t$. The question network takes $j$ <em>future</em> observations $o_{t+1:t+j}$, and produces two outputs: a) <em>cumulants</em> $u_t$, and b) discounts $\gamma_t$. Cumulants (a term from the GVF literature) are described as scalar functions of the state, the sum of which must be maximized. To me, this just sounds like an abstruse way to say "other loss function", which makes sense because these are what describe our auxiliary goals.</p>
<p><img src="https://computable.ai/images/useful_questions_figure1.png" alt="Auxiliary Question Discovery Arch"></p>
<p>Lest you think this method requires time travel, fear not. We can see $j$ steps into the future using the time machine of Waiting, which is ok because it only happens during training.</p>
<p>As the authors explain, previous work with auxiliary tasks would have only had the main task & answer network on the left, because the cumulants and discounts were hand-crafted. The question network on the right, and its effective use, is the main contribution of this paper. The <em>number</em> of "other loss functions" is still fixed, but the components of the actual functions that compute them (cumulants and discounts) are represented by an $\eta$-parametrized neural network that is itself trained <em>on the gradients of the $\theta$-parametrized main task and answer network</em>.</p>
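<p>To make the role of cumulants and discounts a little more concrete, here is a minimal sketch of how a sequence of them can define a prediction target, in the same way that rewards and a discount factor define a return. The indexing convention, the function name, and the numbers are my assumptions, not the paper's exact formulation.</p>

```python
def discounted_cumulant_sums(cumulants, discounts, bootstrap=0.0):
    """Return G[t] = cumulants[t] + discounts[t] * G[t + 1] for every step t."""
    targets = [0.0] * len(cumulants)
    next_value = bootstrap  # value assumed to follow the final step
    for t in reversed(range(len(cumulants))):
        next_value = cumulants[t] + discounts[t] * next_value
        targets[t] = next_value
    return targets


# Hypothetical outputs of a question network over three steps.
targets = discounted_cumulant_sums([1.0, 0.0, 2.0], [0.5, 0.5, 0.0])
print(targets)
```

<p>An answer $y_t$ would then be trained to predict the corresponding target, so changing the cumulants and discounts changes what the answer network is asked to learn.</p>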
<p>In the researchers' own words:</p>
<blockquote><p>In their most abstract form, reinforcement learning algorithms can be described by an update procedure $\Delta \theta_t$ that modifies, on each step $t$, the agent's parameters $\theta_t$. The central idea of meta-gradient RL is to parameterise the update $\Delta \theta_t(\eta)$ by meta-parameters $\eta$. We may then consider the consequences of changing $\eta$ on the $\eta$-parameterised update rule by measuring the subsequent performance of the agent, in terms of a "meta-loss" function $m(\theta_{t+k})$. Such meta-loss may be evaluated after one update (myopic) or $k > 1$ updates (non-myopic). The meta-gradient is then, by the chain rule,
\begin{align}
\frac{\partial m(\theta_{t+k})}{\partial\eta} &= \frac{\partial m(\theta_{t+k})}{\partial\theta_{t+k}} \frac{\partial\theta_{t+k}}{\partial\eta}.\label{eqn:no_approx}
\end{align}</p>
</blockquote>
<p>The actual computation of this is challenging, because changing $\eta$ affects updates to $\theta$ on <em>all future timesteps</em>. This is the reason training the question network requires looking $j$ steps "into the future". Holding $\eta$ fixed, they compute $\theta_t \rightarrow ... \rightarrow \theta_{t+j}$, in order to finally compute the meta-loss evaluation $m(\theta_{t+j})$.</p>
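<p>The chain rule above is easier to trust after seeing it on a toy problem. The sketch below is a one-step (myopic) scalar example of my own construction, not the paper's setup: the inner loss is $(\theta - \eta)^2$, the meta-loss is $(\theta - 3)^2$, and the constants are hypothetical. It compares the analytic meta-gradient with a finite-difference estimate.</p>

```python
ALPHA = 0.1  # inner learning rate (hypothetical)
GOAL = 3.0   # optimum of the toy meta-loss (hypothetical)


def inner_update(theta, eta):
    """One gradient step on the inner loss (theta - eta) ** 2."""
    return theta - ALPHA * 2.0 * (theta - eta)


def meta_loss(theta):
    """Toy main-task loss evaluated on the updated parameter."""
    return (theta - GOAL) ** 2


def meta_gradient(theta, eta):
    """Chain rule: dm/d_eta = dm/d_theta_next * d_theta_next/d_eta."""
    theta_next = inner_update(theta, eta)
    dm_dtheta_next = 2.0 * (theta_next - GOAL)
    dtheta_next_deta = 2.0 * ALPHA
    return dm_dtheta_next * dtheta_next_deta


theta, eta = 0.0, 1.0
analytic = meta_gradient(theta, eta)

# Central finite difference through the whole update, as a sanity check.
eps = 1e-6
numeric = (meta_loss(inner_update(theta, eta + eps))
           - meta_loss(inner_update(theta, eta - eps))) / (2 * eps)
print(analytic, numeric)
```

<p>With more than one inner update, $\partial\theta_{t+k} / \partial\eta$ has to be accumulated through every intermediate step, which is where the real computational difficulty comes from.</p>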
<p>The algorithm then alternates between normal RL training of the main task & answer network, and meta-gradient training of the question network to produce and use questions that maximize the performance of the agent on the original task. It is a very general solution, and empirically outperforms hand-designed auxiliary tasks in many cases.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>The authors themselves note that their algorithm augments an <em>on-policy</em> reinforcement learning algorithm, and I look forward to their promised future work adapting these techniques to an off-policy setting.</li>
<li>I notice I take detours from the article's main purpose to write about areas of RL that I want to remember to investigate further in the future (e.g., auxiliary tasks in general, and meta-learning in general). That's a good habit, though I'll need to remember to cultivate it without seeming too distracted.</li>
<li>This paper mentions that Xu et al. in 2018 tried learning the discount factor $\gamma$ and the bootstrapping factor $\lambda$ (using meta-gradients), which is an idea I had myself (a year later). Apparently this substantially improved performance on the Atari domain, so I feel vindicated.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Deep Reinforcement Learning without Catastrophic Forgetting2019-09-09T00:00:00-04:002019-09-09T00:00:00-04:00Daniel Coxtag:computable.ai,2019-09-09:/articles/2019/Sep/09/deep-reinforcement-learning-without-catastrophic-forgetting.html<p>Long-term learning of multiple tasks without forgetting old skills, using a new technique called Pseudo-Rehearsal.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Apologies for missing a week. Today's post is on last week's paper, and I'm going to skip this week to get back on track. I'm also experimenting with the format some more to keep things sustainable given my wildly variable weekend free time. If you have thoughts about this, please leave us a comment!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This (last) week's paper is <a href="https://arxiv.org/abs/1812.02464">Pseudo-Rehearsal: Achieving Deep Reinforcement Learning without Catastrophic Forgetting</a>. I'm interested for reasons both professional and personal.</p>
<p>First, I have this problem. Our recent (successful) work has gotten neural nets to do some very interesting things, but expanding will require continuous training in production. This makes catastrophic forgetting (CF) a very real problem, since most of the DRL research assumes you're training your agent on a single task, and then enjoying it in inference mode forever after.</p>
<p>Second, I'm interested because I've got a little son (the source of the variability in my weekend free time), and I often see him learn something mind-bogglingly fast, and then cement it over the course of a couple days. Pseudo-rehearsal is biologically plausible, and I'm interested in intelligence in its own right.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Catastrophic-Forgetting-and-Pseudo-rehearsal">Catastrophic Forgetting and Pseudo-rehearsal<a class="anchor-link" href="#Catastrophic-Forgetting-and-Pseudo-rehearsal">¶</a></h1><p>An agent trained on one task can learn to accomplish that task. If that same agent is then moved to another task, it will learn that other task, but often at the expense of "catastrophically forgetting" the neural net weights learned for the previous task. Several solutions have been proposed (which are cited in today's paper, and I'll likely be reading them), but most are likely <em>not</em> what humans and animals do.</p>
<blockquote><p>Researchers have proposed extensions to this method such as utilising previous examples’ gradients during learning, picking a subset of previous samples which best represents the population and using a variational auto-encoder to compress stored items. Such rehearsal methods are cognitively implausible and therefore, do not shine light on how mammal brains might efficiently solve the CF problem.</p>
</blockquote>
<p>Pseudo-rehearsal trains a generative model (a GAN) to produce examples from all previous tasks, and uses this to implicitly rehearse foregoing data. Today's paper employs this scheme and a few other tricks to build a system capable of learning multiple tasks.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-RePR-model">The RePR model<a class="anchor-link" href="#The-RePR-model">¶</a></h1><p>The researchers dub their method RePR, and it works like this: They build short- and long-term memory systems, and transfer learned behaviors from short- to long-term memory while rehearsing past behavior in long-term memory.</p>
<p>The STM system:</p>
<blockquote><p>The first part of our model is the short-term memory (STM) system, which serves a similar function to the hippocampus and is used to learn the current task. The STM system contains two components, a DQN that learns the current task and an experience replay containing data only from the current task.</p>
</blockquote>
<p>The LTM system:</p>
<blockquote><p>The second part is the long-term memory (LTM) system, which serves a similar function to the cortex. The LTM system also has two components, a DQN containing knowledge of all tasks learnt and a GAN which can generate sequences representative of these tasks.</p>
</blockquote>
<p>They then do periodic consolidation:</p>
<blockquote><p>During consolidation, the LTM retains previous knowledge through pseudo-rehearsal, while being taught by the STM how to respond on the current task. All of the networks’ architectures and training parameters used throughout our experiments can be found in the appendices.
Transferring knowledge between these two systems is achieved through knowledge distillation, where a student network is optimised so that it outputs similar values to a teacher network.</p>
</blockquote>
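<p>The consolidation step can be sketched as a loss with two parts: one that distils the current task from the STM network, and one that keeps the new LTM network close to the old one on generated (pseudo) items. This is a simplified sketch under my own assumptions: it uses a plain mean squared error, and the function names, the weighting, and the Q-values are hypothetical. The paper's exact loss may differ.</p>

```python
def mean_squared_error(student_values, teacher_values):
    """Average squared gap between two equal-length lists of Q-values."""
    gaps = [(s - t) ** 2 for s, t in zip(student_values, teacher_values)]
    return sum(gaps) / len(gaps)


def consolidation_loss(ltm_new, stm, ltm_new_on_generated, ltm_old_on_generated,
                       rehearsal_weight=1.0):
    """Distil the current task from the STM and rehearse old tasks on generated items."""
    distill = mean_squared_error(ltm_new, stm)
    rehearse = mean_squared_error(ltm_new_on_generated, ltm_old_on_generated)
    return distill + rehearsal_weight * rehearse


# Hypothetical Q-values for one real state (current task) and one generated state.
loss = consolidation_loss(
    ltm_new=[1.0, 2.0], stm=[1.0, 4.0],
    ltm_new_on_generated=[0.0, 1.0], ltm_old_on_generated=[1.0, 1.0],
)
print(loss)
```

<p>The rehearsal term never touches stored data from earlier tasks. It only needs the generator and a frozen copy of the previous LTM network, which is what makes the scheme "pseudo" rehearsal.</p>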
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>This sounds brilliant, and analogous to what mammals do. I'm eager to experiment with it, and to introspect and ponder how my own brain learns, with this new model in mind.</li>
<li>I wonder very much what we do in sleep. <a href="https://computable.ai/articles/2019/Mar/10/boltzmann-machines-differentiation-work.html">As I've mentioned before</a>, I'm quite attracted to the model described in <a href="https://theneural.wordpress.com/2011/07/08/the-miracle-of-the-boltzmann-machine/">The Miracle of the Boltzmann Machine</a>, but off-hand, I don't know how to reconcile that model with the concept of nightly rehearsal of the day's activities. Perhaps the brain is doing <em>two</em> things during sleep? Occam's razor impels me to think again.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Reward tampering2019-08-25T00:00:00-04:002019-08-25T00:00:00-04:00Daniel Coxtag:computable.ai,2019-08-25:/articles/2019/Aug/25/reward-tampering.html<p>Improving safety and control by preventing all manner of reward tampering by the agent itself.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week I just want to pull the list of reward tampering methods from <a href="https://arxiv.org/abs/1908.04734">Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective</a> to promote awareness of this problem. The paper is interesting for several other reasons as well, and I commend it to you:</p>
<blockquote><p>Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Reward-tampering">Reward tampering<a class="anchor-link" href="#Reward-tampering">¶</a></h1><p>I've heard it said that no agent will ever become more intelligent than it takes to edit its own reward function, giving itself a simpler task. This paper treats such problems seriously, with some encouraging results.</p>
<blockquote><p>From an AI safety perspective, we must bear in mind that in any practically implemented system, agent reward may not coincide with user utility. In other words, the agent may have found a way to obtain reward without doing the task. This is sometimes called reward hacking or reward corruption. We distinguish between a few different types of reward hacking.</p>
</blockquote>
<h2 id="Reward-gaming-vs.-reward-tampering">Reward gaming vs. reward tampering<a class="anchor-link" href="#Reward-gaming-vs.-reward-tampering">¶</a></h2><p>The authors make a distinction between <em>reward gaming</em>, where the agent exploits a misspecification of the process that determines the rewards, and <em>reward tampering</em>, where the agent actually modifies that process. This paper is focused on the latter.</p>
<p>They then subdivide reward tampering into three subcategories, according to whether the agent has tampered with the function itself, the feedback that trains the reward function, or the input to the reward function.</p>
<h2 id="Hacking-the-reward-function:-Section-3">Hacking the reward function: Section 3<a class="anchor-link" href="#Hacking-the-reward-function:-Section-3">¶</a></h2><blockquote><p>First, regardless of whether the reward is chosen by a computer program, a human, or both, a sufficiently capable, real-world agent may find a way to tamper with the decision. The agent may for example hack the computer program that determines the reward. Such a strategy may bring high agent reward and low user utility. This reward function tampering problem will be explored in Section 3.</p>
<p>Fortunately, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.</p>
</blockquote>
<p>In Section 3 the authors formalize the problem, and propose two reward variants that disincentivize tampering.</p>
<h2 id="Manipulating-the-feedback-mechanism:-Section-4">Manipulating the feedback mechanism: Section 4<a class="anchor-link" href="#Manipulating-the-feedback-mechanism:-Section-4">¶</a></h2><blockquote><p>The related problem of reward gaming can occur even if the agent never tampers with the reward function. A promising way to mitigate the reward gaming problem is to let the user continuously give feedback to update the reward function, using online reward-modeling. Whenever the agent finds a strategy with high agent reward but low user utility, the user can give feedback that dissuades the agent from continuing the behavior. However, a worry with online reward modeling is that the agent may influence the feedback. For example, the agent may prevent the user from giving feedback while continuing to exploit a misspecified reward function, or manipulate the user to give feedback that boosts agent reward but not user utility. This feedback tampering problem and its solutions will be the focus of Section 4.</p>
</blockquote>
<p>Section 4 proposes several potential modifications to disincentivize or directly prevent feedback manipulation, ultimately with the recommendation that they be combined in an ensemble.</p>
<h2 id="Input-tampering:-Section-5">Input tampering: Section 5<a class="anchor-link" href="#Input-tampering:-Section-5">¶</a></h2><blockquote><p>Finally, the agent may tamper with the input to the reward function, so-called RF-input tampering, for example by gluing a picture in front of its camera to fool the reward function that the task has been completed. This problem and its potential solution will be the focus of Section 5.</p>
</blockquote>
<p>Very interestingly, Section 5 argues that model-based methods avoid the input tampering problem.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Results-summary">Results summary<a class="anchor-link" href="#Results-summary">¶</a></h1><blockquote><p>One way to prevent the agent from tampering with the reward function is to isolate or encrypt the reward function, and in other ways trying to physically prevent the agent from reward tampering. However, we do not expect such solutions to scale indefinitely with our agent’s capabilities, as a sufficiently capable agent may find ways around most defenses. Instead, we have argued for design principles that prevent reward tampering incentives, while still keeping agents motivated to complete the original task. Indeed, for each type of reward tampering possibility, we described one or more design principles for removing the agent’s incentive to use it. The design principles can be combined into agent designs with no reward tampering incentive at all.</p>
<p>An important next step is to turn the design principles into practical and scalable RL algorithms, and to verify that they do the right thing in setups where various types of reward tampering are possible. With time, we hope that these design principles will evolve into a set of best practices for how to build capable RL agents without reward tampering incentives. We also hope that the use of causal influence diagrams that we have pioneered in this paper will contribute to a deeper understanding of many other AI safety problems and help generate new solutions.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I look forward to reading this paper more thoroughly, both because I understand this problem of disincentivising reward hacking is <em>hard</em>, and because Causal Influence Diagrams sound interesting and generally useful.</li>
<li>AI safety is important, and I rather hope that awareness of some ways your agents could cheat will help to prevent such errors from leaking out into the world before they are caught.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
DRL Not Superhuman on Atari2019-08-18T00:00:00-04:002019-08-18T00:00:00-04:00Daniel Coxtag:computable.ai,2019-08-18:/articles/2019/Aug/18/drl-not-superhuman-on-atari.html<p>DRL may not be superhuman on Atari after all, and how to avoid making mistakes like that in the future.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Just a sketch this week, calling your attention to <a href="https://arxiv.org/abs/1908.04683v1">Is Deep Reinforcement Learning Really Superhuman on Atari?</a>, which concludes not only that DRL is worse than the best humans on most Atari games, but by a <em>wide</em> margin.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="DRL-isn't-superhuman-on-Atari-yet">DRL <em>isn't</em> superhuman on Atari yet<a class="anchor-link" href="#DRL-isn't-superhuman-on-Atari-yet">¶</a></h1><p>Wait, what? I was quite skeptical of this claim. Mnih et al. published the groundbreaking <a href="https://arxiv.org/abs/1312.5602">Playing Atari with Deep Reinforcement Learning</a> in <em>2013</em>, claiming superhuman performance. Surely someone would have noticed by now?</p>
<p>Apparently not. For the next six years, most DRL papers compared against either the same human scores reported in that paper, or human beginners. It's true that DQN significantly outperformed their own human player, but that player was not, by far, <em>the best in the world</em>. Other recent claims of superhuman performance have been proven against the best players in the world (the paper mentions AlphaGo against Lee Sedol, OpenAI Five against OG, and AlphaStar against MaNa), but not for the Atari benchmark.</p>
<p>The most poignant detail to me in this paper involved the common "normalized human score", where 0% is the score of a random agent, and 100% is the score of the human baseline. <em>On this scale, the median score achieved by the world record holders across all Atari games is 4.4k%</em>. Clearly you can't claim superhuman performance if there are humans who beat your target by a factor of 44, unless you yourself exceed this score.</p>
<p>For reference, the original Rainbow algorithm achieved a median of 200% over all Atari games, and other algorithms seem to do worse. If the normalized human score is fitted to a maximum equal to the human world record for each game, and run with different time limits, a tuned IQN variant of Rainbow receives a median score of less than 4% (there were other problems with the way benchmarks were done, and correcting for them reduces performance even further).</p>
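<p>The effect of changing the reference point is easy to see with a few lines of arithmetic. This sketch uses the definition given above (0% for a random agent, 100% for the reference player). The raw scores are invented for illustration and do not come from the paper.</p>

```python
def normalized_score(agent, random_agent, reference):
    """Percent scale where the random agent is 0 and the reference player is 100."""
    return 100.0 * (agent - random_agent) / (reference - random_agent)


# Hypothetical raw scores for a single game.
RANDOM, BASELINE_HUMAN, WORLD_RECORD, AGENT = 100.0, 1100.0, 44100.0, 2100.0

versus_baseline = normalized_score(AGENT, RANDOM, BASELINE_HUMAN)
record_on_old_scale = normalized_score(WORLD_RECORD, RANDOM, BASELINE_HUMAN)
versus_record = normalized_score(AGENT, RANDOM, WORLD_RECORD)
print(versus_baseline, record_on_old_scale, versus_record)
```

<p>The same agent looks twice as good as the reference when that reference is the baseline human, and reaches only a few percent when the reference is the world record.</p>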
<p>We have a long way to go then. The paper has a useful analysis, drawing on both previous and original research, as to <em>why</em> DRL algorithms are so bad at Atari. Some of the causes, such as reward clipping, were explicitly chosen in previous research to improve performance, but (to treat this particular example) it has been noted that clipping causes the agent to prefer many small rewards over a single large reward.</p>
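<p>To see why clipping has that effect, here is a minimal sketch assuming rewards are clipped to the range $[-1, 1]$, a common choice for DQN-style agents on Atari. The reward values are invented.</p>

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a reward into the interval [low, high]."""
    return max(low, min(high, reward))


many_small = [10.0] * 5  # five small rewards (hypothetical)
one_large = [1000.0]     # one large reward (hypothetical)

raw = (sum(many_small), sum(one_large))
clipped = (sum(clip_reward(r) for r in many_small),
           sum(clip_reward(r) for r in one_large))
print(raw, clipped)
```

<p>Before clipping, the single large reward is worth twenty times more than the small ones combined. After clipping, the five small rewards are worth five times more than the large one, so the agent's preference flips.</p>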
<p>I encourage anyone working with the Atari benchmark to read the paper for themselves.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I actually find it somewhat personally encouraging that there's room for improvement on Atari. It's easy to experiment, and I have some ideas myself.</li>
<li>That said, it is rather scary that we could overlook something like this for so long, as a community.</li>
<li>Anyway, <em>someone</em> will take this as a call to arms, and make progress. Peter Drucker said, "If you can't measure it, you can't improve it." Now that we have better measurements, I predict improvements.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Three Method Comparison for Traffic Signal Control2019-08-11T00:00:00-04:002019-08-11T00:00:00-04:00Daniel Coxtag:computable.ai,2019-08-11:/articles/2019/Aug/11/three-method-comparison-for-traffic-signal-control.html<p>Comparing supervised learning, random search, and deep reinforcement learning on traffic signal control.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's paper, <a href="https://arxiv.org/abs/1908.02673v1">Large-scale traffic signal control using machine learning: some traffic flow considerations</a>, caught my eye for several reasons. First, traffic signal control is relevant to my own group's work involving microservice and network traffic management. Second, the authors use cellular automaton rule 184 as their traffic model, which is actually the first time I've seen a cellular automaton used for something serious since <a href="https://www.wolframscience.com/nks/">A New Kind of Science</a>, despite that book's claim about the likely broad usefulness of simple programs for complex purposes. Lastly, the authors find that supervised learning and random search outperform deep reinforcement learning for high occupancies of the traffic flow network,</p>
<blockquote><p>For occupancies > 75% during training, DRL policies perform very poorly for all traffic conditions, which means that DRL methods cannot learn under highly congested conditions.</p>
</blockquote>
<p>and they recommend that practitioners <em>throw away</em> congested data!</p>
<blockquote><p>Our findings imply that it is advisable for current DRL methods in the literature to discard any congested data when training, and that doing this will improve their performance under all traffic conditions.</p>
</blockquote>
<p>I also have to admit that I've thought to myself, waiting at empty intersections for a light to turn green, that I could just <em>solve</em> this problem with DRL. If I'm wrong, that would be very interesting and surprising.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Considerations-in-a-nutshell">Considerations in a nutshell<a class="anchor-link" href="#Considerations-in-a-nutshell">¶</a></h1><p>The introduction and background are well summarized in their last paragraph:</p>
<blockquote><p>In summary, most recent studies focus on developing effective and robust multi-agent DRL algorithms to achieve coordination among intersections. The number of intersections in those studies are usually limited, thus their results might not apply to large open network. Although the signal control is indeed a continuing problem, it has been always modeled as an episodic process. From the perspective of traffic considerations, expert knowledge has only been incorporated in down-scaling the size of the control problem or designing novel reward functions for DRL algorithm. Few studies have tested their methods given different traffic demands, or shed lights on the learning performance under different traffic conditions, especially the congestion regimes. To fill the gap, our study will treat the large-scale traffic control as a continuing problem and extend classical RL algorithm to fit it. More importantly, noticing the lack of traffic considerations on learning performance, we will train DRL policies under different density levels and explore the results from a traffic flow perspective.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Set-up">Set up<a class="anchor-link" href="#Set-up">¶</a></h1><h2 id="Traffic">Traffic<a class="anchor-link" href="#Traffic">¶</a></h2><p><img src="http://atlas.wolfram.com/01/01/184/01_01_108_184.gif#right" alt="CA Rule 184"></p>
<p>This is elementary cellular automaton (CA) rule 184. Elementary cellular automata operate on a binary vector, producing a new binary vector in each step that's a function of the previous one. For each entry in the previous vector, the new value of the corresponding entry in the resulting vector depends on the previous entry and its neighbors to the left and right. There are 256 possible rules with this formulation, and this picture is of the 184th rule set when ordered in the natural way.</p>
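<p>To make the numbering concrete, here is a minimal sketch that decodes a rule number into its lookup table. The helper name <code>rule_table</code> is my own, and the bit ordering assumed is Wolfram's standard one, where bit $k$ of the rule number gives the output for the neighborhood whose three cells spell $k$ in binary.</p>

```python
def rule_table(number):
    """Decode an elementary CA rule number into a neighborhood lookup table."""
    table = {}
    for index in range(8):
        # The three bits of `index` are the (left, center, right) cells.
        left, center, right = (index >> 2) & 1, (index >> 1) & 1, index & 1
        # Bit `index` of the rule number is the next value of the center cell.
        table[(left, center, right)] = (number >> index) & 1
    return table


table_184 = rule_table(184)
for neighborhood in sorted(table_184, reverse=True):
    print(neighborhood, "->", table_184[neighborhood])
```

<p>Reading the table as traffic: a cell is occupied next step either because a car arrives from the left into an empty cell, or because the car already there is blocked by a car to its right.</p>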
<p>Rule 184 can be thought of as a flow of cars along a lane of traffic. Cars move forward (right) by one cell each step only if there is an open space in front of them, otherwise they wait for one to open up. Here's an example:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [1]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">rule_184</span><span class="p">(</span><span class="n">lane</span><span class="p">):</span>
    <span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="kc">False</span><span class="p">]</span> <span class="o">+</span> <span class="n">lane</span> <span class="o">+</span> <span class="p">[</span><span class="kc">False</span><span class="p">]</span> <span class="c1"># pad</span>
    <span class="k">return</span> <span class="p">[(</span><span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="ow">or</span> <span class="p">(</span><span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="ow">and</span> <span class="n">l</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)]</span>
<span class="k">def</span> <span class="nf">show</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">lane</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'t</span><span class="si">{t}</span><span class="s1">:</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="s1">'🚘'</span> <span class="k">if</span> <span class="n">i</span> <span class="k">else</span> <span class="s1">'_'</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">lane</span><span class="p">])</span> <span class="p">)</span>
<span class="n">ti</span> <span class="o">=</span> <span class="p">[</span><span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">,</span> <span class="kc">False</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">7</span><span class="p">):</span>
    <span class="n">show</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ti</span><span class="p">)</span>
    <span class="n">ti</span> <span class="o">=</span> <span class="n">rule_184</span><span class="p">(</span><span class="n">ti</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>t0: 🚘 🚘 🚘 🚘 🚘 _ _ 🚘 _ _ _ _ _ _ _
t1: 🚘 🚘 🚘 🚘 _ 🚘 _ _ 🚘 _ _ _ _ _ _
t2: 🚘 🚘 🚘 _ 🚘 _ 🚘 _ _ 🚘 _ _ _ _ _
t3: 🚘 🚘 _ 🚘 _ 🚘 _ 🚘 _ _ 🚘 _ _ _ _
t4: 🚘 _ 🚘 _ 🚘 _ 🚘 _ 🚘 _ _ 🚘 _ _ _
t5: _ 🚘 _ 🚘 _ 🚘 _ 🚘 _ 🚘 _ _ 🚘 _ _
t6: _ _ 🚘 _ 🚘 _ 🚘 _ 🚘 _ 🚘 _ _ 🚘 _
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The cellular automaton simulates a lane of traffic, and the authors wire two of these lanes up between each adjacent traffic light to create a grid network. The network is laid out on a torus, so there are no boundaries.</p>
<blockquote><p>The signalized network corresponds to a homogeneous grid network of bidirectional streets, with one lane per direction of length $n = 5$ cells between neighboring traffic lights.</p>
</blockquote>
<p><img src="https://computable.ai/images/signalized_network.png" alt="Signalized network"></p>
<blockquote><p>The connecting links to form the torus are shown as dashed directed links; we have omitted the cells on these links to avoid clutter. Each segment has n = 5 cells; an additional cell has been added downstream of each segment to indicate the traffic light color.</p>
</blockquote>
<p>Cars arriving at a green traffic light choose a random "direction" in which to continue. Green lights are on for a minimum of three steps.</p>
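<p>As a rough illustration of what "no boundaries" buys you, here is a sketch of rule 184 on a single circular lane, with a crude flow measurement. This is not the authors' network: there are no traffic lights or turns here, and the function names and the definition of flow (cars moved per cell per step) are my own assumptions.</p>

```python
def step_ring(lane):
    """One rule 184 step on a circular lane (periodic boundary)."""
    n = len(lane)
    return [
        (lane[i - 1] and not lane[i]) or (lane[i] and lane[(i + 1) % n])
        for i in range(n)
    ]


def moves(lane):
    """Number of cars that will advance on the next step."""
    n = len(lane)
    return sum(lane[i] and not lane[(i + 1) % n] for i in range(n))


def average_flow(lane, steps):
    """Average number of cars moved per cell per step over `steps` steps."""
    total = 0
    for _ in range(steps):
        total += moves(lane)
        lane = step_ring(lane)
    return total / (steps * len(lane))


ring = [True] * 6 + [False] * 14  # density 0.3, starting as a single jam
print(average_flow(ring, 200))
```

<p>Because cars never leave a ring, the density stays fixed, which is what makes it meaningful to plot flow against density as the paper's figures do.</p>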
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Learning">Learning<a class="anchor-link" href="#Learning">¶</a></h2><p>Each traffic signal is managed by an agent, which has two actions it can take at any time step: turn the light red/green for the North-South approaches, or the opposite. The state observable by each agent is an $8\times n$ matrix of bits corresponding to the four incoming and four outgoing CA vectors, and the output is the probability of turning the light red for the North-South approaches. Only one neural net is actually trained, and used by all agents, since there's no reason for them to be different in this formulation. For the DRL agent, the reward is the <em>incremental</em> average flow per lane (not the average flow per lane), which the authors mention is lower-variance. The authors use a custom infinite-horizon variant of REINFORCE they call REINFORCE-TD.</p>
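<p>A minimal sketch of the shared stochastic policy and the score-function term that REINFORCE-style methods weight by a return. A single logistic unit stands in for the authors' neural net, which is purely my simplification, and all names below are hypothetical. It does not implement REINFORCE-TD itself.</p>

```python
import math
import random

N_CELLS = 5                # cells per lane, as in the quoted setup
STATE_SIZE = 8 * N_CELLS   # four incoming plus four outgoing lanes, flattened


def ns_red_probability(weights, bias, state):
    """Probability of turning the North-South approaches red."""
    z = bias + sum(w * s for w, s in zip(weights, state))
    return 1.0 / (1.0 + math.exp(-z))


def sample_action(weights, bias, state, rng):
    """Return 1 for 'NS red', 0 for 'NS green'."""
    return 1 if rng.random() < ns_red_probability(weights, bias, state) else 0


def log_prob_gradient(weights, bias, state, action):
    """Gradient of log pi(action | state) with respect to (weights, bias)."""
    p = ns_red_probability(weights, bias, state)
    grad_w = [(action - p) * s for s in state]
    return grad_w, action - p


rng = random.Random(0)
weights, bias = [0.0] * STATE_SIZE, 0.0  # one parameter set shared by all agents
states = [[rng.randint(0, 1) for _ in range(STATE_SIZE)] for _ in range(4)]
actions = [sample_action(weights, bias, s, rng) for s in states]
print(actions)
```

<p>Every intersection calls the same functions with the same parameters; only the local state differs.</p>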
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Experiments">Experiments<a class="anchor-link" href="#Experiments">¶</a></h1><p>The authors use a longest-queue-first (LQF) greedy algorithm as their baseline for comparison, which services the lane with the longest queue length at all times.</p>
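<p>A sketch of what such a greedy baseline might look like for one intersection. The queue definition (consecutive occupied cells at the downstream end of a lane), the tie-breaking, and the function names are my assumptions, and the minimum green time is ignored.</p>

```python
def queue_length(lane):
    """Cars waiting at the light: consecutive occupied cells at the downstream end."""
    count = 0
    for occupied in reversed(lane):
        if not occupied:
            break
        count += 1
    return count


def lqf_action(north, south, east, west):
    """Return 'NS' or 'EW', whichever pair of approaches holds the longest queue."""
    ns = max(queue_length(north), queue_length(south))
    ew = max(queue_length(east), queue_length(west))
    return "NS" if ns >= ew else "EW"  # ties go to NS (arbitrary choice)


north = [False, False, True, True, True]
south = [False] * 5
east = [True, False, False, False, True]
west = [False, False, False, True, True]
print(lqf_action(north, south, east, west))  # NS
```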
<h2 id="Random-policies">Random policies<a class="anchor-link" href="#Random-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure4.png" alt="Figure 4"></p>
<p>They begin by randomly reinitializing the parameters of the neural network, and discover that ~15% of random policies are competitive (that is, they can outperform LQF for some traffic densities). They also note a previously undiscovered pattern that "all policies, no matter how bad, are best when the density exceeds approximately 75%." How odd.</p>
<h2 id="Supervised-learning-policies">Supervised learning policies<a class="anchor-link" href="#Supervised-learning-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure5.png" alt="Figure 5"></p>
<p>They then train a policy with supervised learning, and surprisingly, with only the two obvious extreme examples, the resulting policy is near-optimal.</p>
<h2 id="DRL-policies">DRL policies<a class="anchor-link" href="#DRL-policies">¶</a></h2><p><img src="https://computable.ai/images/traffic_signals_figure6.png" alt="Figure 6"></p>
<blockquote><p>Policies trained with constant demand and random initial parameters $\theta$. The label in each diagram gives the iteration number and the constant density value. First column: NS red probabilities of the extreme states, $\pi(s1)$ in dashed line and $\pi(s2)$ in solid line. The remaining columns show the flow-density diagrams obtained at different iterations, and the last column shows the iteration producing the highest flow at $k = 0.5$, if not reported on a earlier column.</p>
</blockquote>
<p>Finally, they run two experiments with DRL policies, as described above. These policies seem to do rather poorly in general compared to random search and supervised learning, and as density increases, they stop learning much of anything.</p>
<blockquote><p>We conjecture that this result is a consequence of a property of congested urban networks and has nothing to do with the algorithm to train the DRL policy.</p>
</blockquote>
<p>I'm skeptical. See my parting thoughts.</p>
<p>The other experiments the authors perform just confirm that average flow per lane does worse than incremental average flow per lane.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>In the end, I'm way more interested in the experimental setup of this paper than the conclusions. As usual, I learned a ton, and I may actually use rule 184 as a model for traffic flow on something.</li>
<li>Isn't it <em>obvious</em> given their problem formulation that the agents can't learn under conditions of congestion, since it means their input is essentially whited out? I would be more impressed with the conclusion if a neural net with complete visibility had trouble learning with congestion. It also seems to me <em>extremely</em> suggestive that a supervised policy can learn from only two examples, and I would very much like to see if the major conclusions of this paper explode with a more realistic network topology. Queueing theory contains all sorts of counterintuitive surprises, and it seems likely to me that their results are more indicative of one of those surprises, rather than some deep fact about DRL's ability to manage urban congestion.</li>
<li>It's interesting that they formulate the problem as a continuing one, against the prevailing trend in the traffic signal control literature. I agree with them, that even if you get to a state where there's no traffic, that's a function of the demand, not of the agent's choices. I bring this up because I too have found that it's <em>really quite important</em> to recognize an infinite-horizon problem when you have one, or else your agent learns to rack up debts until the end of the artificial episode when all is "forgiven".</li>
<li>It's fascinating that all random policies, no matter how bad, are best around 75% congestion. I have been admonished to avoid scheduling myself at more than 70% capacity to avoid the ringing effect. I wonder if this is an empirical vindication of that...</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Learning Compound and Composable Policies2019-08-04T00:00:00-04:002019-08-04T00:00:00-04:00Daniel Coxtag:computable.ai,2019-08-04:/articles/2019/Aug/04/learning-compound-and-composable-policies.html<p>Straightforward hierarchical RL for concurrent discovery of sub-policies and their controller.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Just a sketch this week, of <a href="https://arxiv.org/abs/1905.09668">Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies</a>.</p>
<p>I've been hearing hierarchical RL mentioned frequently lately, and while I understand it's a way to encode human expertise to achieve otherwise intractable goals, it has also seemed a bit like cheating. However, I have a day job, and this serves as a healthy dose of pragmatism. I also think that even when the goal is fundamental progress, it's often a good idea to achieve the goal <em>in any way possible</em>, and then follow up by working the cheats out of the system one by one. So when I read the abstract of this paper, I was feeling more receptive than previously.</p>
<p>Part of what made hierarchical RL seem not worth the cheating was how kludgy and inefficient the usual methods were, retraining a whole new policy from scratch for each subtask. That's why this week's paper caught my eye:</p>
<blockquote><p>... we propose an algorithm for learning both compound and composable policies <strong>within the same learning process</strong> by exploiting the off-policy data generated from the compound policy.</p>
</blockquote>
<p>Their resulting algorithm, "Hierarchical Intentional-Unintentional Soft Actor-Critic" (HIU-SAC), efficiently trains all sub-policies simultaneously, choosing actions to perform in the environment using a weighted average of the "votes" of all sub-policies, with weights given by a learned selector network (which is <em>also</em> simultaneously trained).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Composable-hierarchical-RL">Composable hierarchical RL<a class="anchor-link" href="#Composable-hierarchical-RL">¶</a></h1><h2 id="Architecture">Architecture<a class="anchor-link" href="#Architecture">¶</a></h2><p><img alt="Hierarchical policy diagram" src="https://computable.ai/images/policy_network.png#right" height="300px" width="300px" style="margin: 10px" /></p>
<p>The composite policy consists of the individual policy networks, each with its own reward function, trained to take observations $s$ in and output parameters of a conditional Gaussian. There is also a special activation vector selector network trained on the same states to produce weights corresponding to how much each constituent policy applies to the current state. All of these networks share early layers, since they all benefit from an accurate high-level state representation. Finally, some function $f$ takes all of these outputs and determines what action $a$ to <em>actually</em> take in the environment.</p>
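<p>To illustrate the data flow only, here is a sketch of one possible $f$: a weighted average of the sub-policies' Gaussian means, with the selector's raw outputs normalized by a softmax. Both choices are my assumptions and the paper's $f$ may well differ; the function names are hypothetical.</p>

```python
import math


def softmax(scores):
    """Normalize raw selector outputs into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_action_mean(sub_means, selector_scores):
    """Weighted average of sub-policy Gaussian means (one possible choice of f)."""
    weights = softmax(selector_scores)
    dim = len(sub_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, sub_means))
        for d in range(dim)
    ]


# Two sub-policies voting on a 2-D action, with equal selector scores.
print(compose_action_mean([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```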
<p><img alt="Q-value function diagram" src="https://computable.ai/images/q_fcn_network.png#left" height="250px" width="250px" style="margin: 10px" /></p>
<p>The Q function networks are similarly arranged, sharing early layers which take a state $s$ and an action $a$ to produce a Q function for each subtask, as well as a composite Q function.</p>
<div style="clear:both"> </div><h2 id="Simultaneous-learning">Simultaneous learning<a class="anchor-link" href="#Simultaneous-learning">¶</a></h2><blockquote><p>Most methods learn the composable tasks one at a time, and later, the compound task. This procedure is not scalable as all the experience collected for each learning process is only used for that specific process. Also, it is not possible to start learning more complex tasks unless all the composable policies have been successfully learned. The method proposed in this section is based on the idea that a single stream of experience can be used to improve not only the policy that is generating the behavior but also, indirectly, many other policies.</p>
</blockquote>
<p>The authors refer to the composite policy acting as the "intentional" policy (the "behavior" policy in an off-policy setting), and the composable sub-policies as the "unintentional" policies (each one a "target" policy in an off-policy setting). They use a variation on SAC to train the composite and composable policies simultaneously within the maximum entropy framework.</p>
<p>The objective function for the Q networks simply minimizes the sum of the expected mean-squared Bellman errors of all the Q networks, taken over the tuples in the replay buffer $\mathcal{D}$. The objective function for the policy is simply the sum of the objective functions for each intentional and unintentional policy. Each policy objective optimizes, in expectation over the states in $\mathcal{D}$ and the actions the policy could take, the difference between the Q value and the log-probability of the selected action (scaled by temperature $\alpha$). HIU-SAC then alternates between policy evaluation and policy improvement steps following SAC.</p>
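<p>Written out in the standard SAC form, these objectives look roughly like the following. The index $k$ (ranging over the compound task and every composable task), the per-task reward $r_k$, and the omission of target networks are my own notational assumptions, so treat this as a sketch rather than the paper's exact equations. Both quantities are minimized.</p>
$$J_Q = \sum_{k} \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_k(s,a) - \left(r_k + \gamma\, \mathbb{E}_{a' \sim \pi_k}\left[Q_k(s',a') - \alpha \log \pi_k(a' \mid s')\right]\right)\right)^2\right]$$
$$J_\pi = \sum_{k} \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_k(\cdot \mid s)}\left[\alpha \log \pi_k(a \mid s) - Q_k(s,a)\right]$$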
<h2 id="The-importance-of-maximizing-entropy-to-adequate-exploration">The importance of maximizing entropy to adequate exploration<a class="anchor-link" href="#The-importance-of-maximizing-entropy-to-adequate-exploration">¶</a></h2><p>It is interesting that the entropy-maximizing RL objective was <em>absolutely necessary</em> for exploring broadly enough to train all of these policies at once.</p>
<blockquote><p>Note that populating the replay memory buffer with rich experiences is essential for acquiring multiple skills in an off-policy manner. The composable policies learned unintentionally had similar performance than the policies obtained in single-task formulations only when the compound policy was able to efficiently explore the environment. For this reason, the algorithm was built on a maximum entropy RL framework to favor exploration during the learning process.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>In a way, the methods proposed here seem rather obvious, and I found this paper quite easy to understand given that it violated none of my expectations. I also haven't been paying enough attention to hierarchical RL to know off-hand why training the sub-policies in parallel off of the same recorded environment interactions hasn't been tried before (or whether it has been without my notice). Perhaps it was necessary for off-policy RL to reach a level of maturity sufficient for sub-policies to see enough relevant data to train? In any case, don't hear me faulting the authors for trying the obvious. It is a relief, and <em>non</em>-obvious, that a straightforward formulation works so well.</li>
<li>I'd love to see this work combined with imitation learning and inverse RL to figure out what sub-policies are necessary in the first place from demonstrations. That seems like a very practical framework for real-world learning.</li>
</ol>
</div>
</div>
</div>
Efficient exploration with self-imitation learning2019-07-28T00:00:00-04:002019-07-28T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-28:/articles/2019/Jul/28/efficient-exploration-with-self-imitation-learning.html<p>I wonder if that happens every time...</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Several papers caught my eye this week, but I'll be discussing only <a href="https://arxiv.org/abs/1907.10247">Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy</a> in more depth. I'm choosing this paper because, as happens sometimes, I had this idea myself a few weeks ago. It's especially exciting to see something you suspected might improve the world fleshed out and vindicated.</p>
<p>This is the basic form of my shower-thought idea:</p>
<blockquote><p>This paper investigates the imitation of diverse past trajectories and how that leads [to] further exploration and avoids getting stuck at a sub-optimal behavior. Specifically, we propose to use a buffer of the past trajectories to cover diverse possible directions. Then we learn a trajectory-conditioned policy to imitate any trajectory from the buffer, treating it as a demonstration. After completing the demonstration, the agent performs random exploration.</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-problem">The problem<a class="anchor-link" href="#The-problem">¶</a></h1><p><img src="https://computable.ai/images/maze_icon_map.png#right" alt="Maze"></p>
<p>The main problem the authors want to solve is insufficient exploration leading to a sub-optimal policy. If you don't explore your environment enough, you will find local rewards, but miss globally optimal rewards. In this maze (their Figure 1), you can see that an agent that fails to explore will collect two apples in the next room, but may miss acquiring the key, unlocking the door, collecting an apple, and discovering the treasure.</p>
<p>In the notoriously difficult Atari game (for RL agents) Montezuma's Revenge, it is similarly extremely unlikely that random exploration suffices to explore the environment and achieve a high score. The authors report state-of-the-art performance without expert demonstrations on Montezuma's Revenge, netting 25k points.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="SOTA-without-demonstrations">SOTA without demonstrations<a class="anchor-link" href="#SOTA-without-demonstrations">¶</a></h1><p>So, more precisely, how did they achieve this, and why does it work?</p>
<blockquote><p>The main idea of our method is to maintain a buffer of diverse trajectories collected during training and to train a trajectory-conditioned policy by leveraging reinforcement learning and supervised learning to roughly follow demonstration trajectories sampled from the trajectory buffer. Therefore, the agent is encouraged to explore beyond various visited states in the environment and gradually push its exploration frontier further... We name our method as Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL).</p>
</blockquote>
<h2 id="The-trajectory-buffer">The trajectory buffer<a class="anchor-link" href="#The-trajectory-buffer">¶</a></h2><p>Their trajectory buffer $\mathcal{D}$ contains $N$ 3-tuples $\{\left(e^{(1)}, \tau^{(1)}, n^{(1)}\right), \left(e^{(2)}, \tau^{(2)}, n^{(2)}\right), \ldots \left(e^{(N)}, \tau^{(N)}, n^{(N)}\right) \}$ where $e^{(i)}$ is a high-level state representation, $\tau^{(i)}$ is the shortest trajectory achieving the highest reward and arriving at $e^{(i)}$, and $n^{(i)}$ is the number of times $e^{(i)}$ has been encountered. Whenever they roll out a new episode, they check each high-level state representation encountered against those in $\mathcal{D}$, increment $n$, and if $\tau$ is better they replace $\tau$ for that entry.</p>
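<p>The bookkeeping described above can be sketched as follows. The dictionary layout, the representation of a trajectory as the list of embeddings visited so far, and the tie-break in favor of shorter trajectories are my reading of the description, not the authors' code; all names are hypothetical.</p>

```python
def is_better(candidate, incumbent):
    """Higher return wins; ties go to the shorter trajectory."""
    cand_return, cand_traj = candidate
    inc_return, inc_traj = incumbent
    if cand_return != inc_return:
        return cand_return > inc_return
    return len(cand_traj) < len(inc_traj)


def update_buffer(buffer, episode):
    """Update `buffer` in place from one episode.

    `buffer` maps an embedding e to {"traj": tau, "return": R, "count": n}.
    `episode` is a list of (embedding, reward) pairs in time order.
    """
    total = 0.0
    prefix = []
    for embedding, reward in episode:
        total += reward
        prefix.append(embedding)
        entry = buffer.get(embedding)
        if entry is None:
            # First visit: store the trajectory that led here.
            buffer[embedding] = {"traj": list(prefix), "return": total, "count": 1}
            continue
        entry["count"] += 1
        if is_better((total, prefix), (entry["return"], entry["traj"])):
            entry["traj"], entry["return"] = list(prefix), total


buffer = {}
update_buffer(buffer, [("a", 0.0), ("b", 1.0), ("c", 0.0)])
update_buffer(buffer, [("a", 0.0), ("c", 2.0)])
print(buffer["c"])
```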
<h2 id="Sampling">Sampling<a class="anchor-link" href="#Sampling">¶</a></h2><p>When training their trajectory-conditioned policy, they sample each 3-tuple with weight $\frac{1}{\sqrt{n^{(i)}}}$. Notice that this will cause them to sample <em>less</em> frequently-visited states more often, encouraging exploration.</p>
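<p>A small sketch of count-based sampling with these weights, using only the standard library. The visit counts are made up for illustration and the function names are my own.</p>

```python
import math
import random


def sampling_probabilities(counts):
    """Normalize the 1/sqrt(n) weights into a probability distribution."""
    weights = [1.0 / math.sqrt(n) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(counts, rng):
    """Draw the index of one buffer entry, favoring rarely visited ones."""
    weights = [1.0 / math.sqrt(n) for n in counts]
    return rng.choices(range(len(counts)), weights=weights, k=1)[0]


counts = [1, 4, 100]  # hypothetical visit counts n for three buffer entries
print(sampling_probabilities(counts))
print(sample_index(counts, random.Random(0)))
```

<p>With these counts the entry seen once is drawn roughly 62% of the time and the entry seen a hundred times only about 6% of the time, so the square root softens the preference for novelty rather than making it absolute.</p>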
<h2 id="Imitation-reward">Imitation reward<a class="anchor-link" href="#Imitation-reward">¶</a></h2><p>Given a trajectory $g$ sampled from the buffer, and during interaction with the environment, the agent receives a positive reward if the current state has an embedding within some $\Delta t$ of the current timestep in $g$. Otherwise the imitation reward is 0. Once it reaches the end of $g$, there is no further imitation reward, and it explores randomly. The imitation reward is one of two components of the $r^{DTSIL}_{t}$ RL reward, where the other is a simple monotonic function of the reward received at each timestep.</p>
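<p>A sketch of the imitation reward as described in this post. Indexing the demonstration by the agent's own timestep and the bonus value of 0.1 are my simplifying assumptions; the paper's bookkeeping of progress along $g$ may differ.</p>

```python
def imitation_reward(embedding, t, demo, delta_t, bonus=0.1):
    """Return `bonus` if `embedding` appears in `demo` within `delta_t` steps of t."""
    if t >= len(demo):
        # The demonstration is finished: no further imitation reward.
        return 0.0
    lo = max(0, t - delta_t)
    hi = min(len(demo), t + delta_t + 1)
    return bonus if embedding in demo[lo:hi] else 0.0


demo = ["a", "b", "c", "d"]  # hypothetical demonstration g as a list of embeddings
print(imitation_reward("c", 1, demo, delta_t=1))   # 0.1: "c" is one step ahead
print(imitation_reward("d", 0, demo, delta_t=1))   # 0.0: "d" is too far ahead
print(imitation_reward("a", 10, demo, delta_t=1))  # 0.0: past the end of g
```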
<h2 id="Policy-architecture">Policy architecture<a class="anchor-link" href="#Policy-architecture">¶</a></h2><p>The DTSIL policy architecture is recurrent and attentional, inspired by machine translation!</p>
<blockquote><p>Inspired by neural machine translation methods, the demonstration trajectory is the source sequence and the incomplete trajectory of the agent’s state representations is the target sequence. We apply a recurrent neural network and an attention mechanism to the sequence data to predict actions that would make the agent to follow the demonstration trajectory.</p>
</blockquote>
<h2 id="RL-objective">RL objective<a class="anchor-link" href="#RL-objective">¶</a></h2><p>DTSIL is trained using a policy gradient algorithm (PPO, in their experiments), and RL loss</p>
$$\mathcal L^{RL} = {\mathbb{E}}_{\pi_\theta} [-\log \pi_\theta(a_t|e_{\leq t}, o_t, g) \widehat{A}_t]$$<p>where $$\widehat{A}_t=\sum^{n-1}_{d=0} \gamma^{d}r^\text{DTSIL}_{t+d} + \gamma^n V_\theta(e_{\leq t+n}, o_{t+n}, g) - V_\theta(e_{\leq t}, o_t, g)$$</p>
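<p>The advantage estimate is an ordinary $n$-step return minus a baseline, and is easy to compute directly. In this sketch <code>values[t]</code> stands in for $V_\theta(e_{\leq t}, o_t, g)$ and <code>rewards[t]</code> for $r^\text{DTSIL}_t$; the numbers are made up for illustration.</p>

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """A_t = sum_{d<n} gamma^d r_{t+d} + gamma^n V_{t+n} - V_t."""
    discounted = sum(gamma ** d * rewards[t + d] for d in range(n))
    return discounted + gamma ** n * values[t + n] - values[t]


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5, 1.0]
print(n_step_advantage(rewards, values, t=0, n=3, gamma=0.5))  # 1.125
```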
<h2 id="SL-objective">SL objective<a class="anchor-link" href="#SL-objective">¶</a></h2><p>In each parameter optimization step, they also include a supervised loss that maximizes the log probability of taking the action that imitates the chosen demonstration exactly, to better leverage a past trajectory $g$.</p>
$$\mathcal L^\text{SL} = - \log \pi_\theta(a_t|e_{\leq t}, o_t, g) \text{, where } g = \{e_0, e_1, \cdots, e_{|g|}\}$$<h2 id="Optimization">Optimization<a class="anchor-link" href="#Optimization">¶</a></h2><p>The final parameter update is thus</p>
$$\theta \gets \theta - \eta \nabla_\theta (\mathcal{L}^\text{RL}+\beta \mathcal{L}^\text{SL})$$
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I <em>love</em> seeing methods developed for generative language models used in another context entirely, to generate another kind of sequence. I'm overjoyed that it worked well.</li>
<li>They need a high-level embedding for two reasons: first because storing entire trajectories exactly in memory is expensive, and second because it's quite difficult to re-execute a previously-encountered trajectory exactly, so in order for this method to work at all it's important that an <em>approximate</em> re-execution be possible.</li>
</ol>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" linebreaks: { automatic: true, width: '95% container' }, " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
Keeping to the Narrow Path2019-07-21T00:00:00-04:002019-07-21T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-21:/articles/2019/Jul/21/keeping-to-the-narrow-path.html<p>Better imitation learning with self-correcting policies by negative sampling.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>This week's highlight is a paper on imitation learning: <a href="https://arxiv.org/abs/1907.05634">Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling</a>, chosen again for pragmatic reasons. The problem my team is currently working on gives us two reasons to want high sample efficiency: training would be prohibitively slow without something to kickstart it, and actions taken in the real world can get expensive.</p>
<p>I know I said I'd be experimenting with shorter, more bite-sized posts, but... next time. (If you want that, you can just stop reading after the "Key intuition" section.)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-problem">The problem<a class="anchor-link" href="#The-problem">¶</a></h1><p>Learning from demonstrations is more difficult than it may seem at first glance. The trouble mainly stems from covariate shift: the input distribution your agent will see in production is very likely to be different from that encountered during training. Many machine learning algorithms have this problem, reinforcement learning algorithms included, but imitation learning has it especially bad, for a simple reason: the expert demonstrations you are attempting to follow necessarily explore a very small subset of the state space. The whole <em>point</em> of them is to stay on good trajectories, meaning bad trajectories never get explored.</p>
<p>This causes two issues:</p>
<ol>
<li>The agent can't in general figure out how to get back into the subset of state space where the expert demonstrations apply, even if it gets only slightly off-course, and</li>
<li>Value functions for states and actions are unconstrained on unseen states, so their estimates there can be badly wrong, making it very <em>likely</em> that the agent will wander off as soon as it's allowed.</li>
</ol>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Key-intuition">Key intuition<a class="anchor-link" href="#Key-intuition">¶</a></h1><p>The authors solve this problem by pre-training with supervised learning using a loss function that drives down the value of all states outside of those explored in the expert demonstrations $U$, by an amount proportional to their Euclidean distance from the closest state in $U$. In their own words:</p>
<blockquote><p>Consider a state $s$ in the demonstration and its nearby state $\tilde{s}$ that is not in the demonstration. The key intuition is that $\tilde{s}$ should have a lower value than $s$, because otherwise $\tilde{s}$ likely should have been visited by the demonstrations in the first place. If a value function has this property for most of the pair $(s,\tilde{s})$ of this type, the corresponding policy will tend to correct its errors by driving back to the demonstration states because the demonstration states have locally higher values.</p>
</blockquote>
<p>And Figure 1 is a nice visual demonstration:</p>
<p><a href="https://computable.ai/images/VINS_Figure_1.jpeg"><img alt="VINS Figure 1" src="https://computable.ai/images/VINS_Figure_1.jpeg" /></a></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Value-Iteration-with-Negative-Sampling-(VINS)">Value Iteration with Negative Sampling (VINS)<a class="anchor-link" href="#Value-Iteration-with-Negative-Sampling-(VINS)">¶</a></h1><p>Into the weeds now.</p>
<h2 id="Self-correctable-policy">Self-correctable policy<a class="anchor-link" href="#Self-correctable-policy">¶</a></h2><p>The first bit of their algorithm is the definition of their self-correcting policy. It's essentially a formalization of what we said above about $s$ and $\tilde{s}$.</p>
<p>If $s \in U$ (if $s$ is in the expert demonstrations), then $$V(s) = V^{\pi_e}(s) \pm \delta_V$$ ("just what the value would be in the expert demonstrations, plus some error").</p>
<p>But if $s \not\in U$, $$V(s) = V^{\pi_e}(\Pi_U(s)) - \lambda \|s-\Pi_U(s)\| \pm \delta_V$$ (where $\Pi_U$ gives the closest $s \in U$, so $V(s)$ is "the value of the closest $s \in U$, <em>minus $\lambda$ times the distance to that</em> $s \in U$, plus some error").</p>
<p>Then the induced policy from this value function is $$\pi(s) \triangleq \underset{a: \|a-\pi_{BC}(s)\|\le \zeta}{\operatorname{argmax}} ~V(M(s, a))$$</p>
<p>Where $M(s,a)$ is a learned dynamical model of the environment that gives the next state given the current state and action. $\pi_{BC}(s)$ is the "behavioral clone" policy from the expert demonstrations.</p>
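<p>As a rough illustration of the induced policy, the sketch below approximates the constrained argmax by random search: it samples candidate actions within distance $\zeta$ of the behavioral-clone action and keeps the one whose predicted next state has the highest value. The random-search strategy, the function names, and the toy value function, model, and cloned policy are all assumptions made for this example.</p>

```python
import numpy as np

def induced_policy(s, value_fn, model, pi_bc, zeta=0.1, n_candidates=256, rng=None):
    """Approximate argmax of V(M(s, a)) over actions within zeta of pi_bc(s)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a_bc = np.asarray(pi_bc(s), dtype=float)
    # Random unit directions, scaled to lengths in [0, zeta).
    directions = rng.normal(size=(n_candidates, a_bc.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = zeta * rng.uniform(size=(n_candidates, 1))
    # Always include the behavioral-clone action itself as a candidate.
    candidates = np.vstack([a_bc, a_bc + radii * directions])
    scores = [value_fn(model(s, a)) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy 1-D problem: demonstrations sit at the origin, so value falls off with distance.
toy_value = lambda s: -float(np.linalg.norm(s))
toy_model = lambda s, a: s + a
toy_bc = lambda s: np.zeros(1)  # a deliberately unhelpful cloned policy

action = induced_policy(np.array([1.0]), toy_value, toy_model, toy_bc, zeta=0.1)
```

<p>Starting at $s=1$ with a cloned policy that does nothing, the selected action is negative: the value function pulls the agent back toward the demonstration state, but never by more than $\zeta$.</p>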
<h2 id="RL-algorithm">RL algorithm<a class="anchor-link" href="#RL-algorithm">¶</a></h2><p>To actually achieve $V(M(s,a))$ with the necessary properties, they select a state $s$ from the demonstrations, perturb it a bit to get $\tilde{s}$ nearby, and use the original state $s$ to approximate $\Pi_U(\tilde{s})$ in the following loss function.</p>
$$\mathcal{L}_{ns}(\phi)= \mathbf{E}_{s \sim \rho^{\pi_e}, \tilde{s} \sim perturb(s)} \left(V_{\bar \phi}(s) - \lambda \|s-\tilde{s}\|- V_\phi(\tilde{s}) \right)^2$$<p>Finally, here's the algorithm that uses this and the earlier policy definition:</p>
<p><img src="https://computable.ai/images/VINS_Algorithm_2.jpeg#center" alt="VINS Algorithm 2"></p>
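<p>The negative sampling loss itself is short enough to sketch in NumPy. Here the perturbation is Gaussian noise with scale <code>sigma</code>, and both value functions are plain Python callables over batches of states; those choices, and all the names, are assumptions for illustration rather than the authors' implementation.</p>

```python
import numpy as np

def negative_sampling_loss(v_phi, v_target, demo_states, lam=1.0, sigma=0.1, rng=None):
    """Mean of (V_target(s) - lam * ||s - s_tilde|| - V_phi(s_tilde))^2."""
    rng = np.random.default_rng(0) if rng is None else rng
    s = np.asarray(demo_states, dtype=float)
    # Perturb each demonstration state to get a nearby off-demonstration state.
    s_tilde = s + sigma * rng.normal(size=s.shape)
    dist = np.linalg.norm(s - s_tilde, axis=1)
    # The demonstration state s stands in for the projection of s_tilde onto U.
    target = v_target(s) - lam * dist
    return float(np.mean((target - v_phi(s_tilde)) ** 2))

# Toy check: all demonstration states are at the origin with value 0.
demo = np.zeros((8, 2))
cone = lambda states: -np.linalg.norm(states, axis=1)  # drops linearly with distance
flat = lambda states: np.zeros(len(states))            # ignores distance entirely

loss_cone = negative_sampling_loss(cone, flat, demo, lam=1.0)
loss_flat = negative_sampling_loss(flat, flat, demo, lam=1.0)
```

<p>A value function that already decays linearly away from the demonstrations incurs essentially zero loss, while a flat one is penalized.</p>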
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I thought it was quite strange that they learned $V(s)$ and a dynamical model $M(s,a)$, and then used $V(M(s,a))$ in the algorithm. I thought, "Why not just learn $Q$?" The answer was given in their Section A appendix, and was quite interesting. I'm not sure it applies to our case, but it's important. TL;DR $Q(s,a)$ learned from demonstrations <em>alone</em> is degenerate, because there's always a $Q$ that perfectly matches the demonstrations <em>and doesn't depend at all on</em> $a$. </li>
<li>One of my coworkers (and upcoming Computable author!) wondered to me if the induced policy could be made explicit, by explicitly training a policy network to bring the agent back into safe territory. It could be trained with gradient descent, because $V$ and $M$ are just networks, so $V(M(s,a))$ is differentiable with respect to $a$, and the technique for training deterministic policies just follows the gradient of the $Q$ function. I wonder too.</li>
</ol>
</div>
</div>
</div>
Way Off-Policy Batch DRL2019-07-14T00:00:00-04:002019-07-14T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-14:/articles/2019/Jul/14/way-off-policy-batch-drl.html<p>Pre-training using a generative model of pre-recorded trajectories and bias correction.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>Only one paper this week, <em>not</em> because <a href="https://arxiv.org/abs/1905.04819">others</a> failed to catch my eye, but for brevity. Let me know in the comments if you agree that shorter or more focused articles are more attractive. So this week I'll be examining just one paper: <a href="https://arxiv.org/abs/1907.00456">Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog</a>. As with last week's papers, this week's is interesting to me professionally. Batch DRL is a way to solve the sample efficiency problem, from a certain perspective. It's mostly the online learning that costs too much when sample efficiency is low, so solving the problems that come with attempting to train offline might allow us to do many of the same things we could do if we had high online sample efficiency.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="RL-for-open-domain-dialog-generation">RL for open-domain dialog generation<a class="anchor-link" href="#RL-for-open-domain-dialog-generation">¶</a></h1><p>The authors' domain is dialog generation. They want to build a better chat bot, and they have quite a few recorded conversations. RL is good at refining these processes, but has a cold-start problem, plus they would certainly prefer to make use of the data they have on-hand. For this, they need to be able to make use of offline data, hence "<em>Way</em> Off-Policy". This data is so off-policy it wasn't even <em>generated</em> by a policy.</p>
<p>So they want to train DRL from samples acquired from some other control of the system (in their case, human interaction data), much like <a href="https://arxiv.org/abs/1704.03732">Deep Q-learning from Demonstrations</a>. There are a couple of reasons this is important for others such as myself:</p>
<blockquote><p>First, since collecting real-world interaction data can be expensive and time-consuming, algorithms must be able to leverage off-policy data - collected from vastly different systems, far into the past - in order to learn.</p>
<p>Second, it is often necessary to carefully test a policy before deploying it to the real world; for example, to ensure its behavior is safe and appropriate for humans. Thus the algorithm must be able to learn offline first, from a static batch of data, without the ability to explore</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="A-generative-model-+-Q-learning">A generative model + Q learning<a class="anchor-link" href="#A-generative-model-+-Q-learning">¶</a></h1><p>The authors first pre-train a generative model on the distribution of collected trajectories, and initialize the Q networks from this model. They then sample a fixed number of actions from it, and output the one with the highest Q-value as their policy's decision. In later reinforcement learning, they penalize their model for KL-divergence from this distribution.</p>
<blockquote><p>To perform batch Q-learning, we first pre-train a generative model of $p(a|s)$ using a set of known environment trajectories. In our case, this model is then used to generate the batch data via human interaction. The weights of the Q-network and target Q-network are initialized from the pre-trained model, which helps reduce variance in the Q-estimates and works to combat overestimation bias. To train $Q_{θ_π}$ we sample < $s_t$, $a_t$, $r_t$, $s_{t+1}$ > tuples from the batch, and update the weights of the Q-network to approximate Eq. 1. This forms our baseline model, which we call Batch Q</p>
</blockquote>
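<p>The action-selection step described above can be sketched as follows, under the simplifying assumption of a small discrete action set (the paper's actions are dialog outputs). The prior probabilities, Q-values, and function name are made up for illustration.</p>

```python
import numpy as np

def select_action(prior_probs, q_values, n_samples=5, rng=None):
    """Sample candidate actions from the pre-trained prior, return the one with highest Q."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = rng.choice(len(prior_probs), size=n_samples, p=prior_probs)
    best = candidates[np.argmax(q_values[candidates])]
    return int(best), candidates

# Action 3 has a (possibly overestimated) huge Q-value, but the prior never proposes it.
prior = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.1, 0.9, 0.4, 10.0])
action, candidates = select_action(prior, q, n_samples=5)
```

<p>Because candidates come only from the prior, an action the data never supports cannot be chosen, no matter how optimistic its Q-estimate is.</p>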
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overestimation-bias">Overestimation bias<a class="anchor-link" href="#Overestimation-bias">¶</a></h1><blockquote><p>Most deep RL algorithms fail to learn from data that is not heavily correlated with the current policy. Even models based on off-policy algorithms like Q-learning fail to learn when the model is not able to explore during training. This is due to the fact that such algorithms are inherently optimistic in the face of uncertainty.</p>
</blockquote>
<p>If you’re taking the <code>max</code> of something (as in Bellman-equation-based algorithms), then the higher the variance, the higher the <code>max</code> value. This causes an over-estimation bias. We may have seen a really high value for some state once, so now we over-value that state, despite it being atypical. It may not be immediately obvious why this is a <em>problem</em>, but which states are we likely to overvalue? Precisely the states we haven't visited often. Why is <em>that</em> a problem? This sounds good for exploration, right? But if we're trying to train our agent with canned data, it's important that the live agent stick pretty close to the states where the canned data does well, and it's counter-productive to have it believe that everywhere <em>but</em> the pre-explored state space is worth exploring.</p>
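<p>A quick simulation shows the effect. Every action below has a true value of exactly zero, and the only thing that changes is how noisy the estimates are; the number of actions, trial count, and noise levels are arbitrary choices for the demonstration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)   # every action is truly worth 0
n_trials = 10_000

def mean_max_estimate(noise_std):
    """Average, over many trials, of the max over noisy Q-estimates."""
    noisy_q = true_q + noise_std * rng.normal(size=(n_trials, true_q.size))
    return float(noisy_q.max(axis=1).mean())

bias_low = mean_max_estimate(0.1)    # well-visited states: small estimation noise
bias_high = mean_max_estimate(1.0)   # rarely-visited states: large estimation noise
```

<p>Both averages come out positive even though the true maximum is zero, and the noisier estimates are inflated far more.</p>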
<p>A popular solution to the overestimation problem in Q-learning algorithms is to train <em>two</em> Q networks on the same data, put the input through both, and take the minimum value. This helps with the bias because they'll likely disagree unless we can be really <em>certain</em> of the value of the input, and if they disagree we can go with the more pessimistic estimate. The authors of the current paper take a different tack, training a single neural net with dropout, and using the disagreement with different dropout masks as an estimate of uncertainty.</p>
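<p>Here is a rough sketch of the dropout idea with a tiny, randomly initialized network standing in for a trained Q-network. The architecture, keep probability, number of passes, and the use of the minimum as the pessimistic estimate are all assumptions for illustration; the paper's exact recipe may differ.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 32))   # stand-in weights, not a trained network
w2 = rng.normal(size=32)

def q_with_dropout(state, keep_prob=0.8):
    """One stochastic forward pass with a fresh dropout mask on the hidden layer."""
    hidden = np.maximum(state @ W1, 0.0)          # ReLU
    mask = rng.random(hidden.shape) < keep_prob
    hidden = hidden * mask / keep_prob            # inverted dropout scaling
    return float(hidden @ w2)

state = np.array([0.5, -1.0, 0.3, 2.0])
samples = np.array([q_with_dropout(state) for _ in range(50)])

pessimistic_q = samples.min()   # lower estimate, analogous to min over two networks
uncertainty = samples.std()     # disagreement across dropout masks
```

<p>The spread across masks plays the same role as the disagreement between two separately trained networks, at the cost of only one set of weights.</p>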
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li><p>I didn't talk much about their model architecture, which is "Variational Hierarchical Recurrent Encoder Decoder (VHRED)", largely because I think if I ever tried to make use of this directly I would employ transformers instead. They do mention that transformer architectures are a "powerful alternative", but they chose to work with hierarchical architectures so they could extend their work to hierarchical control in the future. That's interesting. In my own work at the moment, the important thing is the "way off-policy" part, not so much the chat bot part.</p>
</li>
<li><p>It's very interesting to me that both of the methods for correcting overestimation bias make use of uncertainty estimators that I've seen mentioned elsewhere:</p>
<ul>
<li><a href="https://arxiv.org/abs/1905.09638">Estimating Risk and Uncertainty in Deep Reinforcement Learning</a></li>
</ul>
<blockquote><p>...we show that the disagreement between only two neural networks is sufficient to produce a low-variance estimate of the epistemic uncertainty on the return distribution, thus providing a simple and computationally cheap uncertainty metric.</p>
</blockquote>
<ul>
<li><a href="https://arxiv.org/abs/1506.02142">Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</a></li>
</ul>
<blockquote><p>...we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs</p>
</blockquote>
</li>
<li><p>This article wasn't really shorter than if I had done multiple papers, less deeply. I'll have to practice at that, not least because it's time-consuming, but information is valuable. How does Adrian Colyer do this every <em>day</em>?</p>
</li>
</ol>
</div>
</div>
</div>
A New Series arXiv Sampler2019-07-07T00:00:00-04:002019-07-07T00:00:00-04:00Daniel Coxtag:computable.ai,2019-07-07:/articles/2019/Jul/07/a-new-series-arxiv-sampler.html<p>Beginning a new series highlighting a few interesting RL papers on the arXiv each week. This week: Simple curriculum learning, learning to interact with humans, and warm starting RL with propositional logic.</p>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="New-series">New series<a class="anchor-link" href="#New-series">¶</a></h1><p>This post begins a weekly series highlighting one or more RL papers in the previous week's cs.AI arXiv stream that caught my eye (making no guarantees about the correlation between what catches my eye and what ultimately turns out to be useful, important, etc). I'll be prioritizing sustainability over most other factors, but I do hope to show you some code from time to time.</p>
<p>I read these papers to differing degrees as I have time, so there will likely be some variability in descriptive volume. However, I do pledge to make only justified statements about them so far as I know, and I welcome errata in the comments. I'm still experimenting with the format and voice, so please leave me feedback early and often to influence the series.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="This-week">This week<a class="anchor-link" href="#This-week">¶</a></h1><p>All of this week's papers piqued my interest because of the sample-efficiency problem in modern DRL. Reinforcement learning algorithms need to interact with the environment quite a bit before they become good at a task, and anything that can shorten this time is of interest. My group is currently working on a learning task with a very low sample rate, so we are actively on the hunt for anything that improves sample efficiency.</p>
<ul>
<li><a href="https://arxiv.org/abs/1906.12266">Growing Action Spaces</a>, by Farquhar et al. at Oxford and Facebook AI Research.</li>
<li><a href="https://arxiv.org/abs/1906.10187v2">Learning to Interactively Learn and Assist</a>, by Woodward et al. at Google Brain.</li>
<li><a href="https://arxiv.org/abs/1902.06007v2">ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning</a>, by Silva et al. at Georgia Institute of Technology.</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Growing-Action-Spaces">Growing Action Spaces<a class="anchor-link" href="#Growing-Action-Spaces">¶</a></h2><p>Growing Action Spaces proposes a form of "curriculum learning", where a more complex task is broken down into a sequence of simpler tasks, sometimes by humans, sometimes automatically. In this case, the authors improved the learning speed of their agent by initially giving it fewer actions to work with, training for a while, and then alternating between giving it more actions to work with and training.</p>
<p>Interestingly, they were working in StarCraft, which is a real-time strategy (RTS) game, where you have to control multiple units simultaneously in a coordinated fashion to achieve some goal. Thus, in their domain, the size of the action space didn't just come from continuity or a really large discrete action space, but from the fact that the actions they were capable of taking were <em>combinatorial</em>. That is, they had to train an agent to take actions from a space including any combination of primitive actions, as well as any combination of units; a daunting task.</p>
<p>Their solution is brilliant, and highly general: The authors broke the action space up into a hierarchy of action spaces by grouping units, and requiring that the same action be taken by all units within the same group. Then as training progressed, more groups were allowed to act independently. This resulted in a tractable problem at each stage of training, and overall high-performance policies that would have been prohibitively complex with conventional DRL algorithms.</p>
<p>If you or I want to apply this method to our own problems, the key requirement is to come up with a suitable way of breaking large action spaces into hierarchies of progressively smaller ones.</p>
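<p>To give a feel for what such a hierarchy might look like, here is a small sketch with four units and three primitive actions. The unit names, the groupings, and the helper function are hypothetical; the point is only how the joint action space grows as groups are split.</p>

```python
def expand_group_actions(group_actions, groups):
    """Map one action per group to one action per unit (all units in a group act alike)."""
    unit_actions = {}
    for group_id, units in groups.items():
        for unit in units:
            unit_actions[unit] = group_actions[group_id]
    return unit_actions

# A curriculum of progressively finer groupings of the same four units.
curriculum = [
    {"all": ["u0", "u1", "u2", "u3"]},
    {"left": ["u0", "u1"], "right": ["u2", "u3"]},
    {"u0": ["u0"], "u1": ["u1"], "u2": ["u2"], "u3": ["u3"]},
]

n_primitive_actions = 3
# Joint action space size at each stage: one independent choice per group.
sizes = [n_primitive_actions ** len(groups) for groups in curriculum]

stage_two = expand_group_actions({"left": "attack", "right": "move"}, curriculum[1])
```

<p>With these numbers the agent faces 3, then 9, then 81 joint actions, so the early stages are far easier to explore than the final one.</p>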
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Learning-to-Interactively-Learn-and-Assist">Learning to Interactively Learn and Assist<a class="anchor-link" href="#Learning-to-Interactively-Learn-and-Assist">¶</a></h2><p>Reinforcement learning typically depends on a sparse reward signal and random exploration, both of which contribute to poor sample efficiency in modern algorithms. One method of improving sample efficiency and solving the exploration problem is imitation learning, where the agent is pre-trained to mimic expert behavior. However, expert demonstrations are expensive, and it's often difficult to know how much and of what kind will suffice. These are the problems Learning to Interactively Learn and Assist attempts to solve by proposing a different paradigm entirely, one without explicit demonstrations or a reward function.</p>
<p>The goal is for an agent and a "principal" (say, a human) to learn to work together to accomplish the principal's purpose. The agent takes its cues from the principal's behavior, and acts helpfully. This requires prior understanding, both of the environment and of what constitutes communication from the principal.</p>
<p>To get to this point, the authors trained an agent jointly with a "human surrogate" principal on a variety of tasks in the same environment. Each time, the principal knows the task (as part of its observation input), and the agent does not. They receive a joint reward at the end of the episode.</p>
<blockquote><p>By informing the principal of the current task and withholding rewards and gradient updates until the end of each task, the agents are encouraged to emerge interactive learning behaviors in order to inform the assistant of the task and allow them to contribute to the joint reward.</p>
</blockquote>
<p>Prior domain knowledge required to jointly accomplish a given task is trained into the agent ahead of time this way, along with the methods of communication. Actions and observations are restricted to the environment, so that later the principal may be replaced with a human.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="ProLoNets:-Neural-encoding-Human-Experts'-Domain-Knowledge-to-Warm-Start-Reinforcement-Learning">ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning<a class="anchor-link" href="#ProLoNets:-Neural-encoding-Human-Experts'-Domain-Knowledge-to-Warm-Start-Reinforcement-Learning">¶</a></h2><p>ProLoNets stands for "Propositional Logic Nets", which are a neural network architecture and method of initialization that allows a domain expert to encode initial behavior for a DRL agent in the form of propositional logic.</p>
<p>To give you the flavor:</p>
<blockquote><p>To illustrate this more practically, we consider the simplest case of a cart pole ProLoNet with a single decision node. Assume we have solicited the following from a domain expert: "If the cart's $x$ position is right of center, move left; otherwise, move right," and that they indicate <code>x_position</code> is the first input feature and that the center is at 0. We therefore initialize our primary node $D_0$ with $w_0=[1,0,0,0]$ and $c_0=0$. We then specify $l_0$ to be a new leaf with a prior of $[1,0]$. Finally, we set the path to $l_0$ to be $D_0$ and the path $l_1$ to be $(1-D_0)$. Consequently for each state, the probability distribution over the agent's two actions is a softmax over $(D_0*l_0+(1-D_0)*l_1)$</p>
</blockquote>
<p>I've barely skimmed this paper so I don't know what each of the components means, but I gather that a human-authored decision tree can be translated directly into a correctly-initialized neural network architecture, and an actor-critic algorithm takes over from there to improve beyond the human expert's baseline.</p>
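<p>My reading of the quoted cart pole example, as a NumPy sketch. The quote gives $w_0$, $c_0$, and $l_0$; the sigmoid form of the decision node, the sharpness parameter <code>alpha</code>, the second leaf $l_1=[0,1]$, and the action ordering (left, right) are my assumptions, so treat this as a guess at the mechanics rather than the authors' code.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w0 = np.array([1.0, 0.0, 0.0, 0.0])  # the decision node looks only at x_position
c0 = 0.0                             # comparison threshold: the center of the track
l0 = np.array([1.0, 0.0])            # leaf prior favoring action 0 (assumed: move left)
l1 = np.array([0.0, 1.0])            # assumed complementary leaf favoring move right

def action_probs(state, alpha=1.0):
    """Probability over (left, right) from a single soft decision node."""
    d0 = sigmoid(alpha * (w0 @ state - c0))   # near 1 when the cart is right of center
    return softmax(d0 * l0 + (1.0 - d0) * l1)

right_of_center = action_probs(np.array([2.0, 0.0, 0.0, 0.0]))
left_of_center = action_probs(np.array([-2.0, 0.0, 0.0, 0.0]))
```

<p>Right of center, the distribution leans toward moving left, and vice versa, which is the expert's rule expressed as differentiable parameters that RL can then adjust.</p>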
<p>Something else that caught my eye:</p>
<blockquote><p>While our initialized ProLoNets are able to follow expert strategies immediately, they may lack expressive capacity to learn more optimal policies once they are deployed into a domain. ... To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic deepening procedure.</p>
<p>Upon initialization, a ProLoNet agent maintains two copies of its actor: the shallower, unaltered initialized version and a deeper version, in which each leaf is transformed into a randomly initialized node with two new randomly initialized leaves. As the agent interacts with its environment, it relies on the shallower networks to generate actions and value predictions and to gather experience, After each episode, our off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves to the entropy of the deeper actor's leaves and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor. We find that this dynamic deepening improves stability and ameliorates policy degradation.</p>
</blockquote>
<p>This strikes me as the beginning of the future, where neural network architecture is learned and adjusted dynamically alongside the network parameters.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Parting-thoughts">Parting thoughts<a class="anchor-link" href="#Parting-thoughts">¶</a></h1><ol>
<li>I'm extremely pleased to have finally gotten this off the ground. Please comment on anything and everything, and we'll drive this thing together.</li>
<li>Growing Action Spaces is immediately relevant to my group, since in the medium-term, we intend to increase our action spaces combinatorially, and will inherit all of the trouble this brings. More on this another time.</li>
<li>I wonder how often in complex real environments the "Learning to Interactively Learn and Assist" agents will learn to communicate in a way that humans find unintuitive. Since the quickest way to communicate involves some compression, would we need to add some term representing human understandability? How best to do this?</li>
<li>"Learning to Interactively Learn and Assist" seems like a relevant paper for AI safety, though as far as I could tell in my quick read, it wasn't billed that way. If we train agents that don't have goals of their own necessarily, but take their cues from us in real time, are we safer than if we attempted to craft the perfect reward function, or demonstrated our desires in a one-and-done fashion?</li>
<li>I've gotta actually read the ProLoNets paper. There was even more to it than I highlighted, and they included an ablation study which will likely tell me if I can incorporate their concepts piecemeal into my own work.</li>
</ol>
</div>
</div>
</div>