# This (last) week¶

Alas, I bit off more than I could chew last week. You'll see what I mean in a moment. However, I've decided to define the problem away, as part of an effort to more effectively juggle all of my life responsibilities:

**ArXiv Highlights will be bi-weekly** from here on out. I'm also going to be a *little* less strict about which weeks I sample papers from, so that I don't feel so constrained to cover only "last week's" arXiv announcements. The attentive reader may have noticed that I've already occasionally sampled from outside of the week's announcements, and I'd actually prefer to do that more often so that I can hit *key* papers instead of just *new* papers.

I couldn't just pick one paper last week, since so many seemed relevant and interesting. Therefore I'm experimenting with yet another format for arXiv highlights: posting all the abstracts, and commenting a bit on each one. The goal is to work each of these concepts into my memory (and yours) so that they'll spring to mind when we need them.

In arXiv announcement order:

- Emergent Tool Use From Multi-Agent Autocurricula
- Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?
- Brain-Inspired Hardware for Artificial Intelligence: Accelerated Learning in a Physical-Model Spiking Neural Network
- Pre-training as Batch Meta Reinforcement Learning with tiMe
- Reinforcement Learning with Chromatic Networks
- Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery
- Model Imitation for Model-Based Reinforcement Learning
- Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning

# 1. Emergent Tool Use From Multi-Agent Autocurricula¶

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.

https://arxiv.org/abs/1909.07528v1

I notice that OpenAI, DeepMind, and Google Brain are involved in a lot of the interesting work in reinforcement learning lately, and many of the papers that catch my eye have at least some authors from one of these organizations. I'm an aspiring Bayesian, so it wasn't *too* long before I started reading papers *because* they were authored by one of these organizations.

Anyway, the term "autocurriculum" seems to come from this DeepMind paper:

Here we explore the hypothesis that multi-agent systems sometimes display intrinsic dynamics arising from competition and cooperation that provide a naturally emergent curriculum, which we term an autocurriculum.

This gives me a word for something I've observed about my young son: The activities he's naturally inclined to engage in at each stage of his development seem uncannily well-suited for teaching him the *next* thing he should learn. Wanting to put things in his mouth, plus a capacity for boredom, motivated him to develop reaching and grabbing, then crawling, then pathfinding, then complex navigation...

In the case of the OpenAI paper, putting multiple adversarial agents into complex environments and allowing them to learn causes them to learn new behaviors *in phases*.

We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt

Each time a team of agents learns a dominant strategy, the opposing team is pressured to develop a strategy capable of defeating it, which then pressures the first team to come up with *another* strategy to defeat *that* one, and on and on until a truly dominant strategy emerges.

My takeaway from this is that emergent autocurricula may be another good reason for me to study multi-agent systems.

This paper comes with a nice blog post and a cute video: https://openai.com/blog/emergent-tool-use/

# 2. Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?¶

Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others. However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard "shallow" RL architectures. In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation. Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures. Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.

https://arxiv.org/abs/1909.10618v1

The big finding here is that "most of the observed benefits of hierarchy can be attributed to improved exploration".

This is not the first time I've heard that a complicated RL technique, once studied carefully, boils down to better exploration or more even coverage of the state space. Diagnosing Bottlenecks in Deep Q-learning Algorithms contained a similar revelation about replay buffer sampling, for example, and the maximum entropy RL framework seems to be overtaking more ad hoc trust region methods such as PPO. Anyway, this is why theory matters even to mere industry practitioners: as theory catches up to practice, we learn *why* things work, and the answers are often surprising and useful.

# 3. Brain-Inspired Hardware for Artificial Intelligence: Accelerated Learning in a Physical-Model Spiking Neural Network¶

Future developments in artificial intelligence will profit from the existence of novel, non-traditional substrates for brain-inspired computing. Neuromorphic computers aim to provide such a substrate that reproduces the brain's capabilities in terms of adaptive, low-power information processing. We present results from a prototype chip of the BrainScaleS-2 mixed-signal neuromorphic system that adopts a physical-model approach with a 1000-fold acceleration of spiking neural network dynamics relative to biological real time. Using the embedded plasticity processor, we both simulate the Pong arcade video game and implement a local plasticity rule that enables reinforcement learning, allowing the on-chip neural network to learn to play the game. The experiment demonstrates key aspects of the employed approach, such as accelerated and flexible learning, high energy efficiency and resilience to noise.

https://arxiv.org/abs/1909.11145v1

This paper was presented at ICANN 2019, and published in Lecture Notes in Computer Science. In case it's unclear what's going on here: The authors built a small-scale prototype (32 neurons, 32 synapses each) of an apparently *analog* hardware simulation of a biological learning model of the brain (STDP). They then used it to a) simulate a simplified Pong (on-chip), and b) successfully learn to play using reinforcement learning (again, on-chip). Emulating their own system on an Intel i7-4771 was an order of magnitude slower, so we're talking about a real improvement. This is an auspicious beginning, and they hint at scaled-up work to come.

I look forward to specialized neuronal hardware. I'm especially interested to hear that they simulated actual neurons to some degree, with spike-timing dependence, rather than the simplified model that I'm used to working with. I expect this means they intend to simulate actual brains at some point. Stay tuned.

# 4. Pre-training as Batch Meta Reinforcement Learning with tiMe¶

Pre-training is transformative in supervised learning: a large network trained with large and existing datasets can be used as an initialization when learning a new task. Such initialization speeds up convergence and leads to higher performance. In this paper, we seek to understand what the formalization for pre-training from only existing and observational data in Reinforcement Learning (RL) is and whether it is possible. We formulate the setting as Batch Meta Reinforcement Learning. We identify MDP mis-identification to be a central challenge and motivate it with theoretical analysis. Combining ideas from Batch RL and Meta RL, we propose tiMe, which learns distillation of multiple value functions and MDP embeddings from only existing data. In challenging control tasks and without fine-tuning on unseen MDPs, tiMe is competitive with state-of-the-art model-free RL method trained with hundreds of thousands of environment interactions.

https://arxiv.org/abs/1909.11373v1

This paper attempts to bring the benefits of pretraining (on some pre-recorded batch) to reinforcement learning. This is non-trivial, since Q-learning algorithms are known to be unstable on batches produced by "foreign policy" (my phrase).

The value function diverges if Q fails to accurately estimate the value of $\pi(s')$

This is mitigated in online Q-learning because the contents of the replay buffer, while off-policy with respect to the current $\pi$, were at least produced through interaction with the environment, so the distribution of the induced $\pi$ doesn't deviate too far from the distribution in the replay buffer. Even then, this phenomenon is still a source of instability for Q-learning.

In batch learning, the problem is worse. The recorded batch was *not* produced by our induced policy, and perhaps not even by a *single* policy. Further, the environment reflected in the batch may not even have been produced by a single Markov decision process.
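A toy numpy sketch of this failure mode (entirely my own construction, with made-up numbers, not tiMe's method): the batch only ever contains one action, yet the bootstrapped target takes a max over *all* actions, so errors in the never-corrected estimates leak into the values of the observed action.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99

# A fixed batch produced by a narrow "foreign policy" that only ever takes action 0.
# Each entry: (state, action, reward ~ N(0, 1), next_state).
batch = [(s, 0, rng.normal(), (s + 1) % n_states)
         for s in range(n_states) for _ in range(20)]

Q = np.zeros((n_states, n_actions))
Q[:, 1:] = 10.0  # arbitrary estimation error on actions the batch never corrects

# Fitted Q iteration on the batch.
for _ in range(50):
    targets = {}
    for s, a, r, s2 in batch:
        # The target bootstraps from max over ALL actions, including unseen ones.
        targets.setdefault((s, a), []).append(r + gamma * Q[s2].max())
    for (s, a), ts in targets.items():
        Q[s, a] = np.mean(ts)

# True expected return of the behavior in the batch is near 0, but the
# uncorrected 10.0 estimates pull Q for the observed action up toward
# gamma * 10, and nothing in the batch can ever fix them.
print(Q[:, 0])
```

With environment interaction, the agent would eventually try those overvalued actions and correct the estimates; on a fixed batch it never can.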

I'm interested in *this* paper because the authors apply meta RL to the problem, and claim to achieve good performance on unseen MDPs sampled from the same family as those represented by the training batch. If that's so, it has positive implications for my own work.

# 5. Reinforcement Learning with Chromatic Networks¶

We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as 17 weight parameters, providing >90% compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.

https://arxiv.org/abs/1907.06511v2

From the introduction:

The main question we tackle in this paper is the following:

Are high dimensional architectures necessary for encoding efficient policies and if not, how compact can they be in practice?

More compact architectures not only take less space, but also produce inferences more quickly and cheaply. This matters to me because my work is often done on cloud computing infrastructure, which incentivizes parsimony. I'm also professionally interested in neural architecture search for multi-task scaling purposes. More on this later, perhaps.

The authors find compact policies by jointly optimizing the RL objective and "the combinatorial nature of the network’s parameter sharing profile". Inspired by two other papers, they reduce the number of distinct weights by *sharing* a single weight between multiple neuronal connections. The first paper from which their inspiration for this arises used Toeplitz matrices to represent the neural network, and the second randomly assigns weights (Weight-Agnostic Neural Networks, or WANNs) and then learns the connection topology to maximize an RL goal.

WANNs replace conceptually simple feedforward networks with general graph topologies using NEAT algorithm providing topological operators to build the network.

Our approach is a middle ground, where the topology is still a feedforward neural network, but the weights are partitioned into groups that are being learned in a combinatorial fashion using reinforcement learning. While [10] shares weights randomly via hashing, we learn a good partitioning mechanisms for weight sharing.
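To make the weight-sharing idea concrete, here's a toy illustration of my own (the shapes and names are assumptions, not the paper's implementation): every edge of a dense layer is assigned a "color", and all edges of one color share a single trainable weight. In the paper the coloring is learned with an ENAS-style controller and the shared weights are trained with ES; here both are just random.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, n_colors = 8, 4, 5  # 32 edges compressed to 5 distinct weights

# The "chromatic" part: an edge-to-color assignment (learned in the paper).
coloring = rng.integers(0, n_colors, size=(out_dim, in_dim))

# The only trainable parameters: one weight per color.
shared_weights = rng.normal(size=n_colors)

def layer(x):
    # Expand 5 parameters into a full 4x8 weight matrix via the coloring.
    W = shared_weights[coloring]
    return np.tanh(W @ x)

x = rng.normal(size=in_dim)
y = layer(x)
print(y.shape)  # (4,)
```

The policy behaves like an ordinary feedforward layer, but its parameter count is the number of colors, not the number of edges, which is where the >90% compression comes from.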

How do they do this?

We leverage recent advances in the ENAS (Efficient Neural Architecture Search) literature and theory of pointer networks to optimize over the combinatorial component of this objective and state of the art evolution strategies (ES) methods to optimize over the RL objective.

Our key observation is that ENAS and ES can naturally be combined in a highly scalable but conceptually simple way.

Ah. So... *how* do they do this?

We'll both just have to read the whole paper. Even on a light read, I noticed this one is so full of interesting insights and pointers to important results from other research that it's worth our time. The gist, though: they alternate between neural architecture search and RL optimization, using their own ENAS variant to optimize a pointer network capable of partitioning weights to be shared.

# 6. Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery¶

Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/dynamical-distance-learning

https://arxiv.org/abs/1907.08225v2

I picked up this paper partly because reward shaping is currently of professional interest to me, but also because I'm watching Haarnoja for his work on distributional RL.

This paper is about making reward-shaping easier by learning a more direct distance measure for the purpose. In general, if you know your distance from a goal, there are many optimization methods available to you for reducing that distance and achieving your goal. The better this distance measure, the smoother the landscape, and the more quickly you arrive.
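Here's a toy version of the core idea, in my own drastically simplified form (a tabular chain world instead of the paper's learned function approximator): estimate the dynamical distance $d(s, g)$ as the average number of steps from a visit of $s$ to the next visit of $g$ in unsupervised trajectories, then use $-d(s, g)$ as a dense shaped reward toward goal $g$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # states on a chain 0..5; the goal will be state 5

def rollout(T=200):
    # Unsupervised exploration: a random walk along the chain.
    s, traj = 0, [0]
    for _ in range(T):
        s = min(n - 1, max(0, s + rng.choice([-1, 1])))
        traj.append(s)
    return traj

trajs = [rollout() for _ in range(100)]

# Empirical dynamical distance: average steps from a visit of s
# to the next visit of g within the same trajectory.
gaps = [[[] for _ in range(n)] for _ in range(n)]
for traj in trajs:
    for i, s in enumerate(traj):
        seen = set()
        for j in range(i + 1, len(traj)):
            g = traj[j]
            if g not in seen:
                seen.add(g)
                gaps[s][g].append(j - i)
            if len(seen) == n:
                break

d = np.array([[np.mean(gaps[s][g]) if gaps[s][g] else np.inf
               for g in range(n)] for s in range(n)])

goal = n - 1
shaped_reward = -d[:, goal]  # dense signal: states closer in time score higher
print(shaped_reward)
```

Unlike a sparse "1 at the goal" reward, this gives every state a gradient toward the goal, which is exactly the smoothness that hand-crafted reward shaping tries to provide.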

The semi-supervised way in which they approach this problem also strikes me as relevant to AI safety, a topic of personal interest. On top of the unsupervised training of their dynamical distance measure, they add "a small amount of preference supervision" to set the task goal, and that is enough to achieve it. Manually-specified reward functions are dangerous, and I'm interested in any novel methods that avoid their direct use (or, more to the point, I'm interested in methods of motivating AIs which more directly align holistic human flourishing with the AI's objectives).

For a quick overview, don't miss the link they posted at the end of the abstract.

# 7. Model Imitation for Model-Based Reinforcement Learning¶

Model-based reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the one-step transition models, which has inherent difficulty ensuring the matching of distributions from multi-step rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multi-step rollouts sampled from the synthesized model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the state-of-the-art in terms of sample complexity and average return.

https://arxiv.org/abs/1909.11821v1

AGI seems likely to be model-based, rather than model-free. I think this because I (an AGI) personally reuse my own models all the time, frequently attempting near transfer to solve novel problems. So anything that claims progress on model-based learning is at least worth a look to me.

Earlier I blogged about Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy, and looking at it now, I'm surprised I only *alluded* to their use of Transformers. In that paper, they learn to imitate a past trajectory by mimicking the trajectory distribution, conditioned on a past trajectory. *This* paper wants to create a model of the environment that similarly mimics its distribution, but they use Wasserstein GANs (WGANs) instead. GANs have been wildly successful in generative image models, and WGANs are an especially promising variant. I've been keeping an eye out for papers that use GANs in areas outside computer vision.

If you know what a WGAN is and you understand that the authors are trying to get a WGAN to mimic the environment's bounded trajectory segment transition distribution, then you can imagine what they're doing. They also provide a theoretical bound for the expected distributional error.
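In symbols (my paraphrase of the setup, not the paper's exact notation): the learned model $M$ is trained against a 1-Lipschitz critic $f$ to minimize the Wasserstein-1 distance between multi-step rollouts $\tau$ from the real environment and rollouts sampled from $M$:

$$\min_{M} \; \max_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{\tau \sim p_{\text{real}}}\left[f(\tau)\right] - \mathbb{E}_{\tau \sim p_{M}}\left[f(\tau)\right]$$

This is just the standard WGAN objective with trajectory segments in place of images.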

I'll mention in closing that in their experiments, they also end up doing better than most other methods while using 50% fewer samples. That it works at all suggests the learned model is accurate enough to be worth taking seriously. GANs are interesting.

# 8. Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning¶

To find the path that minimizes the time to navigate between two given points in a fluid flow is known as Zermelo's problem. Here, we investigate it by using a Reinforcement Learning (RL) approach for the case of a vessel which has a slip velocity with fixed intensity, Vs , but variable direction and navigating in a 2D turbulent sea. We show that an Actor-Critic RL algorithm is able to find quasi-optimal solutions for both time-independent and chaotically evolving flow configurations. For the frozen case, we also compared the results with strategies obtained analytically from continuous Optimal Navigation (ON) protocols. We show that for our application, ON solutions are unstable for the typical duration of the navigation process, and are therefore not useful in practice. On the other hand, RL solutions are much more robust with respect to small changes in the initial conditions and to external noise, even when V s is much smaller than the maximum flow velocity. Furthermore, we show how the RL approach is able to take advantage of the flow properties in order to reach the target, especially when the steering speed is small.

https://arxiv.org/abs/1907.08591v2

This paper is a pure personal indulgence. I read James Gleick's Chaos and got interested in dynamical systems theory. I don't know much, but I do know turbulent flows are a pain to predict, which I assume means they're a pain to navigate within. I've also heard that neural networks do surprisingly well at predicting chaotic dynamics, so I'm interested to see that ability "applied". The paper brings up several other examples of successful neural navigation and prediction:

Promising results have been obtained when applying RL algorithms to similar problems, such as the training of smart inertial particles or swimming particles navigating intense vortex regions [31], Taylor Green flows [32] and ABC flows [33]. RL has also been successfully implemented to reproduce schooling of fishes [34, 35], soaring of birds in a turbulent environments [36, 37] and in many other applications [38–40]. Similarly, in the recent years, artificial intelligence techniques are establishing themselves as new data driven models for fluid mechanics in general [41–46].

I can't state their results better than they can, so here you go:

In this paper, we show that for the case of vessels that have a slip velocity with fixed intensity but variable direction, RL can find a set of quasi-optimal paths to efficiently navigate the flow. Moreover, RL, unlike ON, can provide a set of highly stable solutions, which are insensitive to small disturbances in the initial condition and successful even when the slip velocity is much smaller than the guiding flow. We also show how the RL protocol is able to take advantage of different features of the underlying flow in order to achieve its task, indicating that the information it learns is non-trivial.
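For orientation, the classical Zermelo setup they're solving looks like this (written in standard form; the paper's notation may differ): the vessel's position evolves as the sum of the flow velocity and a steering term of fixed magnitude $V_s$ with controllable heading $\theta(t)$,

$$\dot{\mathbf{x}}(t) = \mathbf{u}(\mathbf{x}, t) + V_s \left(\cos\theta(t), \sin\theta(t)\right),$$

and the controller chooses $\theta(t)$ to minimize the time to reach the target point. The difficulty is that in a turbulent flow $\mathbf{u}$ is chaotic, so analytically optimal headings are fragile while the learned policy is not.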

# Parting thoughts¶

- Surprisingly, even this list of eight does not cover *all* of the papers that sounded interesting to me. It was *quite* a good week for announcements on the arXiv.
- This is the format I originally had in mind for arXiv highlights, but since the abstracts tend to invite questions that I can't answer without at least skimming the paper, I ended up reading them more thoroughly. With this format, I can cover more ground, but less deeply.