# New series¶

This post begins a weekly series highlighting one or more RL papers in the previous week's cs.AI arXiv stream that caught my eye (making no guarantees about the correlation between what catches my eye and what ultimately turns out to be useful, important, etc). I'll be prioritizing sustainability over most other factors, but I do hope to show you some code from time to time.

I read these papers to differing degrees as I have time, so there will likely be some variability in descriptive volume. However, I do pledge to make only justified statements about them so far as I know, and I welcome errata in the comments. I'm still experimenting with the format and voice, so please leave me feedback early and often to influence the series.

# This week¶

All of this week's papers piqued my interest because of the sample-efficiency problem in modern DRL. Reinforcement learning algorithms need to interact with the environment quite a bit before they become good at a task, and anything that can shorten this time is of interest. My group is currently working on a learning task with a very low sample rate, so we are actively on the hunt for anything that improves sample efficiency.

## Growing Action Spaces¶

Growing Action Spaces proposes a form of "curriculum learning", where a more complex task is broken down into a sequence of simpler tasks, sometimes by humans, sometimes automatically. In this case, the authors improved the learning speed of their agent by initially giving it fewer actions to work with, training for a while, and then alternating between giving it more actions to work with and training.

Interestingly, they were working in Starcraft, which is a real-time strategy (RTS) game, where you have to control multiple units simultaneously in a coordinated fashion to achieve some goal. Thus, in their domain, the size of the action space didn't just come from continuity or a really large discrete action space, but from the fact that the actions they were capable of taking were combinatorial. That is, they had to train an agent to take actions from a space including any combination of primitive actions, as well as any combinations of units; a daunting task.

Their solution is brilliant, and highly general: The authors broke the action space up into a hierarchy of action spaces by grouping units, and requiring that the same action be taken by all units within the same group. Then as training progressed, more groups were allowed to act independently. This resulted in a tractable problem at each stage of training, and overall high-performance policies that would have been prohibitively complex with conventional DRL algorithms.

If you or I want to apply this method to our own problems, the key requirement is to come up with a suitable way of breaking large action spaces into hierarchies of progressively smaller ones.

## Learning to Interactively Learn and Assist¶

Reinforcement learning typically depends on a sparse reward signal and random exploration, both of which contribute to poor sample efficiency in modern algorithms. One method of improving sample efficiency and solving the exploration problem is imitation learning, where the agent is pre-trained to mimic expert behavior. However, expert demonstrations are expensive, and it's often difficult to know how much and of what kind will suffice. These are the problems Learning to Interactively Learn and Assist attempts to solve by proposing a different paradigm entirely: without explicit demonstrations or reward function.

The goal is for an agent and a "principal" (say, a human) to learn to work together to accomplish the principal's purpose. The agent takes its cues from the principal's behavior, and acts helpfully. This requires prior understanding, both of the environment and of what constitutes communication from the principal.

To get to this point, the authors trained an agent jointly with a "human surrogate" principal on a variety of tasks in the same environment. Each time, the principal knows the task (as part of its observation input), and the agent does not. They receive a joint reward at the end of the episode.

By informing the principal of the current task and withholding rewards and gradient updates until the end of each task, the agents are encouraged to emerge interactive learning behaviors in order to inform the assistant of the task and allow them to contribute to the joint reward.

Prior domain knowledge required to jointly accomplish a given task is trained into the agent ahead of time this way, along with the methods of communication. Actions and observations are restricted to the environment, so that later the principal may be replaced with a human.

## ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning¶

ProLoNets stands for "Propositional Logic Nets", which are a neural network architecture and method of initialization that allows a domain expert to encode initial behavior for a DRL agent in the form of propositional logic.

To give you the flavor:

To illustrate this more practically, we consider the simplest case of a cart pole ProLoNet with a single decision node. Assume we have solicited the following from a domain expert: "If the cart's $x$ position is right of center, move left; otherwise, move right," and that they indicate x_position is the first input feature and that the center is at 0. We therefore initialize our primary node $D_0$ with $w_0=[1,0,0,0]$ and $c_0=0$. We then specify $l_0$ to be a new leaf with a prior of $[1,0]$. Finally, we set the path to $l_0$ to be $D_0$ and the path $l_1$ to be $(1-D_0)$. Consequently for each state, the probability distribution over the agent's two actions is a softmax over $(D_0*l_0+(1-D_0)*l_1)$

I've barely skimmed this paper so I don't know what each of the components means, but I gather that a human-authored decision tree can be translated directly into a correctly-initialized neural network architecture, and an actor-critic algorithm takes over from there to improve beyond the human expert's baseline.

Something else that caught my eye:

While our initialized ProLoNets are able to follow expert strategies immediately, they may lack expressive capacity to learn more optimal policies once they are deployed into a domain. ... To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic deepening procedure.

Upon initialization, a ProLoNet agent maintains two copies of its actor: the shallower, unaltered initialized version and a deeper version, in which each leaf is transformed into a randomly initialized node with two new randomly initialized leaves. As the agent interacts with its environment, it relies on the shallower networks to generate actions and value predictions and to gather experience, After each episode, our off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves to the entropy of the deeper actor's leaves and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor. We find that this dynamic deepening improves stability and ameliorates policy degradation.

This strikes me as the beginning of the future, where neural network architecture is learned and adjusted dynamically alongside the network parameters.

# Parting thoughts¶

1. I'm extremely pleased to have finally gotten this off the ground. Please comment on anything and everything, and we'll drive this thing together.
2. Growing Action Spaces is immediately relevant to my group, since in the medium-term, we intend to increase our action spaces combinatorially, and will inherit all of the trouble this brings. More on this another time.
3. I wonder how often in complex real environments the "Learning to Interactively Learn and Assist" agents will learn to communicate in a way that humans find unintuitive. Since the quickest way to communicate involves some compression, would we need to add some term representing human understandability? How best to do this?
4. "Learning to Interactively Learn and Assist" seems like a relevant paper for AI safety, though as far as I could tell in my quick read, it wasn't billed that way. If we train agents that don't have goals of their own necessarily, but take their cues from us in real time, are we safer than if we attempted to craft the perfect reward function, or demonstrated our desires in a one-and-done fashion?
5. I've gotta actually read the ProLoNets paper. There was even more to it than I highlighted, and they included an ablation study which will likely tell me if I can incorporate their concepts piecemeal into my own work.