# This week¶

Just a sketch this week, of Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies.

I've been hearing hierarchical RL mentioned frequently lately, and while I understand it's a way to encode human expertise to achieve otherwise intractible goals, it has also seemed a bit like cheating. However, I have a day job, and this serves as a healthy dose of pragmatism. I also think that even when the goal is fundamental progress, it's often a good idea to achieve the goal in any way possible, and then follow-up by working the cheats out of the system one by one. So when I read the abstract of this paper, I was feeling more receptive than previously.

Part of what made hierarchical RL seem not worth the cheating was how kludgy and inefficient the usual methods were, retraining a whole new policy from scratch for each subtask. That's why this week's paper caught my eye:

... we propose an algorithm for learning both compound and composable policies within the same learning process by exploiting the off-policy data generated from the compound policy.

Their resulting algorithm, "Hierarchical Intentional-Unintentional Soft Actor-Critic" (HIU-SAC), efficiently trains all sub-policies simultaneously, choosing actions to perform in the environment using a weighted average of the "votes" of all sub-policies, with weights given by a learned selector network (which is also simultaneously trained).

# Composable hierarchical RL¶

## Architecture¶

The composite policy consists of the individual policy networks, each with its own reward function, trained to take observations $s$ in and output parameters of a conditional Gaussian. There is also a special activation vector selector network trained on the same states to produce weights corresponding to how much each constituent policy applies to the current state. All of these networks share early layers, since they all benefit from an accurate high-level state representation. Finally, some function $f$ takes all of these outputs and determines what action $a$ to actually take in the environment.

The Q function networks are similarly arranged, sharing early layers which take a state $s$ and an action $a$ to produce a Q function for each subtask, as well as a composite Q function.

## Simultaneous learning¶

Most methods learn the composable tasks one at a time, and later, the compound task. This procedure is not scalable as all the experience collected for each learning process is only used for that specific process. Also, it is not possible to start learning more complex tasks unless all the compos- able policies have been successfully learned. The method proposed in this section is based on the idea that a single stream of experience can be used to improve not only the policy that is generating the behavior but also, indirectly, many other policies.

The authors refer to the composite policy acting as the "intentional" policy (the "behavior" policy in an off-policy setting), and the composable sub-policies as the "unintentional" policies (each one a "target" policy in an off-policy setting). They use a variation on SAC to train the composite and composable policies simultaneously within the maximum entropy framework.

The objective function for the Q networks simply maximize the expected sum of all mean-squared Bellman errors for each Q network, for each tuple in the replay buffer $\mathcal{D}$. The objective function for the policy is simply the sum of the objective functions for each intentional and unintentional policy. Each policy objective optimizes the expected difference for each state in $\mathcal{D}$ between the Q value and log-probability of the selected action (adjustable by temperature $\alpha$), over all possible actions. HIU-SAC then alternates between policy evaluation and policy improvement steps following SAC.

## The importance of maximizing entropy to adequate exploration¶

It is interesting that the entropy-maximizing RL objective was absolutely necessary for exploring broadly enough to train all of these policies at once.

Note that populating the replay memory buffer with rich experiences is essential for acquiring multiple skills in an off-policy manner. The composable policies learned unintentionally had similar performance than the policies obtained in single-task formulations only when the compound policy was able to efficiently explore the environment. For this reason, the algorithm was built on a maximum entropy RL framework to favor exploration during the learning process.

# Parting thoughts¶

1. In a way, the methods proposed here seem rather obvious, and I found this paper quite easy to understand given that it violated none of my expectations. I also haven't been paying enough attention to hierarchical RL to know off-hand why training the sub-policies in parallel off of the same recorded environment interactions hasn't been tried before (or whether it has been without my notice). Perhaps it was necessary for off-policy RL to reach a level of maturity sufficient for sub-policies to see enough relevant data to train? In any case, don't hear me faulting the authors for trying the obvious. It is relieving a non-obvious that a straightforward formulation works so well.
2. I'd love to see this work combined with imitation learning and inverse RL to figure out what sub-policies are necessary in the first place from demonstrations. That seems like a very practical framework for real-world learning.