# This week¶

Only one paper this week, not because others failed to catch my eye, but for brevity. Let me know in the comments if you agree that shorter or more focused articles are more attractive. So this week I'll be examining just one paper: Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog. As with last week's papers, this week's is interesting to me professionally. Batch DRL is a way to solve the sample efficiency problem, from a certain perspective. It's mostly the online learning that costs too much when sample efficiency is low, so solving the problems that come with attempting to train offline might allow us to do many of the same things we could do if we had high online sample efficiency.

# RL for open-domain dialog generation¶

The author's domain is dialog generation. They want to build a better chat bot, and they have quite a few recorded conversations. RL is good at refining these processes, but has a cold-start problem, plus they would certainly prefer to make use of the data they have on-hand. For this, they need to be able to make use of offline data, hence "Way Off-Policy". This data is so off-policy it wasn't even generated by a policy.

So they want to train DRL from samples acquired from some other control of the system (in their case, human interaction data), much like Deep Q-learning from Demonstrations. There are a couple of reasons this is important for others such as myself:

First, since collecting real-world interaction data can be expensive and time-consuming, algorithms must be able to leverage off-policy data - collected from vastly different systems, far into the past - in order to learn.

Second, it is often necessary to carefully test a policy before deploying it to the real world; for example, to ensure its behavior is safe and appropriate for humans. Thus the algorithm must be able to learn offline first, from a static batch of data, without the ability to explore

# A generative model + Q learning¶

The authors first pre-train a generative model on the distribution of collected trajectories, and initialize the Q networks from this model. They then sample a fixed number of actions from it, and output the one with the highest Q-value as their policy's decision. In later reinforcement learning, they penalize their model for KL-divergence from this distribution.

To perform batch Q-learning, we first pre-train a generative model of $p(a|s)$ using a set of known environment trajectories. In our case, this model is then used to generate the batch data via human interaction. The weights of the Q-network and target Q-network are initialized from the pre-trained model, which helps reduce variance in the Q-estimates and works to combat overestimation bias. To train $Q_{θ_π}$ we sample < $s_t$, $a_t$, $r_t$, $s_{t+1}$ > tuples from the batch, and update the weights of the Q-network to approximate Eq. 1. This forms our baseline model, which we call Batch Q

# Overestimation bias¶

Most deep RL algorithms fail to learn from data that is not heavily correlated with the current policy. Even models based on off-policy algorithms lik Q-learning fail to learn when the model is not able to explore during training. This is due to the fact that such algorithms are inherently optimistic in the face of uncertainty.

If you’re taking the max of something (as in Bellman-equation-based algorithms), then the higher the variance, the higher the max value. This causes an over-estimation bias. We may have seen a really high value for some state once, so now we over-value that state, despite it being atypical. It may not be immediately obvious why this is a problem, but which states are we likely to overvalue? Precisely the states we haven't visited often. Why is that a problem? This sounds good for exploration, right? But if we're trying to train our agent with canned data, it's important that the live agent stick pretty close to the states where the canned data does well, and it's counter-productive to have it believe that everywhere but the pre-explored state space is worth exploring.

A popular solution to the overestimation problem in Q-learning algorithms is to train two Q networks on the same data, put the input through both, and take the minimum value. This helps with the bias because they'll likely disagree unless we can be really certain of the value of the input, and if they disagree we can go with the least confident. The authors of the current paper take a different tack, training a single neural net with dropout, and using the disagreement with different dropout masks as an estimate of uncertainty.

# Parting thoughts¶

1. I didn't talk much about their model architecture, which is "Variational Hierarchical Recurrent Encoder Decoder (VHRED)", largely because I think if I ever tried to make use of this directly I would employ transformers instead. They do mention that transformer architectures are a "powerful alternative", but they chose to work with hierarchical architectures so they could extend their work to hierarchical control in the future. That's interesting. In my own work at the moment, the important thing is the "way off-policy" part, not so much the chat bot part.

2. It's very interesting to me that both of the methods for correcting overestimation bias make use of uncertainty estimators that I've seen mentioned elsewhere:

...we show that the disagreement between only two neural networks is sufficient to produce a low-variance estimate of the epistemic uncertainty on the return distribution, thus providing a simple and computationally cheap uncertainty metric.

...we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs

3. This article wasn't really shorter than if I had done multiple papers, less deeply. I'll have to practice at that, not least because it's time-consuming, but information is valuable. How does Adrian Colyer do this every day?