# This week

This week's paper is Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning. The idea is that an agent interacting with another agent can learn to assess the threat that other agent may pose. It does this by actively testing the opponent agent's behavior, and does not assume the opponent's behavior remains stationary. It uses Bayesian filtering to update its belief about the disposition of the opponent, and that's why this paper caught my eye. I'm on a Bayesian kick lately.

To summarize, the contribution here is the development of a scalable robust active perception method in scenarios where a potential adversary opponent could be actively hostile to the intent recognition activity, which extends and outperforms the POMDP methods.

I'm a bit short on time this week, so I apologize for the amount of jargon and the unusually high level of confusion.

# Problem setup

We model the active perception problem as a planning problem, defined by the tuple $\langle S,A^a,A^o,T,O,R,b_0,\gamma \rangle$, where $S=\langle S^o,S^p \rangle$ is the state of the world, consisting of the set of observable states $S^o$ and the set of partially observable states $S^p$; $A^a$ is the set of actions of the autonomous agent; $A^o$ is the set of actions of the opponent; we further assume that regardless of the intention, the opponent has the same set of observable actions. Otherwise, an intention is easily identifiable once an action that is uniquely corresponding to that type of intention is observed. $T:S \times A^a \times A^o \rightarrow \Delta_S$ is the transition probability, where $\Delta_{\bullet}$ denotes the space of probability distribution over the space $\bullet$. $O: S \times A^a \rightarrow \Delta_{A^o}$ is the observation probability; $R: S \times A^a \times A^o \rightarrow \mathbb{R}$ is the reward function; $b_0$ is the prior probability of the opponent being an adversary; and $\gamma$ is the discount factor.

Further, the opponent is assumed to be either neutral (merely self-interested, in a known way) or hostile (goal-directed, as defined by a known MDP). It has bounded rationality (it may not be able to take the optimal action), and it is likely to behave deceptively.

Notice that the actual behavior of the opponent is known if its disposition is known, which to my mind may or may not be a reasonable assumption, depending on the setting. Since I've had AI safety on the brain lately, it strikes me as unrealistic in a situation where your opponent is smarter than you are. It may be more realistic in settings where everyone has the same goal and it's relatively clear how anyone would try to achieve it if they didn't have to deal with other agents.

The authors' adversarial model is interesting. Here $\lambda$ is the parameter of $\pi^o$ that specifies whether the opponent is neutral ($\lambda=0$) or adversarial ($\lambda=1$):

We use the following equation to model an adversarial agent's policy $\pi^o$:
\begin{align}
\pi^o(a^o_t|s_t,\lambda=1;\alpha,\beta) = {} & \text{argmin}_{\pi \in \Delta} \{\mathbb{KL}(\pi|\pi^{\text{MDP}}_{\alpha}) \\
& +\beta \mathbb{KL}(\pi|\pi^o(\cdot|s_t,\lambda=0)) \}, \\
\pi^{\text{MDP}}_{\alpha}(a_t^o|s_t,\lambda=1) = {} & e^{\alpha Q(s_t,a_t^o)}/Z(s_t)
\end{align}

The thing to take away from this is that both rationality and deception are tunable parameters. The rationality of the opponent is controlled by $\alpha$, which acts as an inverse temperature: the larger $\alpha$ is, the more $\pi^{\text{MDP}}_{\alpha}$ concentrates on the actions with the highest Q-values. The degree to which the opponent is deceptive is controlled by $\beta$, which weights the KL-divergence penalty for straying from the neutral policy: the larger $\beta$ is, the more the adversary's policy is pulled toward looking neutral.
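To get a feel for what $\alpha$ and $\beta$ do, it helps to compute the policy directly. Setting the gradient of the objective to zero (with a Lagrange multiplier for normalization) gives the closed form $\pi \propto \left(\pi^{\text{MDP}}_{\alpha}\,\pi_{\text{neutral}}^{\beta}\right)^{1/(1+\beta)}$. The sketch below assumes that closed form, a neutral policy with no zero entries, and made-up Q-values. The function names are mine, not the paper's.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def adversary_policy(q_values, neutral_policy, alpha, beta):
    """Minimizer of KL(pi | pi_MDP) + beta * KL(pi | pi_neutral).

    Assumes pi_MDP is proportional to exp(alpha * Q) and that every entry
    of neutral_policy is strictly positive (so the log is defined).
    """
    logits = [
        (alpha * q + beta * math.log(p0)) / (1.0 + beta)
        for q, p0 in zip(q_values, neutral_policy)
    ]
    # The log-partition term of pi_MDP is a constant and drops out in softmax.
    return softmax(logits)

# Hypothetical numbers, purely for illustration.
q = [1.0, 0.0, -1.0]
neutral = [0.2, 0.3, 0.5]
greedy_adversary = adversary_policy(q, neutral, alpha=2.0, beta=0.0)
deceptive_adversary = adversary_policy(q, neutral, alpha=2.0, beta=5.0)
```

With $\beta=0$ this reduces to the plain softmax over $\alpha Q$, and as $\beta$ grows the result drifts toward the neutral policy, which matches the reading of $\beta$ as a deception knob.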

# Bayesian filtering

We maintain a belief $b_t(\lambda)$ over the hidden variable by Bayesian filtering.

As I mentioned, I'm rather short on time today, so I must apologize again for not actually spending the time to explain this. For now, suffice it to say that the opponent is either neutral ($\lambda=0$) or hostile ($\lambda=1$), and how your agent reacts to it depends very much on which one of those it believes it is playing against. Bayesian filtering will allow it to make the most of the evidence available, so it can use its best guess as it trains.
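For a binary hidden variable, the filtering step is just Bayes' rule applied to each observed opponent action. Here is a minimal sketch of that textbook update, assuming the two likelihoods $\pi^o(a^o_t|s_t,\lambda=1)$ and $\pi^o(a^o_t|s_t,\lambda=0)$ are available as numbers. The function and argument names are hypothetical, and the paper's exact update may differ in detail.

```python
def update_belief(b, lik_adversary, lik_neutral):
    """One Bayes-rule step for the belief b = P(lambda = 1).

    lik_adversary: probability of the observed opponent action under lambda = 1
    lik_neutral:   probability of the same action under lambda = 0
    """
    numerator = b * lik_adversary
    evidence = numerator + (1.0 - b) * lik_neutral
    if evidence == 0.0:
        # The action is impossible under both models; keep the old belief.
        return b
    return numerator / evidence

# Start from a prior b_0 and fold in a sequence of observed action likelihoods.
belief = 0.5
for lik_adv, lik_neu in [(0.8, 0.2), (0.6, 0.4), (0.5, 0.5)]:
    belief = update_belief(belief, lik_adv, lik_neu)
```

An action that is equally likely under both models leaves the belief unchanged, which is why the agent has an incentive to provoke actions that separate the two hypotheses.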

We define a hybrid belief-state dependent reward to balance exploration and safety \begin{equation} \begin{aligned} r(b_t,s_t,a^a_t)&=-H(b_t)+r(s_t,a^a_t)\\ &=b\log b+(1-b)\log(1-b)+r(s_t,a^a_t), \end{aligned} \label{eq6} \end{equation} where we use the shorthand $b$ to denote $b_t(\lambda=1)$, the belief that the opponent is an adversary; and $r(s_t,a^a_t)$ is the state dependent reward.

This reward balances exploration behavior and safety. The negative entropy reward $-H(b_t)$ can be interpreted as maximizing the expected logarithm of true positive rate (TPR) and true negative rate (TNR). The state-dependent reward $r(s_t,a^a_t)$ depends both on the observable state and the partially observable intent state $\lambda$, as well as the action of the autonomous agent. This reward is used to ensure safety. For instance, some actions could be dangerous to the neutral [opponent], which are discouraged by a large negative reward.
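The hybrid reward is easy to compute by hand. The sketch below follows the formula above, with the usual convention that $x\log x \to 0$ as $x \to 0$. The state reward is passed in as a plain number since its actual form depends on the scenario, and the function names are my own.

```python
import math

def negative_entropy(b):
    """-H(b) = b*log(b) + (1-b)*log(1-b) for a Bernoulli belief b."""
    total = 0.0
    for p in (b, 1.0 - b):
        if p > 0.0:  # treat 0 * log(0) as 0
            total += p * math.log(p)
    return total

def hybrid_reward(b, state_reward):
    """Belief-dependent reward: negative entropy plus the state reward."""
    return negative_entropy(b) + state_reward

uncertain = hybrid_reward(0.5, 0.0)   # maximally uncertain belief
confident = hybrid_reward(0.99, 0.0)  # nearly resolved belief
```

The entropy term is most negative at $b=0.5$ and rises to zero as the belief approaches 0 or 1, so the agent is paid for resolving its uncertainty in either direction.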

Our agent is trained using Soft-Q Learning while values of $\lambda$ are varied, with corresponding opponent behavior. Interestingly, in the case study section the authors mention that the actual adversary models were not always provided in the learning phase.
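For readers who haven't met Soft-Q Learning, the key difference from ordinary Q-learning is that the hard max over next actions is replaced by a log-sum-exp "soft" value. The paper uses a deep network; what follows is only a tabular sketch of that backup, and the `temperature` parameter, the table layout, and the function names are my assumptions rather than anything taken from the paper.

```python
import math

def soft_value(q_row, temperature):
    """Soft state value: temperature * log(sum(exp(Q / temperature)))."""
    m = max(q_row)  # subtract the max for numerical stability
    return m + temperature * math.log(
        sum(math.exp((q - m) / temperature) for q in q_row)
    )

def soft_q_update(Q, s, a, r, s_next, lr, gamma, temperature):
    """One tabular soft Q-learning step; Q is a list of per-state lists."""
    target = r + gamma * soft_value(Q[s_next], temperature)
    Q[s][a] += lr * (target - Q[s][a])
    return Q[s][a]

# Two states, two actions, all values initialised to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
soft_q_update(Q, s=0, a=1, r=1.0, s_next=1, lr=0.5, gamma=0.9, temperature=1.0)
```

As the temperature shrinks, the soft value approaches the ordinary max and the update approaches standard Q-learning.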

The active perception agent has to identify the hidden intent while being robust to this model uncertainty, which is challenging.

# Parting thoughts

1. I admit to being a bit confused by this paper. The authors claim to do Bayesian filtering, but it's not an explicit feature of the algorithm. In fact, they seem to be sampling $\lambda$ for use in training by using only $b_0$, their prior probability for their belief state. Perhaps it's a typo.
2. They also seem to claim that the two models of the opponent behavior must be known, but then they mention they're not available during the learning phase in their case study. Drop me a line if this makes sense to you.