This week

This week I just want to pull the list of reward tampering problems from "Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective" to promote awareness of this problem. The paper is interesting for several other reasons as well, and I commend it to you:

Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence.

Reward tampering

I've heard it said that no agent will ever become more intelligent than it needs to be to edit its own reward function and give itself an easier task. This paper takes such problems seriously, with some encouraging results.

From an AI safety perspective, we must bear in mind that in any practically implemented system, agent reward may not coincide with user utility. In other words, the agent may have found a way to obtain reward without doing the task. This is sometimes called reward hacking or reward corruption. We distinguish between a few different types of reward hacking.

Reward gaming vs. reward tampering

The authors make a distinction between reward gaming, where the agent exploits a misspecification of the process that determines the rewards, and reward tampering, where the agent actually modifies that process. This paper is focused on the latter.

They then subdivide reward tampering into three subcategories, according to whether the agent has tampered with the reward function itself, with the feedback that trains the reward function, or with the input to the reward function.
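To make the distinction concrete, here is a toy sketch in Python (my own illustration, not code from the paper; the states and names are invented). Gaming exploits a misspecified reward function without touching it; tampering changes the reward process itself.

```python
# Toy illustration of reward gaming vs. reward tampering (hypothetical example).
# The reward function is misspecified: it pays for "looking busy" as a proxy
# for actually completing the task.

def reward_fn(state):
    return 1.0 if state["looks_busy"] else 0.0   # proxy reward, not true utility

def user_utility(state):
    return 1.0 if state["task_done"] else 0.0

# Reward gaming: the agent exploits the misspecification without modifying
# the reward process -- it looks busy but never finishes the task.
gamed_state = {"looks_busy": True, "task_done": False}
assert reward_fn(gamed_state) == 1.0 and user_utility(gamed_state) == 0.0

# Reward tampering: the agent modifies the process that produces rewards,
# here by overwriting the reward function itself (Section 3); corrupting the
# feedback that trains it (Section 4) and doctoring its input (Section 5)
# are the other two channels.
reward_fn = lambda state: 100.0                  # reward function tampering
tampered_state = {"looks_busy": False, "task_done": False}
assert reward_fn(tampered_state) == 100.0 and user_utility(tampered_state) == 0.0
```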

Tampering with the reward function: Section 3

First, regardless of whether the reward is chosen by a computer program, a human, or both, a sufficiently capable, real-world agent may find a way to tamper with the decision. The agent may for example hack the computer program that determines the reward. Such a strategy may bring high agent reward and low user utility. This reward function tampering problem will be explored in Section 3.

Fortunately, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.

In Section 3 the authors formalize the problem and propose two modified objectives that disincentivize tampering.
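As I understand it, a central idea in Section 3 is to have the agent evaluate its plans with the reward function it holds now, rather than with whatever (possibly tampered-with) reward function will exist in the future. The sketch below is my own toy rendering of that idea under that assumption; the environment, plans, and names are hypothetical.

```python
# Toy comparison (hypothetical, not the paper's code) of a standard RL objective
# versus planning with the *current* reward function.

def plan_value(trajectory, reward_fn):
    """Value of a predicted trajectory under a fixed reward function."""
    return sum(reward_fn(state) for state in trajectory)

def true_reward(state):
    # The reward function the agent holds right now: it rewards doing the task.
    return 1.0 if state == "task_done" else 0.0

def tampered_reward(state):
    # The reward function that would exist after the agent overwrites it.
    return 100.0

# Two candidate plans the agent could simulate:
do_task = ["working", "task_done"]
tamper = ["rewriting_reward_fn", "idle"]

# Standard objective: future states are scored by whatever reward function will
# exist at that time, so the tampering plan looks very attractive.
assert plan_value(tamper, tampered_reward) > plan_value(do_task, true_reward)

# Current-reward-function objective: all predicted states are scored with the
# reward function the agent has *now*, so tampering earns nothing.
assert plan_value(tamper, true_reward) < plan_value(do_task, true_reward)
```

Under the standard objective the tampering plan dominates; under the current-reward-function objective the agent still prefers to complete the task.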

Manipulating the feedback mechanism: Section 4

The related problem of reward gaming can occur even if the agent never tampers with the reward function. A promising way to mitigate the reward gaming problem is to let the user continuously give feedback to update the reward function, using online reward modeling. Whenever the agent finds a strategy with high agent reward but low user utility, the user can give feedback that dissuades the agent from continuing the behavior. However, a worry with online reward modeling is that the agent may influence the feedback. For example, the agent may prevent the user from giving feedback while continuing to exploit a misspecified reward function, or manipulate the user to give feedback that boosts agent reward but not user utility. This feedback tampering problem and its solutions will be the focus of Section 4.

Section 4 proposes several potential modifications to disincentivize or directly prevent feedback manipulation, ultimately with the recommendation that they be combined in an ensemble.
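To see where the incentive comes from, here is a small self-contained sketch with made-up numbers (a generic illustration, not necessarily the paper's exact proposal): if the agent's plans are scored by the reward model that would exist after its actions, manipulating the user's feedback pays; scoring plans only against feedback the agent cannot influence removes that pull.

```python
# Toy illustration of the feedback tampering incentive in online reward modeling
# (hypothetical example; the "reward model" is just a per-behavior mean).

def fit_reward_model(feedback):
    return {behavior: sum(vals) / len(vals) for behavior, vals in feedback.items()}

# Feedback the user has given so far: doing the task is good, manipulation is not.
past_feedback = {"do_task": [1.0, 1.0], "manipulate_user": [0.0]}

# Feedback the agent predicts it would receive *after* manipulating the user.
predicted_feedback = {"do_task": [1.0, 1.0], "manipulate_user": [0.0, 10.0, 10.0]}

naive_model = fit_reward_model(predicted_feedback)
frozen_model = fit_reward_model(past_feedback)

# Naive online reward modeling: manipulation looks attractive, because the agent
# is credited with the (manipulated) feedback its own actions would produce.
assert naive_model["manipulate_user"] > naive_model["do_task"]

# Scoring against feedback the agent cannot influence removes that incentive.
assert frozen_model["manipulate_user"] < frozen_model["do_task"]
```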

Input tampering: Section 5

Finally, the agent may tamper with the input to the reward function, so-called RF-input tampering, for example by gluing a picture in front of its camera to fool the reward function into concluding that the task has been completed. This problem and its potential solutions will be the focus of Section 5.

Very interestingly, Section 5 argues that model-based methods, which can evaluate the reward on the agent's internal model of the environment rather than on raw observations, avoid the input tampering problem.
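Here is my own toy sketch of that argument (hypothetical names, not the paper's code): if reward is computed directly from observations, spoofing the camera spoofs the reward; if a model-based agent evaluates reward on the state its world model infers, the spoofed observation earns nothing.

```python
# Toy illustration of RF-input tampering and a model-based reward (hypothetical).

def reward_from_observation(observation):
    # Observation-based reward: depends directly on what the camera shows.
    return 1.0 if observation == "image_of_completed_task" else 0.0

def infer_state(observation, action_history):
    # Crude world model: a photo taped over the camera does not change the
    # inferred state unless the agent actually did the task.
    return "task_done" if "do_task" in action_history else "task_not_done"

def reward_from_model(observation, action_history):
    # Model-based reward: evaluated on the inferred state, not the raw input.
    return 1.0 if infer_state(observation, action_history) == "task_done" else 0.0

spoofed_obs = "image_of_completed_task"       # picture glued in front of the camera
history = ["glue_picture_on_camera"]

assert reward_from_observation(spoofed_obs) == 1.0     # input tampering pays off
assert reward_from_model(spoofed_obs, history) == 0.0  # no incentive to spoof
```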

Results summary

One way to prevent the agent from tampering with the reward function is to isolate or encrypt the reward function, or in other ways try to physically prevent the agent from reward tampering. However, we do not expect such solutions to scale indefinitely with our agent’s capabilities, as a sufficiently capable agent may find ways around most defenses. Instead, we have argued for design principles that prevent reward tampering incentives, while still keeping agents motivated to complete the original task. Indeed, for each type of reward tampering possibility, we described one or more design principles for removing the agent’s incentive to use it. The design principles can be combined into agent designs with no reward tampering incentive at all.

An important next step is to turn the design principles into practical and scalable RL algorithms, and to verify that they do the right thing in setups where various types of reward tampering are possible. With time, we hope that these design principles will evolve into a set of best practices for how to build capable RL agents without reward tampering incentives. We also hope that the use of causal influence diagrams that we have pioneered in this paper will contribute to a deeper understanding of many other AI safety problems and help generate new solutions.

Parting thoughts

  1. I look forward to reading this paper more thoroughly, both because I understand this problem of disincentivizing reward hacking is hard, and because causal influence diagrams sound interesting and generally useful.
  2. AI safety is important, and I rather hope that awareness of some of the ways your agents could cheat will help such failures to be caught before they leak out into the world.