documentation/reward_hack.myco

Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives. This is often done by finding an edge case in simulations (for [[reinforcement learning]] environments) or rules, optimizing for making a rater think actions or world-states are good rather than doing things they would [[reflectively endorse]], or finding a simple, narrow procedure to increase a score which was supposed to reward general behaviour.

A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|here]], though some are noncentral or non-examples.
Edit ‘reward_hack’ 2024-11-25 16:37:40 +00:00			Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives. This is often done by finding an edge case in simulations (for [[reinforcement learning]] environments) or rules, optimizing for making a rater think actions or world-states are good rather than doing things they would [[reflectively endorse]], or finding a simple, narrow procedure to increase a score which was supposed to reward general behaviour.
Create ‘reward_hack’ 2024-11-25 16:30:14 +00:00
Edit ‘reward_hack’ 2024-11-25 16:37:40 +00:00			`A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml\|here]], though some are noncentral or non-examples.`