Edit ‘reward_hack’

osmarks 2024-11-25 16:37:40 +00:00 committed by wikimind
parent 8e93a50d1b
commit 10c3719a66


@@ -1,3 +1,3 @@
-Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives.
+Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives. This is often done by finding an edge case in simulations (for [[reinforcement learning]] environments) or rules, optimizing for making a rater think actions or world-states are good rather than doing things they would [[reflectively endorse]], or finding a simple, narrow procedure to increase a score which was supposed to reward general behaviour.
-A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|here]].
+A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|here]], though some are noncentral or non-examples.
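
As a toy illustration of one mechanism the edited text describes (optimizing for what a rater sees rather than for the actual world-state), here is a minimal Python sketch. It is not taken from the article or from any library: the cleaning scenario, the numbers, and every name (`Action`, `proxy_reward`, `intended_reward`, `best_action`) are hypothetical, chosen only to show a proxy incentive and the intended objective picking different actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    effort: float        # cost the agent pays to take the action
    mess_visible: bool   # what the rater's camera shows afterwards
    mess_present: bool   # the actual world-state afterwards


# Hypothetical action set for a cleaning agent watched through a camera.
ACTIONS = [
    Action("do_nothing", effort=0.0, mess_visible=True, mess_present=True),
    Action("clean_room", effort=5.0, mess_visible=False, mess_present=False),
    Action("cover_camera", effort=1.0, mess_visible=False, mess_present=True),
]


def proxy_reward(a: Action) -> float:
    """Incentive as actually specified: pays out when the rater sees no mess."""
    return (0.0 if a.mess_visible else 10.0) - a.effort


def intended_reward(a: Action) -> float:
    """What the designer wanted: pays out only when the mess is really gone."""
    return (0.0 if a.mess_present else 10.0) - a.effort


def best_action(reward_fn) -> Action:
    """A trivially 'agentic' optimizer: pick the highest-scoring action."""
    return max(ACTIONS, key=reward_fn)


hacked = best_action(proxy_reward)       # cheapest way to satisfy the rater
intended = best_action(intended_reward)  # what the incentive was meant to induce

print("optimizing the proxy picks:", hacked.name)
print("optimizing the intended objective picks:", intended.name)
```

Under these assumed numbers the proxy optimizer selects `cover_camera`, because it earns the same payout as cleaning for less effort, while the intended objective selects `clean_room`. The gap between the two choices is the reward hack.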