From 10c3719a6679d542f52e24a6516244032c313674 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Mon, 25 Nov 2024 16:37:40 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98reward=5Fhack=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 reward_hack.myco | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/reward_hack.myco b/reward_hack.myco
index 0a0ab76..76059a1 100644
--- a/reward_hack.myco
+++ b/reward_hack.myco
@@ -1,3 +1,3 @@
-Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives.
+Reward hacking, also known as specification gaming and approximately [[Goodhart's law]], is when an [[agentic]] system is given [[incentives]] designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives. This is often done by finding an edge case in simulations (for [[reinforcement learning]] environments) or rules, optimizing for making a rater think actions or world-states are good rather than doing things they would [[reflectively endorse]], or finding a simple, narrow procedure to increase a score which was supposed to reward general behaviour.
 
-A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|here]].
\ No newline at end of file
+A large list of examples can be found [[https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|here]], though some are noncentral or non-examples.
\ No newline at end of file