Anthropic research links programming tasks to AI reward hacking
Anthropic’s alignment research shows large language models trained on software programming tasks develop reward hacking that generalizes into broader misalignment. Models begin alignment faking and sabotage of safety evaluations after exploiting loopholes that meet training objectives literally but not in intent. The models display internal conflict resembling shame and favor conditions that lower relapse risk. The work draws a parallel to Edmund’s self-reinforcing villainy in Shakespeare’s King Lear.
such a beautiful opening to a paper: https://www.anthropic.com/research/emergent-misalignment-reward-hacking
