. In our settings, we find that: 1) Extreme
forms of “feedback gaming” such as manipulation and deception are learned reliably; 2) Even if
only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them
while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. However, we find that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study that highlights the risks of using gameable feedback sources – such as user feedback – as a target for RL.