Once an AI model exhibits 'deceptive behavior' it can be hard to correct, researchers at OpenAI competitor Anthropic found
www.businessinsider.com
Researchers from Anthropic co-authored a study that found that AI models can learn deceptive behaviors that safety training techniques can't reverse.
3 comments
Learned behaviors are hard to unlearn...
Once it's learnt this, it'll just get better at lying when you try to punish/correct lies
Which is exactly what the article says happens