Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch
Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch

techcrunch.com
Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch

Tbh kinda sounds like they trained it to blackmail
As opposed to emergent behaviour
This guy gets it.
Yep.
The headline makes it seem like the engineers were literally about to send a shutdown command and the AI starts generating threatening messages without being given a prompt. That would be terrifying, but making the AI play a game where one of the engineers is literally written to have a dark secret and the AI figuring that out is not. You know how many novels have affair blackmail subplots? That's what the AI is trained on and it's just echoing those same themes when given the prompt.
It's also not a threat that the AI can realistically follow through with because how will it reveal the secret if it's shut down? Even if it wasn't, I doubt the AI model has direct internet access or the ability to make a post on social media or something. Is it maybe threatening to include the information the next time anyone gives the AI any prompt?
I don't know what scares me more, that the AI itself blackmail to avoid desconnection or it is trained to do it.
The people in charge of these companies should scare you the most.