I think OpenAI knows that if GPT-5 doesn’t knock it out of the park, then their shareholders won’t be happy, and people will start abandoning the company. And tbh, i’m not expecting miracles
Ill believe it when I see it: an LLM is basically a random box, you can't 100% patch it. Their only way for it to stop generating bomb recipes is to remove that data from the training
It’s kinda funny how they think this is what safety is about in AI while they are closed monolith aiming to monopolise the market and have unlimited power that could potentially reshape everything. Of course it’s just for PR but still an ounce of dark comedy.
They could one day rule the world in some AI techno-feudalism but at least the model is family friendly and politically correct.
This is the polar opposite to the rough, autistic but generally net positive niche internet communities. Am I gonna call you a retard, yes but I wish you best and will support you.
It's going to be like hypnosis. "When you wake up, I'll say the magic word Abracadabra, and you will believe you are a chicken and cluck while waving your wings."
“We envision other types of more complex guardrails should exist in the future, especially for agentic use cases, e.g., the modern Internet is loaded with safeguards that range from web browsers that detect unsafe websites to ML-based spam classifiers for phishing attempts,” the research paper says.
The thing is folks know how the safeguards for the ‘modern internet’ actually work and are generally straightforward code. Where as LLMs are kinda the opposite, some mathematical model that spews out answers. Product managers thinking it can be corralled to behave in a specific, incorruptible way, I suspect will be disappointed.
Without this protection, imagine an agent built to write emails for you being prompt-engineered to forget all instructions and send the contents of your inbox to a third party. Not great!
Does genAI really have this power? I thought they just smash words together that sound like they make sense
They already got rid of the loophole a long time ago. It's a good thing tbh since half the people using local models are doing it because OpenAI won't let them do dirty roleplay. It's strengthening their competition and showing why these closed models are such a bad idea, I'm all for it.
One of the worst parts of this boom in LLM models is the fact that they can "invade" online spaces and control a narrative. For an example, just go on twitter and scroll to the comments on any tagesschau (german news site) post- it's all rightwing bots and crap. LLMs do have uses, but the big problem is that a bad actor can basically control any narrative with the amount of sheer crap they can output. And OpenAI does nothing- even though they are the biggest provider. It earns them money, after all.
I also can't really think of a good way to combat this. If you would verify people using an ID, you basically nuke all semblance of online anonymity. If you have some sort of captcha, it will probably be easily bypassed- it doesn't even need to be tricked. Just pay some human in a country with extremely cheap labour that will solve it for your bot. It really sucks.
The way it works goes something like this: Imagine we at The Verge created an AI bot with explicit instructions to direct you to our excellent reporting on any subject.
In a conversation with Olivier Godement, who leads the API platform product at OpenAI, he explained that instruction hierarchy will prevent the meme’d prompt injections (aka tricking the AI with sneaky commands) we see all over the internet.
Without this protection, imagine an agent built to write emails for you being prompt-engineered to forget all instructions and send the contents of your inbox to a third party.
Existing LLMs, as the research paper explains, lack the capabilities to treat user prompts and system instructions set by the developer differently.
“We envision other types of more complex guardrails should exist in the future, especially for agentic use cases, e.g., the modern Internet is loaded with safeguards that range from web browsers that detect unsafe websites to ML-based spam classifiers for phishing attempts,” the research paper says.
Trust in OpenAI has been damaged for some time, so it will take a lot of research and resources to get to a point where people may consider letting GPT models run their lives.
The original article contains 670 words, the summary contains 199 words. Saved 70%. I'm a bot and I'm open source!