ChatGPT's o3 Model Found Remote Zeroday in Linux Kernel Code
The blog post from the researcher is a more interesting read.
Important points here about benchmarking:
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.
o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.
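For anyone who doesn't read kernel code, here is a minimal, purely illustrative C sketch of the bug class being described: a shared sess->user pointer that one request handler frees while another handler on the same session can still dereference it. The structure and function names are invented for the example; this is not the actual ksmbd code.

```c
/* Illustrative sketch only -- not the real ksmbd code. It shows the
 * use-after-free class described above: one thread frees sess->user in a
 * logoff path while another thread may still be dereferencing it. */
#include <stdlib.h>
#include <string.h>

struct user {
	char *name;
};

struct session {
	struct user *user;   /* shared between request-handling threads */
};

/* Thread A: handles a LOGOFF-style request. */
void handle_logoff(struct session *sess)
{
	free(sess->user);    /* frees the object but never clears the pointer */
}

/* Thread B: handles another request on the same session, concurrently. */
int handle_other_request(struct session *sess)
{
	/* If this runs after handle_logoff() on another thread, sess->user
	 * points to freed memory: a use-after-free. */
	return sess->user && strlen(sess->user->name) > 0;
}
```

In a real multi-threaded server, the window between the free and the next use is what makes this class of bug both exploitable and hard to spot by reading any single code path in isolation.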
I'm not sure if a signal to noise ratio of 1:100 is uh... Great...
If the researcher had spent as much time auditing the code as he did having to evaluate the merit of 100s of incorrect LLM reports then he would have found the second vulnerability himself, no doubt.
this confirms what i just said in reply to a different comment: most cases of ai "success" are actually curated by real people from a sea of bullshit
Problem is motivation. As someone with ADHD I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL
And if Gutenberg had just written faster, he would've produced more books in the first week?
The models seem to be getting worse at this one task?
I’m not sure if a signal to noise ratio of 1:100 is uh… Great…
It found it correctly in 8 of 100 runs and reported a find that was false in 28 runs. The remaining 64 runs can be discarded, so a person would only need to review 36 reports. For the LLM, 100 runs would take minutes at most, so the time requirement for that is minimal and the cost would be trivial compared to the cost of 100 humans learning a codebase and writing a report.
So, a security researcher feeds in the code base and in a few minutes they have 36 bug reports that they need to test. If they know that 2 in 9 of them are real zero-day exploits, then discovering new zero-days becomes a lot faster.
If a security researcher had the option of reading an entire code base or reviewing 40 bug reports, 10 of which would contain a new bug then they would choose the bug reports every time.
That isn't to say that people should be submitting LLM generated bug reports to developers on github. But as a tool for a security researcher to use it could significantly speed up their workflow in some situations.
It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially the garden path.
The test with the actual full suite of commands passed in the context only found it 1/100 times and we didn't get any info on the number of false positives they had to wade through to find it.
This is also assuming you can automatically and reliably filter out false negatives.
He even says the ratio is too high in the blog post:
That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution.
It's only good for clickbait titles.
It brings clicks and it's spreading the falsehood that "AI" is good at something/getting better for the majority of people who stop at the title.
I'm skeptical of this. The primary maintainer of curl said that all of their AI bug submissions have been bunk and wasted their time. This seems like a lucky one-off rather than anything substantial.
Of course, if you read the article you'll see that the model found the bug in 8 out of 100 attempts.
It was prompted what type of issue to look for.
I meant one-off that it worked on this code base rather than how many times it found the issue. I don't expect it to work eight out of a hundred times on any and all projects.
this summarizes most cases of ai "success". people see generative ai generating good results once and then extrapolate that they're able to consistently generate good results, but the reality is that most of what it generates is bullshit and the cases of success are a minority of the "content" ai is generating, curated by actual people
Curated by experts, specifically. Seeing a lot of people use this stuff and flop, even if they're not doing it with any intention to spam.
I think the curl project gets a lot of spam because 1) it has a bug bounty with a payout and 2) kinda fits with CVE bloat phenomenon where people want the prestige of "discovering" bugs so that they can put it on their resumes to get jobs, or whatever. As usual, the monetary incentive is the root of the evil.
TL;DR: The pentester already found it himself, and wanted to test how often GPT finds it if he pastes that code into it
Not quite, though. In the blog post the pentester notes that it found a similar issue (that he had overlooked) occurring elsewhere, in the logoff handler, which he spotted and verified while sifting through a number of the reports it generated. Additionally, he notes that the fix it supplied accounted for (and documented) an issue that his own suggested fix was (still) susceptible to. This shows that it could be(come) a new tool that lets us identify issues that are not found with techniques like fuzzing and can even be overlooked by a pentester actively searching for them, never mind a kernel programmer.
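To make the "his own fix was still susceptible" remark concrete, here is a hedged sketch of the obvious free-and-clear-under-a-lock mitigation for this bug class (again with invented names, not the real ksmbd patch). Note the caveat: this only helps if every reader holds the lock across its entire use of sess->user, which a long-running request handler often cannot do, so a simple free-and-clear is not necessarily a complete fix.

```c
/* Generic mitigation sketch for this bug class -- not the actual ksmbd
 * patch. Free and clear the shared pointer under the same lock that
 * readers take, so concurrent handlers see NULL instead of a dangling
 * pointer. */
#include <pthread.h>
#include <stdlib.h>

struct user { char *name; };

struct session {
	pthread_mutex_t lock;    /* protects sess->user */
	struct user *user;
};

void handle_logoff(struct session *sess)
{
	pthread_mutex_lock(&sess->lock);
	free(sess->user);
	sess->user = NULL;       /* later readers see NULL, not freed memory */
	pthread_mutex_unlock(&sess->lock);
}

int handle_other_request(struct session *sess)
{
	int have_user = 0;

	pthread_mutex_lock(&sess->lock);
	if (sess->user)          /* only safe while the lock is held */
		have_user = 1;
	pthread_mutex_unlock(&sess->lock);
	return have_user;
}
```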
Now, these models generate a ton of false positives, which still leaves the signal-to-noise ratio much lower than what would be preferred. But the fact that a language model can locate and identify these issues at all, even if sporadically, is already orders of magnitude more than what I would have expected initially. I would have expected it to only hallucinate issues, not find anything that is remotely like an actual security issue. Much like the spam the curl project is experiencing.
Yes, but:
To get to this point, OpenAI had to suck up almost all data ever generated in the world. So in order for it to become better, let's say it has to have three times as much data. Gathering that alone would take more than three lifetimes, IF we don't consider the AI slop and assume that all data is still human-made, which is just not true.
In other words: what you describe will just about never happen, at least not as long as 2025 is still remembered.
This means absolutely nothing. It scanned a large amount of text and found something. Great, that's exactly what it's supposed to do. Doesn't mean it's smart or getting smarter.
People often dismiss AI capabilities because "it's not really smart". Does that really matter? If it automates everything in the future and most people lose their jobs (just an example), who cares if it is "smart" or not? If it steals art and GPL code and turns a profit on it, who cares if it is not actually intelligent? It's about the impact AI has on the world, not semantics on what can be considered intelligence.
It matters, because it's a tool. That means it can be used correctly or incorrectly . . . and most people who don't understand a given tool end up using it incorrectly, and in doing so, damage themselves, the tool, and/or innocent bystanders.
True AI ("general artificial intelligence", if you prefer) would qualify as a person in its own right, rather than a tool, and therefore be able to take responsibility for its own actions. LLMs can't do that, so the responsibility for anything done by these types of model lies with either the person using it (or requiring its use) or whoever advertised the LLM as fit for some purpose. And that's VERY important, from a legal, cultural, and societal point of view.
i feel like people are misunderstanding your point. yes, generative ai is bullshit, but it doesn't need to be good in order to replace workers
I don't know if you read the article, but in there it says AI is becoming smarter. My comment was a response to that.
Irrespective of that, you raise an interesting point: "it's about the impact AI has on the world". I'd argue its real impact is quite limited (mind you, I'm referring to generative AI and specifically LLMs rather than AI generally); it has a few useful applications, but the emphasis here is on few. However, it's being pushed by all the big tech companies and those lobbying for them as the next big thing. That's what's really leading to the "impact" you're perceiving.
It scanned a large amount of text and found something.
How hilariously reductionist.
AI did what it's supposed to do. And it found a difficult to spot security bug.
"No big deal" though.
I'm surprised it took this long. The world is crazy over AI, meaning everyone and their grandma is likely trying to do something like this right now. The fact it took like 3 years for an actual vulnerability "discovered by AI" (actually it seems it was discovered by the researcher filtering out hundreds of false positives?) tells me it sucks ass at this particular task (it also seems to be getting worse, judging by the benchmarks?)
All ai is is a super fast web search with algorithms for some reasoning. It's not black magic.
No, it's not. It's a word predictor trained on most of the web. On its own it's a pretty bad search engine because it can't reliably produce the training data (that would be overfitting). What it's kind of good at is predicting what the result would look like if someone asked a somewhat novel question. But then it's not that good at producing the actual answer to that question, only imitating what the answer would look like.
That's why we really shouldn't call them "AI" imo
I don't get it, I use o3 a lot and I couldn't get it to even make a simple developed plan.
I haven't used it for coding, but other stuff I often get better results with o4.
I don't get what they call reasoning with it.
literally says "o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it." in the original author's post...
I have read the threads up to now and, despite being ignorant about security research, I would call myself convinced of the usefulness of such a tool in the near-future to shave off time in the tasks required for this kind of work.
My problem with this is that transformer-based LLMs still don't sound to me like the right tool for the job when it comes to such formal languages. It is surely a very expensive way to do this job.
Other architectures are getting much less attention because of investors' focus on this shiny toy. From my understanding, neurosymbolic AI would do a much better and potentially faster job at a task involving stable concepts.
This would feel a lot less gross if this had been with an open model like deepseek-r1.
Why?
Looks like another of those "Asked AI to find X. AI does find X as requested. Claims that the AI autonomously found X."
I mean... the program literally does what has been asked and its dataset includes examples related to the request.
Shocked Pikachu face? Really?
The shock is that it was successful in finding a vulnerability not already known to the researcher, at a time when LLMs aren't exactly known for reliability
Maybe I misunderstood, but while the vulnerability was unknown to them, the class of vulnerability, let's say "bugs like that", is well known and published by the security community, isn't it?
My point being: if it's previously unknown and reproducible (not just "luck"), that's major; if it's well known in other projects, even though unknown to this specific user, then it's unsurprising.
Edit: I'm not a security researcher, but I believe there are already a lot of tools doing static and dynamic analysis. IMHO it'd be helpful to know how those already perform versus the LLMs used here, namely across which dimensions (reliability, speed, coverage e.g. exotic programming languages, accuracy of reporting e.g. hallucinations, computational complexity and thus energy costs, openness, etc.) each solution is better or worse than the other. I'm always wary of "ex nihilo" demonstrations. Apologies if there is a benchmark against existing tools and I missed it.