Can someone tell me where the average everyday human, someone walking around with a regular job, kids, and a mortgage, would land on this leaderboard? That's who we should be comparing against.
The fact that the only formal comparisons ever done for AI systems are explicitly against the highest-performing, narrowly focused humans tells me how unprepared society is for what's happening.
Appreciate that at the point at which a machine unambiguously demonstrates superhuman performance across all human tasks (and make no mistake, that *is* the bar this blog post and every other post about AI sets), it's completely over for the human race, unless someone figures out an entirely new economic system.
The average person is bad at literally almost everything.
If I want something done, I'll seek out someone with a skill set that matches the problem.
I don't want AI to be as good as an average person. I want AI to be better than the person I would go to for help. A person can talk with me, understand where I've misunderstood my own problem, can point out faulty assumptions, and may even tell me that the problem isn't even a problem that needs solving. A person can suggest a variety of options and let me decide what trade-offs I want to make.
If I don't trust the AI to do that, then I'm not sure why I'd use it for anything other than things that don't need to be done at all; unless, that is, I can justify the chance that it might get things right, and can afford the time lost redoing the work without the AI when it doesn't.
Machines have always had superhuman capabilities in narrow domains. The LLM domain is quite broad, but it's still just an LLM, beholden to its training.
The average everyday human does not have the time to read all available math texts. LLMs do, but they still can't get bronze. What does that say about them?
Average humans, no. Mathematicians with enough time and a well-indexed database of millions of similar problems, probably.
We don't allow chess players to access a Syzygy tablebase in a tournament.
The average human would score exactly 0 at the IMO.
That’s not how modern societies/economies work.
We have specialists everywhere.
> average everyday human
The average math major can't get Bronze.
One interesting takeaway for me, a non-practitioner, was that the models appear to be fairly decent at judging their own output.
They used best-of-32 and had the same model judge a "tournament" to find the best answer. Seems like something that could be bolted on reasonably easily, e.g. in, say, a WebUI; a rough sketch follows below.
edit: forgot to add that I'm curious if this translates to smaller models as well, or if it requires these huge models.
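A minimal sketch of that tournament-style best-of-n selection, assuming hypothetical generate(prompt) and judge(prompt, a, b) wrappers around whatever model API you're using; this is my guess at the shape of it, not MathArena's actual harness:

    import random

    def generate(prompt: str) -> str:
        """Hypothetical wrapper: return one sampled answer from the model."""
        raise NotImplementedError

    def judge(prompt: str, a: str, b: str) -> str:
        """Hypothetical wrapper: ask the same model to pick the better answer."""
        raise NotImplementedError

    def best_of_n(prompt: str, n: int = 32) -> str:
        # Sample n candidate answers, then run a single-elimination
        # "tournament" where the model itself picks pairwise winners.
        candidates = [generate(prompt) for _ in range(n)]
        while len(candidates) > 1:
            random.shuffle(candidates)
            winners = [judge(prompt, a, b)
                       for a, b in zip(candidates[::2], candidates[1::2])]
            if len(candidates) % 2:      # odd one out gets a bye
                winners.append(candidates[-1])
            candidates = winners
        return candidates[0]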
So the gold medal claims in https://news.ycombinator.com/item?id=44613840 look exaggerated.
The whole competition is unfair anyway. An "AI" has access to millions of similar problems stolen and encoded in the model. Humans would at least need access to a similar database; think an open-database exam, a nuclear version of an open-book exam.
Easy benchmark that's hard to fake: data compression. Intelligence is largely about creating compact predictive models and so is data compression. The output should be a program generating the sequence or the dataset, based on entry id or nearby data points. Typical LLM bullshit won't work here because the output isn't English prose that can fool a human.
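A minimal sketch of that scoring rule, assuming the model's submission arrives as Python source defining a generate(i) function (an interface invented here for illustration); the score is "does it reproduce the data, and how short is it", which prose can't game:

    def score_submission(program_src: str, dataset: list) -> float:
        """Score a model-written program on how compactly it reproduces
        the dataset. Lower is better; inf means failure."""
        ns = {}
        try:
            exec(program_src, ns)  # submission must define generate(i)
            ok = all(ns["generate"](i) == x for i, x in enumerate(dataset))
        except Exception:
            return float("inf")
        if not ok:
            return float("inf")    # wrong output: no partial credit
        return float(len(program_src.encode()))  # shorter = more compressed

    # Usage: a squared-integer sequence should be beaten by a one-liner.
    data = [i * i for i in range(1000)]
    print(score_submission("def generate(i):\n    return i * i\n", data))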
> For Problem 5, models often identified the correct strategies but failed to prove them, which is, ironically, the easier part for an IMO participant. This contrast ... suggests that models could improve significantly in the near future if these relatively minor logical issues are addressed.
Interesting, but I'm not sure this is really due to "minor logical issues". It sounds like a failure due to a lack of actual understanding (the world-model problem). Perhaps the actual answers from the AIs might hold some hints, but I can't find them.
(EDIT: oops, found the output on the main page of their website. Didn't expect that.)
> Best-of-n is Important ... the models are surprisingly effective at identifying the relative quality of their own outputs during the best-of-n selection process and are able to look past coherence to check for accuracy.
Yes, it's always easier to be a backseat driver.
>Yes, it's always easier to be a backseat driver
Any model that can identify the correct answer reliably can arrive at the correct answer given enough time and stochasticity.
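That's the generator-verifier gap in one line. As a minimal sketch, with hypothetical generate and is_correct callables passed in (not any particular library's API):

    def solve_by_sampling(prompt, generate, is_correct, max_tries=1000):
        # If verification is reliable, stochastic sampling plus rejection
        # turns the verifier into a slow, expensive solver.
        for _ in range(max_tries):
            answer = generate(prompt)       # fresh sample each call
            if is_correct(prompt, answer):
                return answer
        return None                         # ran out of tries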
How quickly we shift our expectations. If you had told me 5 years ago we'd have technology that can do this, I wouldn't have believed you.
This isn't to say we shouldn't think critically about the use and performance of models, but "Not Even Bronze..." turned me off to this critique.
What else should people do? If we just saturate at "wow this is amazing!" there's nothing to talk about, nothing to evaluate, nothing to push the boundaries forward further (or caution against doing so).
Yes, we're all impressed, but it's time to move on and start looking at where the frontier is and who's on it.
In 2024 AlphaProof reached Silver level, so people rightly expect a lot now.
(It's specifically trained on formalized math problems, unlike most LLMs, so it's not an apples-to-apples comparison.)
LLMs are really good with words and kind of crap at “thinking.” Humans are wired to see these two things as tightly connected. A machine that thinks poorly and talks great is inherently confusing. A lot of discussion and disputes around LLMs comes down to this.
It wasn’t that long ago that the Turing Test was seen as the gold standard of whether a machine was actually intelligent. LLMs blew past that benchmark a year or two ago and people barely noticed. This might be moving the goalposts, but I see it as a realization that thought and language are less inherently connected than we thought.
So yeah, the fact that they even do this well is pretty amazing, but they sound like they should be doing so much better.
Here are the IMO problems if you want to give them a try:
https://www.imo-official.org/year_info.aspx?year=2025 (download page)
They are very difficult.
> Each model was run with the recommended hyperparameters and a maximum token limit of 64,000. No model needs more than this number of tokens
I'm a little confused by this. My assumptions (possibly incorrect!): 64k tokens per prompt, they are claiming the model wouldn't need more tokens even for reasoning
Is that right? Would be helpful to see how many tokens the models actually used.
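For what it's worth, both the cap and the actual usage are visible with an OpenAI-style client. A sketch, assuming the standard chat completions API (the model name is a placeholder, and whether a given model accepts a 64k output cap varies by provider):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute whichever model you test
        messages=[{"role": "user", "content": "Solve IMO 2025 Problem 1."}],
        max_tokens=64000,  # output cap; how reasoning tokens count varies by model
    )
    print(resp.usage.completion_tokens)  # tokens the model actually generated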
they didn't even do a (non-ML) agentic descent? like have a quick API loop that requeries itself, generating new context?
"ok here is my strategy, here are the five steps", then requery with the strategy for a proof of step 1, 2, 3...
in a DFS (rough sketch below)
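Something like this, as a minimal sketch, with a hypothetical ask(prompt) wrapper around whatever model you're querying (the step-splitting and failure check are deliberately crude):

    def dfs_prove(problem: str, ask, depth: int = 0, max_depth: int = 3) -> str:
        # Step 1: ask for a strategy as numbered steps.
        strategy = ask(f"Give a numbered list of proof steps for:\n{problem}")
        steps = [s.strip() for s in strategy.splitlines() if s.strip()]
        proofs = []
        for step in steps:
            # Step 2: requery with fresh context for each step, descending
            # depth-first into sub-steps when a direct proof fails.
            proof = ask(f"Prove this step:\n{step}\n(from problem: {problem})")
            if "cannot" in proof.lower() and depth < max_depth:
                proof = dfs_prove(step, ask, depth + 1, max_depth)
            proofs.append(proof)
        return "\n\n".join(proofs)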
99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting, and a problem worth solving, but ignores the vast majority of the (near-term at least) impact they will have.
I was hoping to see the questions (which I can probably find online), but also the answers from models and the judge's scores! Am I missing a link? Without that I can't tell whether I should be impressed or not.
> Gemini 2.5 Pro achieved the highest score with an average of 31% (13 points). While this may seem low, especially considering the $400 spent on generating just 24 answers
What? That’s some serious cash for mostly wrong answers.
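For scale, the arithmetic behind those numbers:

    $400 / 24 answers ≈ $16.67 per generated answer
    31% of 42 available points (6 problems × 7 points each) ≈ 13 points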
this makes me really wonder: what is the underlying practical mathematical skill?
intuition????
In a few months (weeks, days - maybe it has already happened) models will have much better performance on this test.
Not because of actual increased “intelligence” but because the test would be included in model’s training data - either directly or indirectly where model developers “tune” their model to give better performance on this particular attention driving test.
Related: https://news.ycombinator.com/item?id=44613840
"You know that really hard test thing that most humans on the planet can't do, or even understand, yeah, LLMs kind of suck at it too"
Meanwhile Noam "well aschtually..."
I love how people are still betting against AI; it's hilarious. Please write more 2000s-esque "The internet is a fad" articles.