> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student does, that should count as a win. That’s a much more human-like reasoning ability, than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
Thing is, a LLM is nothing but a prediction algorithm based upon what it trained. So it missing basic calculator functionality is a given. This is why tool usage is more and more a thing for LLMs. So that the LLM can from itself use a calculator for the actual math parts it needs. Thus increasing accuracy ...
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and reasoning in it? I certainly think reasoning how and when to use allowed tools and when not to is a big part of the reasoning and verification process E.g. most human math scores allow for a pen and paper calculation, or even a calculator, and that can be a great way to say spot check a symbolic derivative and see it needs to be revisited without relying on the calculator/paper to do the actual reasoning for the testee. Or to see the equation for motion of a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid level physics course haha).
At the very least, the scores for benchmarking a human on such a test with and without tools would be different to comparing an LLM without the analogous constraints. Which is (IMO) a useful note in comparing reasoning abilities and why I thought it was interesting to note this kind of testing is just called testing with tools on the LLM side (not sure there is an equally as standard term on the human testing side? Guess the same could be used for both though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are 3-digit numbers so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
I don't claim to know anything but I thought tool usage was a major sign of intelligence. For example floats are a wonderful technology but people use them as if chainsaws are great for cutting bread and butter. We now have entire languages that cant do basic arithmetic. I thought it was alarming: People it cant compute like this! Now we have language models, those are still computers, why cant we just give them.. you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
>the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLM’s perform in the ways most people actually use them. Maybe I’m off base here though
Benchmarks are nothing more than highly contextual specs (in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.
This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
we try to make benchmarks for users, but it's like that 20% article - different people want different 20% and you just end up adding "features" and whackamoling the different kinds of 20%
if a single benchmark could be a universal truth, and it was easy to figure out how to do it, everyone would love that.. but that's why we're in the state we're in right now
I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks
Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.
Just having the benchmark in the first place is what gives model makers something to optimize for.
Generating footage wouldn't help with the opposite but navigating a simulation would which is a pretty standard type of evaluation for multimodal AIs designed to act in the real world.
Do you mean that it wouldn't help with ingesting footage and then determining how to act?
I can imagine a robotics architecture where you have one model generating footage (next frames for what it is currently seeing) and another dumber model which takes in the generated footage and only knows how to generate the motor/servo control outputs needed to control whatever robot platform it is integrated with.
I think that kind of architecture decoupling would be nice. It allows the model with all the world and task-specific knowledge to be agnostic from its underlying robot platform.
For statistical AI models, we can use out of sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models do have a pre-utility step wherein it can be shown out of sample prediction epsilon is minimized).
I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e cli tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.
My use-case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana oauth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some datasources). The IaC stuff is also not mainstream as I'm not using any of the big cloud providers, but the provider I use nonetheless has a terraform provider.
As you can imagine, there's probably not much training data for most of this, so quality of the responses varies widely. From my experience so far Claude (Sonnet 4.5 ) does a _much_ better job than GTP-5 (Codex or normal) with the day-to-day task. Stuff like keeping documentation up to date, spotting inconsistencies, helping me find blind spots in the Alerting rules, etc. It also seems to do better working with provided documentation / links.
I've been using Claude for a couple of weeks now but recently switched to codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it but I gotta say, so far, I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain). The results it produces take much more effort to clean up than Claude's. Probably on a level where I could just invest the time myself. Might be that I do not yet know how to correctly prompt GPT but giving both tools the same prompt, Claude does a better job 90% of the time.
Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely and is worth testing several models with. Especially if your work is not 70+% of coding. Even then I guess that many benchmarks have seized being useful by now?
SV companies/bloggers/press/etc are perpetually bad at benchmarks. For browsers they kept pushing simplistic javascript-centric benchmarks even when it was clear for at least 15 years that layout/paint/network/etc were the dominant bottlenecks in real-world usage.
It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you could, and even if you have the right background you can do benchmarks full time and they still would be a mess.
Product testing (with traditional A/B tests) are kind of the best bet since you can measure what you care about _directly_ and at scale.
I would say there is of course “benchmarketing” but generally people do sincerely want to make good benchmarks it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use,
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.
Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:
- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")
- the benchmarks are almost never predictive of the performance of real world workloads anyway
- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
AND this is a field where the economic incentives for accurate predictions are enormous.
In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.
Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.
It's not like there's a shortage of skills in this area, it seems like our one specific industry just has a weird blindspot.
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.
Human raters are exploitable, and you never know whether the B has a genuine performance advantage over A, or just found a meat exploit by an accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
Are you talking about just preferences or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for like correctness, you’re not really measuring correctness you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
Yes. All of them are poisoned metrics, just in different ways.
GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.
Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing rewards them indeed. Direct optimization is even worse. Combining the two is ruinous.
Mind, I'm not saying that those metrics are useless. Radioactive materials aren't useless. You just got to keep their unpleasant properties in mind at all times - or suffer the consequences.
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLM stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by a coincidence.
It's biased to small context performance, which is why I don't pay much attention to it as a developer aside from a quick glance. I need performance at 40-100k tokens which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.
AI (and humans!) aside, claiming that there was an oracle that could "answer all questions" is a solved problem. Such a thing cannot exist.
But this is going already too deep IMO.
When people start talking about percentages or benchmark scores, there has to be some denominator.
And there can be no bias-free such denominator for
- trivia questions
- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)
- historical or policital questions
I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this, I'm aware there are plenty already. Maybe AI will be capable to be a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.
Even if the whole body of questions/tasks/prompts would be very constrained and cover only a single domain, it seems impossible to guarantee that such benchmark is "bias-free" (I know AGI folks love this word).
Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...
There the problem begins to be honest: I don't even know how to align the "benchmark" claims with the kind of AI they are examinin and the ones I know exist.
Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit.
Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?
I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.
I am and was impressed by AI and deep learning, but to this day I am thorougly disappointed by the hubris of snakeoil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".
I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.
Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.
Because I think there are impressive results, it's just becomimg very hard to see through the bullshit at as an average person.
I would also love to understand mroe about the current state of the research on the "LLMs as compression" topic [2][3].
I'm already quite put off by the title (it's science -- if you have a better benchmark, publish it!), but the contents aren't great either. It keeps citing numbers about "445 LLM benchmarks" without confirming whether any of the ones they deem insufficiently statistical are used by any of the major players. I've seen a lot of benchmarks, but maybe 20 are used regularly by large labs, max.
"For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
For a math-based critique, this seems to ignore a glaring problem: is it even possible to randomly sample all natural numbers? As another comment pointed out we wouldn't even want to ("LLMs can't accurately multiply 6-digit numbers" isn't something anyone cares about/expected them to do in the first place), but regardless: this seems like a vacuous critique dressed up in a costume of mathematical rigor.
At least some of those who design benchmark tests are aware of these concerns.
In related news, at least some scientists studying climate change are aware that their methods are imperfect. More at 11!
I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.
What changed to make "the inevitable AI bubble" the dominant narrative in last week or so?
Link those comments please because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take and basically the default on this site.
The market was down for AI related stocks especially, while down only over 3% it’s the worst week since April, and there’s no single event that is to blame it just looks like market sentiment has shifted away from the previous unchecked exuberance.
> "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
When models figure out how to exploit an effect that every clever college student exploits, that should count as a win. That’s a much more human-like reasoning ability than the ability to multiply large numbers or whatever (computers were already good at that, to the point that it has become a useless skill for humans to have). The point of these LLMs is to do things that computers were bad at.
I don’t think the fact that LLMs can handle small numbers more reliably has anything to do with their reasoning ability. To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
However:
> Testing only on these problems would not predict performance on larger numbers, where LLMs struggle.
Since performance on large numbers is not what these exams are intended to test for, I don’t see this as a counterargument, unless the benchmarks are misrepresenting what is being tested for.
> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.
Thing is, an LLM is nothing but a prediction algorithm based on what it was trained on. So it missing basic calculator functionality is a given. This is why tool usage is more and more of a thing for LLMs, so that the LLM can itself use a calculator for the actual math parts it needs. Thus increasing accuracy ...
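For illustration, here is a minimal sketch of that "let the model call a calculator" loop. The CALC(...) convention and the call_model stub are assumptions made up for this sketch; real stacks use the providers' structured tool/function-calling APIs, but the plumbing looks roughly like this:

    # Minimal sketch of LLM tool use for arithmetic. call_model is a stub for
    # whatever model client you have; only the calculator plumbing is real.
    import ast, operator, re

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

    def calculator(expr: str) -> str:
        """Safely evaluate a plain arithmetic expression (no names, no calls)."""
        def ev(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp):
                return OPS[type(node.op)](ev(node.operand))
            raise ValueError("unsupported expression")
        return str(ev(ast.parse(expr, mode="eval").body))

    def answer_with_tool(question: str, call_model) -> str:
        """Ask the model; if it emits CALC(<expr>), run the tool and ask again."""
        reply = call_model(question)
        match = re.search(r"CALC\((.+?)\)", reply)
        if match:
            reply = call_model(f"{question}\nTool result: {calculator(match.group(1))}")
        return reply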
This is a very unserious take. It's not ironic, because it's not a calculator.
What's the meaning of `computer`, remind me quick?
Computer vision algorithms run on computers and they can’t do basic arithmetic.
My email client runs on my computer and it doesn’t do basic arithmetic either.
Something running on a computer does not imply that it can or should do basic arithmetic
Pencil and paper is just testing with tools enabled.
You seem to be addressing an argument that wasn’t made.
Personally, I’d say that such tool use is more akin to a human using a calculator.
I'm not addressing an argument, just stating that's already a form of LLM testing done today for people wanting to look at the difference in results the same as the human analogy.
Okay, but then I don’t understand why you replied to my comment for that, there is no direct connection to what I wrote, nor to what bee_rider wrote.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math exams allow a pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot-check a symbolic derivative and see it needs to be revisited without relying on the calculator/paper to do the actual reasoning for the testee (a tiny numeric spot-check of that kind is sketched after this comment). Or to see that the equation of motion of a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid-level physics course haha).
At the very least, the scores for benchmarking a human on such a test with and without tools would be different from benchmarking an LLM without the analogous constraints. Which is (IMO) a useful note in comparing reasoning abilities, and why I thought it was interesting to note that this kind of testing is just called testing with tools on the LLM side (not sure there is an equally standard term on the human testing side? Guess the same could be used for both though).
At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.
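A tiny sketch of the "plug in test values" spot check mentioned two comments up. The function and its claimed derivative are made-up examples; the point is that a cheap numeric check can flag a wrong symbolic result without doing the actual reasoning for you:

    # Spot-check a claimed symbolic derivative with a central difference at a
    # few test points, instead of trusting the pen-and-paper algebra blindly.
    def f(x):
        return 2.0 * x**3 + 5.0 * x        # example function (made up)

    def claimed_dfdx(x):
        return 6.0 * x**2 + 5.0            # claimed derivative of f

    def spot_check(func, dfunc, points, h=1e-6, tol=1e-4):
        for x in points:
            numeric = (func(x + h) - func(x - h)) / (2 * h)   # central difference
            if abs(numeric - dfunc(x)) > tol * max(1.0, abs(numeric)):
                return False, x
        return True, None

    ok, bad_x = spot_check(f, claimed_dfdx, [-2.0, 0.0, 1.5, 10.0])
    print("derivative looks plausible" if ok else f"revisit the algebra near x={bad_x}")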
I’d say it’s fair for LLMs to be able to use any tool in benchmarks, so long as they are the ones to decide to use them.
Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.
> Since performance on large numbers is not what these exams are intended to test for,
How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?
No. AIME is a test for advanced high schoolers that mostly tests higher-level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are integers from 0 to 999, so grading is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
A discussion on models "figuring out" things: https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden Technique)
LLMs can probably be taught or configured to use external tools like Excel or Mathematica when such calculations are needed. Just like humans. There are plenty of untapped optimization opportunities.
I don't claim to know anything, but I thought tool usage was a major sign of intelligence. For example, floats are a wonderful technology, but people use them as if chainsaws were great for cutting bread and butter. We now have entire languages that can't do basic arithmetic. I thought it was alarming: people, it can't compute like this! Now we have language models; those are still computers, so why can't we just give them... you know... calculators? Arguably the best thing their universe has to offer.
edit: I forgot my point: calculating big numbers is not a real world problem anyone has.
We do? Tool use started coming into vogue around 2023.
> the point of these LLMs is to do things that computers were bad at.
The way they’re being deployed it feels like the point of LLMs is largely to replace basic online search or to run your online customer support cheaply.
I’m a bit out on a limb here because this is not really my technical expertise by any stretch of the imagination, but it seems to me these benchmark tests don’t really tell us much about how LLMs perform in the ways most people actually use them. Maybe I’m off base here though.
Benchmarks are nothing more than highly contextual specs (in traditional code). They demonstrate your code works in a certain way in certain use cases, but they do not prove your code works as expected in all use cases.
This is solvable at the level of an individual developer. Write your own benchmark for code problems that you've solved. Verify tests pass and that it satisfies your metrics like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
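As a sketch of what such a personal harness can measure (the streaming client is left as a placeholder for whatever API-key or local-model setup you use; TTFT is time to first token):

    # Minimal timing harness: time-to-first-token (TTFT) and tokens/second for
    # any client that yields tokens as a stream. stream_completion is a
    # placeholder you supply; pair this with pass/fail checks from problems
    # you've already solved so correctness is part of the score too.
    import time
    from typing import Callable, Iterable

    def time_stream(stream: Iterable[str]) -> dict:
        start = time.perf_counter()
        first = None
        count = 0
        for _token in stream:
            count += 1
            if first is None:
                first = time.perf_counter()
        end = time.perf_counter()
        ttft = (first - start) if first is not None else float("nan")
        tok_s = count / (end - start) if end > start else 0.0
        return {"ttft_s": ttft, "tok_per_s": tok_s, "tokens": count}

    def run_benchmark(prompts: list[str],
                      stream_completion: Callable[[str], Iterable[str]]) -> None:
        for prompt in prompts:
            stats = time_stream(stream_completion(prompt))
            print(f"{prompt[:40]!r}: {stats}")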
Well, OpenAI's GitHub is open for writing evaluations. Just add yours there, and it's guaranteed that the next model will perform better on them.
I think that's what this site is doing: https://aistupidlevel.info/
Benchmarks optimize for fundraising, not users. The gap between "state of the art" and "previous gen" keeps shrinking in real-world use, but investors still write checks based on decimal points in test scores.
we try to make benchmarks for users, but it's like that 20% article - different people want different 20% and you just end up adding "features" and whackamoling the different kinds of 20%
if a single benchmark could be a universal truth, and it was easy to figure out how to do it, everyone would love that.. but that's why we're in the state we're in right now
I'd like to see some video generation benchmarks. For example, one that tested a model's ability to generate POV footage of a humanoid form carrying out typical household tasks
Even if it requires human evaluators at first, and even if the models completely suck at this task right now: it seems like the kind of task you'd want them to be good at, if you want these models to eventually carry out these tasks in embodied forms in the real world.
Just having the benchmark in the first place is what gives model makers something to optimize for.
Generating footage wouldn't help with the opposite, but navigating a simulation would, which is a pretty standard type of evaluation for multimodal AIs designed to act in the real world.
Do you mean that it wouldn't help with ingesting footage and then determining how to act?
I can imagine a robotics architecture where you have one model generating footage (next frames for what it is currently seeing) and another dumber model which takes in the generated footage and only knows how to generate the motor/servo control outputs needed to control whatever robot platform it is integrated with.
I think that kind of architecture decoupling would be nice. It allows the model with all the world and task-specific knowledge to be agnostic from its underlying robot platform.
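Roughly the decoupling being described, with invented names (nothing here is an existing robotics API): the world/video model stays platform-agnostic, and only the controller knows the specific robot:

    # Sketch of the proposed split: a platform-agnostic model predicts the next
    # POV frames for a task, and a platform-specific controller turns the
    # current/target frames into motor commands. All names are invented.
    from typing import Protocol, Sequence

    Frame = bytes              # stand-in for an image tensor
    MotorCommand = list[float]

    class WorldModel(Protocol):
        def predict_frames(self, recent: Sequence[Frame], task: str) -> Sequence[Frame]:
            """Generate the next few POV frames for the task (e.g. 'load dishwasher')."""
            ...

    class Controller(Protocol):
        def track(self, current: Frame, target: Frame) -> MotorCommand:
            """Produce servo outputs that move the robot toward the target frame."""
            ...

    def control_step(world: WorldModel, ctrl: Controller,
                     history: Sequence[Frame], task: str) -> MotorCommand:
        target = world.predict_frames(history, task)[0]
        return ctrl.track(history[-1], target)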
For statistical AI models, we can use out-of-sample prediction error as an objective measure to compare models. What makes evaluating LLMs difficult is that comparisons are inextricable from utility (whereas statistical AI models have a pre-utility step wherein it can be shown that the out-of-sample prediction error is minimized).
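A toy version of that pre-utility comparison, with synthetic data: fit candidate models on a training split only, then rank them by prediction error on a held-out split:

    # Compare statistical models by out-of-sample error: fit on a training
    # split, score mean squared error on data the fit never saw.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 500)
    y = 1.5 * x + 0.5 * x**2 + rng.normal(0, 1, 500)     # synthetic data

    train, test = slice(0, 400), slice(400, 500)

    def oos_mse(degree: int) -> float:
        coefs = np.polyfit(x[train], y[train], degree)   # fit on train only
        pred = np.polyval(coefs, x[test])
        return float(np.mean((y[test] - pred) ** 2))     # error on unseen data

    for d in (1, 2, 8):
        print(f"degree {d}: out-of-sample MSE = {oos_mse(d):.3f}")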
I wish the big providers would offer some sort of trial period where you can evaluate models in a _realistic_ setting yourself (i.e. CLI tools or IDE integrations). I wouldn't even mind strict limits -- just give me two hours or so of usage and I'd already be happy. Seriously.
My use case is probably pretty far from the usual tasks: I'm currently implementing a full observability platform based on VictoriaMetrics / VictoriaLogs + Grafana. It's quite elaborate and has practically no overlap with the usual/cloud solutions you find out there. For example, it uses an authenticated query stack: I use the Grafana OAuth token to authenticate queries by injecting matchers via prom-label-proxy and forward that to promxy for fan-out to different datasources (using the label filter to only query some of them); a naive sketch of that injection step follows this comment. The IaC stuff is also not mainstream, as I'm not using any of the big cloud providers, but the provider I use nonetheless has a Terraform provider.
As you can imagine, there's probably not much training data for most of this, so the quality of the responses varies widely. From my experience so far, Claude (Sonnet 4.5) does a _much_ better job than GPT-5 (Codex or normal) with the day-to-day tasks: stuff like keeping documentation up to date, spotting inconsistencies, and helping me find blind spots in the alerting rules. It also seems to do better working with provided documentation / links.
I had been using Claude for a couple of weeks but recently switched to Codex after my subscription to Claude ran out. I was really curious after reading a lot of good things about it, but I gotta say, so far I'm not impressed. Compared to Claude it gives wrong answers much more frequently (at least in this domain), and the results it produces take much more effort to clean up than Claude's -- probably to the point where I could just invest the time myself. It might be that I don't yet know how to correctly prompt GPT, but given the same prompt, Claude does a better job 90% of the time.
Anyway, I guess this is my long-winded way of saying that the quality of responses "off the beaten track" varies widely and is worth testing several models for. Especially if your work is not 70+% coding. Even then, I guess many benchmarks have ceased being useful by now?
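A deliberately naive sketch of the matcher-injection step described a few comments up. The real prom-label-proxy parses PromQL properly; this only handles the trivial cases, and the function is made up purely for illustration:

    # Naive illustration of injecting a tenant matcher into a PromQL query
    # before forwarding it upstream (e.g. to promxy). Not production logic.
    import re

    def inject_matcher(promql: str, label: str, value: str) -> str:
        matcher = f'{label}="{value}"'
        if "{" in promql:
            # up{job="node"} -> up{tenant="team-a",job="node"}
            return re.sub(r"\{", "{" + matcher + ",", promql, count=1)
        # bare metric name: up -> up{tenant="team-a"}
        return re.sub(r"^(\w+)", r"\1{" + matcher + "}", promql)

    print(inject_matcher('up{job="node"}', "tenant", "team-a"))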
SV companies/bloggers/press/etc are perpetually bad at benchmarks. For browsers they kept pushing simplistic javascript-centric benchmarks even when it was clear for at least 15 years that layout/paint/network/etc were the dominant bottlenecks in real-world usage.
It's primarily marketing-driven. I think the technical parts of companies need to attempt to own this more.
Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you did, and even if you had the right background, you could do benchmarks full time and they would still be a mess.
Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale.
I would say there is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.
Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:
- we completely fail at statistics: the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples", and when I do see people gesturing at proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit" (a toy contrast is sketched after this comment)
- the benchmarks are almost never predictive of the performance of real world workloads anyway
- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
AND this is a field where the economic incentives for accurate predictions are enormous.
In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.
Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
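A toy contrast, in the spirit of the first bullet above: instead of reporting only the delta of two means, bootstrap a confidence interval for that delta. The latency samples here are synthetic stand-ins:

    # Bootstrap a confidence interval for the difference in means of two
    # benchmark samples, rather than reporting the raw delta alone.
    import numpy as np

    rng = np.random.default_rng(42)
    baseline = rng.lognormal(mean=3.0, sigma=0.4, size=200)    # fake latencies
    candidate = rng.lognormal(mean=2.95, sigma=0.4, size=200)

    def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                        - rng.choice(b, size=b.size, replace=True).mean())
        return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

    low, high = bootstrap_diff_ci(baseline, candidate)
    print(f"mean delta: {baseline.mean() - candidate.mean():.2f}")
    print(f"95% bootstrap CI for the delta: [{low:.2f}, {high:.2f}]")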
Even a p-value is insufficient. Maybe we can use some of this stuff: https://web.stanford.edu/~swager/causal_inf_book.pdf
I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.
It's not like there's a shortage of skills in this area, it seems like our one specific industry just has a weird blindspot.
A/B testing is radioactive too. It's indirectly optimizing for user feedback - less stupid than directly optimizing for user feedback, but still quite dangerous.
Human raters are exploitable, and you never know whether B has a genuine performance advantage over A or just found a meat exploit by accident.
It's what fucked OpenAI over with 4o, and fucked over many other labs in more subtle ways.
Are you talking about just preferences or A/B tests on like retention and engagement? The latter I think is pretty reliable and powerful though I have never personally done them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for like correctness, you’re not really measuring correctness you’re measuring e.g. persuasion. A lot of construct validity challenges (which themselves are hard to even measure in domain).
Yes. All of them are poisoned metrics, just in different ways.
GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.
Are those desirable traits though? Doubt it. They look like simple tricks and reek of reward hacking - and A/B testing rewards them indeed. Direct optimization is even worse. Combining the two is ruinous.
Mind, I'm not saying that those metrics are useless. Radioactive materials aren't useless. You just got to keep their unpleasant properties in mind at all times - or suffer the consequences.
Has your lab tried using any of the newer causal inference–style evaluation methods? Things like interventional or counterfactual benchmarking, or causal graphs to tease apart real reasoning gains from data or scale effects. Wondering if that’s something you’ve looked into yet, or if it’s still too experimental for practical benchmarking work.
I'd rather quit than be forced to beta test idiocracy. What's your company so we can all avoid it?
Ratings on LMArena are too easily gamed.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps from this generation of LLMs stems from reckless tuning based on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.
It's biased toward small-context performance, which is why I don't pay much attention to it as a developer aside from a quick glance. I need performance at 40-100k tokens, which models like DeepSeek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
Psychometric testing of humans has a lot of difficulties, too. It's hard to measure some things.
Don’t get high on your own supply.
When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else seriously, no matter how impressive.
AI (and humans!) aside, the question of whether there could be an oracle that can "answer all questions" is a solved problem: such a thing cannot exist.
But this is going already too deep IMO.
When people start talking about percentages or benchmark scores, there has to be some denominator.
And there can be no such bias-free denominator for:
- trivia questions
- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)
- historical or political questions
I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this, I'm aware there are plenty already. Maybe AI will be capable to be a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.
Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word).
Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...
There the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kind of AI they are examining and the ones I know exist.
Sure it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit. Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?
I'm just a complete layman and commenting about this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.
I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snake-oil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".
I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.
Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.
Because I think there are impressive results, it's just becoming very hard to see through the bullshit as an average person.
I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].
[1] https://arxiv.org/pdf/2507.20208
[2] https://www.mattmahoney.net/dc/text.html
[3] https://arxiv.org/abs/2410.21352
I'm already quite put off by the title (it's science -- if you have a better benchmark, publish it!), but the contents aren't great either. It keeps citing numbers about "445 LLM benchmarks" without confirming whether any of the ones they deem insufficiently statistical are used by any of the major players. I've seen a lot of benchmarks, but maybe 20 are used regularly by large labs, max.
For a math-based critique, this seems to ignore a glaring problem: is it even possible to randomly sample all natural numbers? As another comment pointed out, we wouldn't even want to ("LLMs can't accurately multiply 6-digit numbers" isn't something anyone cares about or expected them to do in the first place), but regardless: this seems like a vacuous critique dressed up in a costume of mathematical rigor.
At least some of those who design benchmark tests are aware of these concerns.
In related news, at least some scientists studying climate change are aware that their methods are imperfect. More at 11!
If anyone doubts my concerns and thinks this article is in good faith, just check out this site's "AI+ML" section: https://www.theregister.com/software/ai_ml/
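For what it's worth, the standard argument behind that rhetorical question -- there is no uniform distribution on the natural numbers, so any "random large number" test has to commit to some (necessarily biased) distribution:

    % No uniform distribution exists on the natural numbers:
    % if $\Pr(N = n) = p$ for every $n \in \mathbb{N}$, countable additivity gives
    \[
      1 = \sum_{n=1}^{\infty} \Pr(N = n) = \sum_{n=1}^{\infty} p,
    \]
    % which is impossible: the right-hand side is $0$ if $p = 0$ and diverges if
    % $p > 0$. So a benchmark over "large numbers" must choose a distribution
    % (e.g. over digit lengths), and that choice is itself a modelling bias.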
The article references this review:
https://openreview.net/pdf?id=mdA5lVvNcU
And the review is pretty damning regarding statistical validity of LLM benchmarks.
I've been getting flagged by high-on-their-own-supply AI boosters for identifying that LLM benchmarks have been obvious bullshit for at least the last year and a half.
What changed to make "the inevitable AI bubble" the dominant narrative in last week or so?
Link those comments please because I checked your history and the flagged ones were pure nonsense with zero insights. Also, calling out LLM benchmarks has never been a radical take and basically the default on this site.
Companies are talking about needing trillions of dollars is why.
And the government backstops.
Nothing says confidence that AGI is imminent like needing the US government to prevent your investments from losing you money.
Benchmarks in general have this problem, across pretty much all industries. "When a measure becomes a target" and all that.
The market was down for AI-related stocks especially; while only down a bit over 3%, it’s the worst week since April, and there’s no single event to blame. It just looks like market sentiment has shifted away from the previous unchecked exuberance.