The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
While I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine readable music representations, which would arguably be crucial to strong performance in common practice music theory. I would also surmise that most music theory related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for them. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having miniscule economic value.
> In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.
LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensivey trained in domain-specific tasks, which can also parse and generate content. There is absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
You wish. I used to think that too. But it turns out, nowadays, every single exam in person is done with a phone hidden somewhere, with various efficiency, and you can't really strip students before they enter the room.
Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.
You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.
We really have no idea how much AI is ruining education right now.
Unlike the hard problem of "making an exam difficult to take when you have access to an LLM", "making sure students don't have devices on them when they take one" is very tractable, even if teachers are going to need some time to catch up with the curve.
Any of the following could work, though the specific tradeoffs & implementation details do vary:
- have <n> teachers walking around the room to watch for cheaters
- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles
- record from above and use AI to flag potential cheaters for manual review
- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)
- build dedicated examination rooms lined with metal mesh to disrupt cell reception
So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.
Or you simply account for it and provide equally challenging tasks adjusted for the tools of the time. Give them access to the best LLMs money can buy.
After all, they will grow up next to these things. They will do the homework today, by the time they graduate the LLM will take their job. There might be human large langage model managers for a while, soon to be replaced by the age of idea men.
Making in-person tests the only thing that counts toward your grade seems to be a step in the right direction. If students use AI to do their homework, it will only hurt them in the long run.
You just (lol) need to give non-standard problems and demand students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.
You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).
The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.
(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)
"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.
If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.
But any trick prompt like this is going to start giving expected results once it gets well-known enough.
Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
> Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.
Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.
One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:
> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20
"But any trick prompt like this is going to start giving expected results once it gets well-known enough."
Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or if it just got enough samples of trick questions that it learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?
I have noticed in my playing with image AIs that they do seem more prone to getting dragged into local maxima when a human would know the prompt than the LLMs. Perhaps it's all the additional data in an image that reveals it.
Part of the proof is knowing your students and forcing an answer that will rat out whether they used an LLM. There is no universal question and it requires personal knowledge of each student. You're looking for something that doesn't exist.
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages;
1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]
2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]
3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]
4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1) 2) and 3) were out by 1,1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made my skeptical about the state of the art of AI. I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
Idk I don’t think of it as fake - it’s creative fiction paired with sometimes highly skilled performance. I’ve learned a lot about how magic tricks work and I still love seeing performers do effects because it takes so much talent to, say, hold and hide 10 coins in your hands while showing them as empty or to shuffle a deck of cards 5x and have the audience cut it only to pull 4 aces off the top.
I think the problem-solving / want-to-be-engineer side of my brain lights up in that "how did he do that??" way. To me that's the fun of it... I immediately try to engineer my own solutions to what I just saw happen. So I guess I'm the first camp, but find trying to figure out the trick hugely interesting.
I love magic, and illusions in general. I know that Disney's Haunted Mansion doesn't actually have ghosts. But it looks pretty convincing, and watching the documentaries about how they made it is pretty mind-blowing especially considering that they built the original long before I was born.
I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.
Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.
I think magic is extremely interesting (particularly close-up magic), but I also hate the mindset (which seems to be common though not ubiquitous) that stigmatizes any curiosity in how the trick works.
In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.
> that stigmatizes any curiosity in how the trick works.
As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.
Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night and day difference. Typically, both performances will fool and entertain audiences but the master's performance can have an intensely more powerful impact. Like leaving most audience members in stunned shock vs just pleasantly surprised and fooled. While neither the master nor novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously. I've seen legendary close-up masters like a Dai Vernon or Albert Goshman when in their 80s and 90s perform sleight of hand with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.
Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.
> Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.
Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up maginc, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."
Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!
Since studying magic has been a lifelong passion since I was a kid, I clearly couldn't agree more. However, experience has shown that despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!
> but I still want to know precisely how.
Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century) who was in his late 80s at the time. I also had access the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.
Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.
Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.
It's still entertaining, that's true. I like magic tricks.
The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.
2.5 pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs in bullshit as evidenced by a weak model no one thinks is close to SOTA.
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
AFAIK they got 15% of unseen queries everyday, so it might be not very simple to design an effective cache layer on that. Semantic-aware clustering of natural language queries and projecting them into a cache-able low rank dimension is a non-trivial problem. Of course, LLM can effectively solve that, but then what's the point of using cache when you need LLM for clustering queries...
Not a search engineer, but wouldn’t a cache lookup to a previous LLM result be faster than a conventional free text search over the indexed websites? Seems like this could save money whilst delivering better results?
Yes, that's what Google's doing for AI overview IIUC. From what I've seen from my experiences, this is working okay and improving over time but not close to perfection. The results are stale for developing stories, some bad results are kept there for a long time, effectively same queries are returning different caches etc etc...
Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.
I have a strong suspicion that for all the low threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not i get something shaked out of the sleeve of a bottom barrel model.
Its most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesent missunderstands the question, it tries to squeeze the input to something close enough?
I just asked my company-approved AI chatbot the same question.
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowlesge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if might not matter in the context of filling an air plane cabin with golf balls, it does not inspire trust for more serious questions.
> It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.
This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.
No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 sqare miles country seriously?
As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.
More than even filling the gaps in knowledge / skills, would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not a LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of LLM with lots of bells and whistles, just like AlphaFold is.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be honing in on both the issue and the actual value of LLM:s. I think there's a lot of value in a "language calculator" but if it's continuously being sold as something it's not we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
True. There’s a small bonus that trying to explain the issue to the llm may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the llm can give erroneous output that still might trigger some thinking on a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
I was working some time ago on image processing model using GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but requires a lot extra efforts to make it work. Unstable and falls apart (blows up to unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing models' architectures and sizes. Adjusting some coefficients through the training to gradually rebalance loss functions' influence.
The same may work with you problem. If it's unstable try introduce extra 'brakes' which theoretically are not required. May be even incorrect. Whatever it is in your domain. Another thing to check is optimizer, try several. Check default parameters. I've heard Adams defaults lead to instability later in training.
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. Like the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameter.
The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.
And yes the fact I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but thats qualitatively different from having Putnam prize winning reasoning ability
One thing I forgot. Your solution may never converge. Like in my case with GAN after training models start wobbling around some point trying to outsmart each other. Then they _always_ explode. So, I was saving them periodically and took the best intermediate weights.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
It's hard to say, super subjective. It's just wrong more often and sometimes it goes off in tangents wrt. what I asked. Also I might ask a question and it starts coding an entire React project. Every once in a while it will literally max out its response tokens because it can't stop writing code.
I see. I have a similar feeling; as if they made it to quickly force you to pay (quickly maxing out one conversation in my case). I'm quite cynical and paranoid in this regard and I try hard not to be ruled by those two... but I can't shake the feeling that they're right this time.
I doubt this is because his explanation is better. I tried to ask question of Calculus I, ChatGPT just repeated content from textbooks. It is useful, but people should remind that where the limitation is.
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of answer being slipped to them.
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms suppose I say: there is only room for one person or one animal in my car to go home, one can suppose that it is referring to additional room besides that occupied by the driver. There is a problem when we try to use LLM trained in common use of language to solve puzzle in formal logic or math. I think the current LLMs are not able to have a specialized context to become a logical reasoning agent, but perhaps such thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.
Imagine that you are making problem solving AI. You have large budget, and access to compute and web crawling infra to run your AI "on internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you have maybe an idea how one could do that?
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning than as breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
Yeah it is lol. You don't just train your model on whatever you like when you're expected to serve it. They're are a host of problems with doing that. The idea that they trained on this obscure benchmark released about the day of is actually very silly.
New models suddenly doing much better isn't really surprising, especially for this sort of test: going from 98% accuracy to 99% accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.
And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
They are trained on some mix with minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just content will cost millions of $$ to generate. Distillation on new level looks like logical next step.
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
OP is probably referring to this referred to this paper here https://arxiv.org/pdf/2503.21934v1. The paper explains out how a rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped about).
OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all discussion of bar exam questions from memory on test taking forums etc., that wouldnn't exactly match, made it in.
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as a AI assistant, versus the actual algorithm building the play-story which humans imagine the character from. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization then i don't know what to tell you except that you're wrong. It's not possible so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it ? Have you actually looked at any of these games ? Those LLMs aren't playing like poor reasoners. They're playing like machines who have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseum. You want them to learn the rules of a complex game ? That's how you do it. By training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it. The top comment we are both replying to had gemini-2.5-pro which released less than 5 days later hit 25% on the benchmark. Now that was particularly funny.
> Humans have weird failure modes that are odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it.
Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.
If that's the peak of "AI" evolution today, I am not impressed.
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how
- Training LLMs on code makes them solve reasoning problems better
- Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone and so on.
Probably because well gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better.
https://dynomight.net/more-chess/
Very hard for me to wrap my head around the idea that an LLM being able to discuss, even perhaps teach high level chess strategy wouldn't transfer at all to its playing performance
That might be overstating it, at least if you mean it to be some unreplicable feat.
Small models have been trained that play around 1200 to 1300 on the eleuther discord. And there's this grandmaster level transformer - https://arxiv.org/html/2402.04494v1
Open AI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess then how you get them to learn it - by trying to predict chess moves. Chess books during training only teach them how to converse about chess.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue seeing the impact even a little more examples and/or regurgitation made.
https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task specific training. Any human good at chess or anything enough to make it a profession needs it. So I have no idea why people would expect any less for a Machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah so ? How much business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too, it's just a matter of how much open AI care about it.
Chess is a very simple game, and having basic general reasoning skills is more than enough to learn how to play it. It's not some advanced mathematics or complicated human interaction - it's a game with 30 or so fixed rules. And chess manuals have numerous examples of actual chess games, it's not like they are pure text talking about the game.
So, the fact that LLMs can't learn this sample game despite probably including all of the books ever written on it in their training set tells us something about their general reasoning skills.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
Reasoning training causes some about of catastrophic forgetting, so unlikely they burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems broadly cared about.
Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.
Bold claim! Let's see what that 25% is. I guarantee it is the portion of the exam which is trivially answerable if you have a stored database of all previous math exams ever written to consult.
There is 0% of the exam which is trivially answerable.
The entire point of USAMO problems is that they demand novel insight and rigorous, original proofs. They are intentionally designed not to be variations of things you can just look up. You have to reason your way through, step by logical step.
Getting 25% (~11 points) is exceptionally difficult. That often means fully solving one problem and maybe getting solid partial credit on another. The median score is often in the single digits.
> There is 0% of the exam which is trivially answerable.
That's true, but of course, not what I claimed.
The claim is that, given the ability to memorize an every mathematical result that has ever been published (in print or online), it is not so difficult to get 25% correct on an exam by pattern matching.
Note that this is skill is, by definition, completely out of the reach of any human being, but that possessing it does not imply creativity or the ability to "think".
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that the LLM’s fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
Reinforcement learning, maximise rewards? They work because rabbits like carrots. What does an LLM want? Haven't we already committed the fundamental error when we're saying we're using reinforcement learning and they want rewards?
That sound reasonable to me, but the those companies forget that there's different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
I think it's your responsibility to control the LLM. Sometimes, I worry that I'm beginning to code myself into a corner, and I ask if this is the dumbest idea it's ever heard and it says there might be a better way to do it. Sometimes I'm totally sceptical and ask that question first thing. (Usually it hallucinates when I'm being really obtuse though, and in a bad case that's the first time I notice it.)
> I think it's your responsibility to control the LLM.
Yes. The issue here is control and NLP is a poor interface to exercise control over the computer. Code on the other hand is a great way. That is the whole point of skepticism around LLM in software development.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tend to be either much shorter or a prelude to a flamewar.
This rings true. What I notice is that the longer i let Claude work on some code for instance, the more bullshit it invents. I usually can delete about 50-60% of the code & tests it came up with.
And when you ask it to 'just write a test' 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabit hole of it's own halliciations.
Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.
The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off if they'd been answered as "no" or "yes, with caveats" with predictable results.
In fact, this might be why so many business executives are enamored with LLMS/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
But I would nevertheless like to submit, based off of internal
benchmarks, and my own and colleagues' perceptions using these models,
that whatever gains these companies are reporting to the public, they
are not reflective of economic usefulness or generality.
and then couple of lines down from the above statement, we have this:
So maybe there's no mystery: The AI lab companies are lying, and when
they improve benchmark results it's because they have seen the answers
before and are writing them down.
[this went way outside the edit-window and hence a separate comment]
imho, state of varying experience with llm's can aptly summed in this poem by Mr. Longfellow
There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very good indeed,
But when she was bad she was horrid.
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help understand how one's own experience are relatable or not. Moreover, if I have X experience it could help to know if it is because of me doing sth wrong that others have figured out.
Furthermore, as we are talking about actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks to figure it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people do something wrong if they do not see the same impact as others. Everytime a web developer says that they do not understand how others may be so skeptical of LLMs, conclude with certainty that they must be doing sth wrong and move on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get now. The point imo is try to understand why some anecdotes are contrasting each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
Agreed 100%. When insufficient data exists, you have to fall back to other sources like analogies, personal observations, secondhand knowledge, etc. However, I’ve seen too many instances of people claiming their own limited experience is the truth when overwhelming and easily attainable evidence and data exists that proves it to be false.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the slope-gradient algorithms are at getting the local maxima we want it to get to. Until a generative model shows actual progress of "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
If you're gonna formulate this conversation as a satisfiability problem you should be aware that this is an NP-complete problem (and actually working on that problem is the source of the insight that there is such as thing as NP-completeness).
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold in IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC AGI questions and train it on that.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
>"This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven."
People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.
It's a tool and like all tools, it's sensitive to how you use it, and it's better for some purposes than others.
Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.
That said, the hype around LLMs certainly overstates their capabilities.
So this is interesting because it's anecdotal (I presume you're a high-token user who believes it's revolutionary), but it's actually a measurable, falsifiable hypothesis in principle.
I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.
I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.
Token usage meaning via agentic processes. Essentially every gripe about LLMs over the last few years (hallucinations, lack of real time data, etc.) was a result of single shot prompting directly to models. No one is seriously doing that for anything at this point anymore. Yes, you spend ten times more on a task, and it takes much longer. But your results are meaningful and useful at the end, and you can actually begin to engineer systems on top of that now.
Unlike many - I find author's complaints on the dot.
Once all the AI batch startups have sold subscriptions to the cohort and there's no more further market growth because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything rather is a clever imitation machine on the content it has seen, the AI bubble will burst when more statups would start packing up by end of 2026 or max 2027.
I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.
Yea, I've experienced this too with 3.7. Not always though. It has been helpful for me more often than not helpful. But yea 3.5 "felt" better to me.
Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.
It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.
Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
> Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
This would be so useful. I have thought about this missing piece a lot.
Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.
This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully with have a popular blog/site on their hands.
I finally gave up on 3.7 in Cursor after three rounds of it completely ignoring what I asked it for so that it could instead solve an irrelevant linter error. The error in no way affected functionality.
Despite me rejecting the changes and explicitly telling it to ignore the linter it kept insisting on only trying to solve for that
I don't want to drastically change my current code, nor do I like being told to create several new files and numerous functions/classes to solve this problem. I want you to think clearly and be focused on the task and don't get wild! I want the most straightforward approach which is elegant, intuitive, and rock solid.
As a caveat, I told it to make minimal code for one task and it completely skipped a super important aspect of it, justifying it by saying that I said "minimal".
I thought the value proposition of using LLMs to code is the lesser cognitive load of just describing what you want in natural language. But if it turns out writing the prompt is so involved, you end up trading snippets on forums and you often run into undefined behavior (the thing you described turned out to be ambiguous to the LLM and it gave you something you did not expect at all)...
I have to wonder, wouldn't just writing the code be more productive in the end?
Yes: if you are an expert in the area. In this case I needed something fairly specific I am far from an expert in. I know both Elixir and Rust quite well but couldn't quickly figure out how to be able to wrap a Rust object in just the right container(s) data type(s) so it can be safely accessed from any OS thread even though the object at hand is `Send` but not `Sync`. And I wanted it done without a mutex.
No: because most programming languages are just verbose. Many times I know _exactly_ what I will write 10 minutes later but I still have to type it out. If I can describe it to an LLM well enough then part of that time is saved.
Mind you, I am usually an LLM hater. They are over-glorified, they don't "reason" and they don't "understand" -- it baffles me to this day that an audience seemingly as educated as HN believes in that snake oil.
That being said, they are still a useful tool and as good engineers it's on us to recognize a tool's utility and its strong and weak usages and adapt our workflows to that. I believe me and many others do just that.
The rest... believe in forest nymphs.
So yeah. I agree that a significant part of the time it's just quicker to type it out. But people like myself are good at articulating their needs so with us it's often a coin toss. I choose to type the code out myself more often than not because (1) I don't want to pay for any LLM yet and (2) I don't want to forget my craft which I love to this day and never did it just for the money.
Difficult for me not to hate LLMs when there are literal hundreds of billions at stake and people are lying through their teeth for money, as they always do.
Which does lead to all the weird discourse around them indeed.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
It is like how I am not impressed by the models when it comes to progress with chemistry knowledge.
Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.
For the model to be useful at all, I would have to learn basic chemistry myself.
Many though I suspect are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models response in the same way I am not impressed with chemistry responses.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!
That's fair. But look up the recent experiment on SOTA models on the then just released USAMO 2025 questions. Highest score was 5%, supposedly SOTA last year was IMO silver level. There could be some methodological differences - ie USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more of user experience. In absolute utility, it is likely git is the single most game changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear if you have masses of NextJS code, which disproportionately is older versions, but none labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) and don't require much principled understanding (like understanding how NextJS 15 would break so and so's auth). Another example of this is when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be (despite the existence of a list of V8 flags online) that many flags had very close representations in vector embeddings, and that there was almost close to zero data/detailed examples on their use.
2) In the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) have become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA when was at least a silver IMO. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
The whole MCP hype really shows how much of AI is bullshit. These LLMs have consumed more API documentation than possible for a single human and still need software engineers to write glue layers so they can use the APIs.
The problem is that up until _very_ recently, it's been possible to get LLMs to generate interesting and exciting results (as a result of all the API documentation and codebases they've inhaled), but it's been very hard to make that usable. I think we need to be able to control the output format of the LLMs in a better way before we can work on what's in the output. I don't konw if MCP is the actual solution to that, but it's certainly an attempt at it...
That's reasonable along with your comment below too, but when you have the ceo of anthropic saying "AI will write all code for software engineers within a year" last month I would say that is pretty hard to believe given how it performs without user intervention (MCP etc...). It feels like bullshit just like the self driving car stuff did ~10 years ago.
I completely agree with you there. I think we're a generation away from these tools being usable with light supervision in the way I _want_ to use them, and I think the gap between now and that is about 10x smaller than the gap between that and autonomous agents.
Because it’s lossy compression. I also consumed a lot of books and even more movies and I don’t have good memory of it all. But some core facts and intuition from it.
AI is far better at regurgitating facts than me even if it's lossy compression but if someone gives me an api doc I can figure out how to use it without them writing a wrapper library around the parts that I need to use to solve whatever problem I'm working on.
> but if someone gives me an api doc I can figure out how to use it without them writing a wrapper library around the parts that I need to use to solve whatever problem I'm working on.
I think this is where AI is faling short hugely. AI _should_ be able to integrate with IDEs and tooling (e.g. LSP, Treesitter, Editorconfig) to make sure that it's contextually doing the right thin.
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification.
I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but on the yet unknown cost of stability. I'm thinking about all simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" that should be the ones prioritizing correctness are so business-brainwashed themselves that they like wise slop up this nonsense at the expense of sacrificing their own concern for the only principles that even made the software business remotely close to being worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis or sporadically enough that the people will just pay the money rather than a maze of calls, website navigation and time suck to recover funds due to them or that shouldn’t have been taken in the first place.
I’m hopeful well commoditized AI will give consumers a fighting chance at this and other types of disenfranchisement that seems to be increasingly normalized by companies that have consultants that do nothing but optimize for their own financial position.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (at least much better than) multiple real world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules)
Even Sonnet 3.7 was able to do refactoring work on my codebase sonnet 3.6 could not.
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right then quickly default to just greping randomly what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 Provided it exactly enough for me to intercept the request and using the response it provided to get what I wanted to show up.
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
awk '/^=> \[Function #/ {
if (out) close(out);
fn = $0; sub(/^.*#/, "", fn); sub(/ .*/, "", fn);
out = "function_" fn ".txt"
}
{ if (out) print > out }' bundle.hasm
Quick example of the output it gave and it's process.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate without predictably walking me through a bunch of obvious wrong turns as I repeatedly correct it as I would have to have done with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: wrote just the right amount of okay quality code but its strength was it really tried to adhere to your instructions. sonnet-3.7 was the exact opposite, wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but still softly opinionated, massive (usable) context window, fast response, more biased towards latest best practices and a up to date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some anciliary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
I agree, about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen them both first hand interacting with llms.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing llms have achieved.
In my experience trying to get an llm to perform a task as vast and open ended as the one the author describes is fundamentally misguided. The llms were not trained for that and won't be able to do it in a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools where one could start working on training a model that can.
Contrast that to 5-6 years ago, when all you could hope for this kind of thing was simple rule based and pattern matching systems.
My lived experience is that unless there's some new breakthrough's AI is more akin to a drill to replace a hammer than a tractor to replace the plow or a printing press.
Maybe any AI experts can elaborate on this but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.
LLM's are amazing but we might need something more or some new paradigm to push us towards true AGI.
Sounds like someone drank their own Kool aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in softwares. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless.
This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools.
So, click search next time you have an information retrieval question.
https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There’s a simpler explanation than that’s that the model weights aren’t an information retrieval system and other sequences of tokens are more likely given the totality of training data. This is why for an information retrieval task you use an information retrieval tool similarly to how for driving nails you use a hammer rather than a screw driver. It may very well be you could drive the nail with the screw driver, but why?
You think that's a simpler explanation? Ok. I think given the amount of effort that goes into "safety" on these systems that my explanation is vastly more likely than somehow this information got lost in the vector soup despite being attached to his name at the top of every search result[0].
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts behaviors of models, but treating them as information retrieval systems is using a screw driver to drive nails. Your example didn’t refute this.
Though a local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant):
Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
Does the as yet unwritten prequel of Idiocracy tell the tale of when we started asking Ai chat bots for facts and this was the point of no return for humanity?
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don’t know about Idiocracy, but something that I have read specific warnings about is that people will often blame the user for any of the tool’s misgivings.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "Its the children who are wrong" energy.
How else does an LLM distinguish what is widely known, given there are no statistics collected on the general populations awareness of any given celebrities vices? Robo-apologetics in full force here.
This may have hit the nail on the head about the weaknesses of LLM's.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong also; and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals, and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, and not the compounded volume of derived, "mediated", or "directed" facts - simply, because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But, Joanne Woodward and others who knew and worked with him should be weighted as being, at least, slightly more credible that others; no matter how many text patterns "catch the match" flow.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
Thats not really 'simple' for an LLM. This is a niche information about a specifc person, LLM's train on massive amount of data, the more a topic is being present in the data, the better will the answers be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
3-4 hours is enough time for It to have crawled the hacker news comments section. That's about the frequency the AI bots crawl my little out of the way blog.
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Paul Newman, the renowned actor and humanitarian, did not have a widely publicized
struggle with alcohol addiction throughout most of his life, but there were
specific instances that indicated challenges.
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.
It took three minutes for me to do the above from the time I created my API key to when I had an answer.
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me think why is this a skill that is useful to general public as opposed to another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. 1 He acknowledged his high consumption levels himself.
1.
Review: Paul Newman memoir stuns with brutal honesty - AP News
apnews.com
While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
I don't think this reflects a flaw in LLM's at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which formed as a question correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
good point, though a great LLM answer would DWIM here and explain both interpretations as you have done here, as opposed to (what presumably happened) focus on one and ignore the other.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model so precisely compressing the world and sophisticated decompression algorithms that it could be called reasoning.
Take two 4K frames of a falling vase, ask a model to predict the next token... I mean the following images. Your model now needs include some approximations of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using it differently because I've never been able to trust it with anything and more than half the time I fact check it even for information retrieval it's objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it to not make something up and would prefer an admittance of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (because your fact checking easily discerned that) it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
For them to work at all they need to have some representation of concepts. Recent research at anthropic has shown a surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don’t know if there is a pithy shirt phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots then maybe people will atart taking you seriously. Just childish reddit angst rn nothing else.
I hate to be the burden of proof guy, but in this case I'll say: the burden of proof is on you to prove that humans are stochastic parrots. For millenia, nobody thought to assert that the human brain was computational in nature, until people invented computers, and all of a sudden started asserting that many the human brain was just like a classical computer.
Of course, this turned out to be completely false, with advances in understanding of neural networks. Now, again with no evidence other than "we invented this thing that's, useful to us" people have been asserting that humans are just like this thing we invented. Why? What's the evidence? There never is any. It's high dorm room behavior. "What if we're all just machines, man???" And the argument is always that if I disagree with you when you assert this, then I am acting unscientifically and arguing for some kind of magic.
But there's no magic. The human brain just functions in a way different than the new shiny toys that humans have invented, in terms of ability to model an external world, in terms of the way emotions and sense experience are inseparable from our capacity to process information, in terms of consciousness. The hardware is entirely different, and we're functionally different.
The closest things to human minds are out there, and they've been out there for as long as we have: other animals. The real unscientific perspective is that to get high on your own supply and assert that some kind of fake, creepily ingratiating Spock we made up (who is far less charming than Leonard Nimony) is more like us than a chimp is.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all atempted ideas, all inventions and patents that were ever tried or conceived, and then we see how inventions are the same random permutations on ideas with Darwinian style survivorship as everything else; there were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
It's obvious that humans imitate concepts and don't come up with things de-novo from a blank slate of pure intelligence. So your claim hinges on LLMs parrotting the words they are trained on. But they don't do that, their training makes them abstract over concepts and remix them in new ways to output sentences they weren't trained on, e.g.:
Prompt: "Can you give me a URL with some novel components, please?"
An living parrot echoing "pieces of eight" cannot do this, it cannot say "pieces of <currency>" or "pieces of <valuable mineral>" even if asked to do that. The LLM training has abstracted some concept of what it means for a text pattern to be a URL and what it means for things to be "novel" and what it means to switch out the components of a URL but keep them individually valid. It can also give a reasonable answer asking for a new kind of protocol. So your position hinges on the word "stochastic" which is used as a slur to mean "the LLM isn't innovating like we do it's just a dice roll of remixing parts it was taught". But if you are arguing that makes it a "stochastic parrot" then you need to consider splitting the atom in its wider context...
> "We didn't "parrot" splitting the atom"
That's because we didn't "split the atom" in one blank-slate experiment with no surrounding context. Rutherford and team disintegrated the atom in 1914-1919 ish, they were building on the surrounding scientific work happening at that time: 1869 Johann Hittorf recognising that there was something coming in a straight line from or near the cathode of a Crookes vacuum tube, 1876 Eugen Goldstein proving they were coming from the cathode and naming them cathode rays (see: Cathode Ray Tube computer monitors), and 1897 J.J Thompson proving the rays are much lighter than the lightest known element and naming them Electrons, the first proof of sub-atomic particles existing. He proposed the model of the atom as a 'plum pudding' (concept parroting). Hey guess who JJ Thomspon was an academic advisor of? Ernest Rutherford! 1911 Rutherford discovery of the atomic nucleus. 1909 Rutherford demonstrated sub-atomic scattering and Millikan determined the charge on an electron. Eugen Goldstein also discovered the anode rays travelling the other way in the Crookes tube and that was picked up by Wilhelm Wien and it became Mass Spectrometry for identifying elements. In 1887 Heinrich Hertz was investigating the Photoelectric effect building on the work of Alexandre Becquerel, Johann Elster, Hans Geitel. Dalton's atomic theory of 1803.
Not to mention Rutherford's 1899 studies of radioactivity, following Henri Becquerel's work on Uranium, following Marie Curie's work on Radium and her suggestion of radioactivity being atoms breaking up, and Rutherford's student Frederick Soddy and his work on Radon, and Paul Villard's work on Gamma Ray emissions from Radon.
When Philipp Lenard was studying cathode rays in the 1890s he bought up all the supply of one phosphorescent material which meant Röntgen had to buy a different one to reproduce the results and bought one which responded to X-Rays as well, and that's how he discovered them - not by pure blank-sheet intelligence but by probability and randomness applied to an earlier concept.
That is, nobody taught humans to split the atom and then humans literally parotted the mechanism and did it, but you attempting to present splitting the atom as a thing which appeared out of nowhere and not remixing any existing concepts is, in your terms, absolute drivel. Literally a hundred years and more of scientists and engineers investigating the subatomic world and proposing that atoms could be split, and trying to work out what's in them by small varyations on the ideas and equipment and experiments seen before, you can just find names and names and names on Wikipedia of people working on this stuff and being inspired by others' work and remixing the concepts in it, and we all know the 'science progresses one death at a time' idea that individual people pick up what they learned and stick with it until they die, and new ideas and progress need new people to do variations on the ideas which exist.
No people didn't learn to build rockets from "seeing a volcano" but if you think there was no inspiration from fireworks, cannons, jellyfish squeezing water out to accelerate, no sudies of orbits from moons and planets, no chemistry experiments, no inspiration from thousands of years of flamethrowers: https://en.wikipedia.org/wiki/Flamethrower#History no seeing explosions moving large things, you're living in a dream
> fireworks, cannons, jellyfish squeezing water out to accelerate, no sudies of orbits from moons and planets, no chemistry experiments, no inspiration from thousands of years of flamethrowers
Fireworks, cannons, chemistry experiments and flamethrowers are all human inventions
And yes, exactly! We studied orbits of moons and planets. We studied animals like Jellyfish. We choose to observe the world, we extracted data, we experimented, we saw what worked, refined, improved, and succeeded
LLMs are not capable of observing anything. They can only regurgitate and remix the information they are fed by humans! By us, because we can observe
An LLM trained on 100% wrong information will always return wrong information for anything you ask it.
Say you train an LLM with the knowledge that fire can burn underwater. It "thinks" that the step by step instructions for building a fire is to pile wood and then pour water on the wood. It has no conflicting information in its model. It cannot go try to build a fire this way and observe that it is wrong. It is a parrot. It repeats the information that you give it. At best it can find some relationships between data points that humans haven't realized might
be related
A human could easily go attempt this, realize it doesn't work, and learn from the experience. Humans are not simply parrots. We are capable of exploring our surroundings and internalizing things without needing someone else to tell us how everything works
> That is, nobody taught humans to split the atom and then humans literally parotted the mechanism and did it, but you attempting to present splitting the atom as a thing which appeared out of nowhere and not remixing any existing concepts is, in your terms, absolute drivel
Building on the work of other humans is not parroting
You outlined the absolute genius of humanity building from first principles all the way to splitting the atom and you still think we're just parroting,
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize you do not have the ability to answer the question and declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong it simply regurgitates another incorrect answer ad nauseum (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm nope, now that the web if flooded by LLM generated content it's game over. I can't tell how many times I almost got fooled by recipes &co which seem legit at first but are utter non sense. And now we're feeding that garbage back to where it came from
Ironically though an LLM powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle)
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
I can’t reproduce. Maybe others reported the error and someone adjusted the expected answer, I do not know enough about OpenAI operations to say for sure.
The reason this bothers me is that comments like this reinforce the believes of people that could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that llm/chatbots aren’t good or not useful to them look at their chat log that a multitude of topics that I am sure would impact the result of the query.
Yes. It can be an UX problem. Yes. It can be an algorithmc problem.
But they are just tools that can be used wrong and not a perfect mechanical brain.
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAIs later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As other's have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Yes but a tool for what? When asked a question individuals that don't already have detailed knowledge of a topic are left with no way to tell if the AI generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Im able to get substantially more coding done than three months ago. This could be largely in the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
I've suggested (from my lowly layman vantage point) that an LLM has some idea of the fidelity of their response to a query — even if only in broad strokes like, "This answer is tracking with some high probability numbers" or "This answer has a very low correspondence with training data".
To that end the LLM could convey as much.
(Anecdotal, sorry:) I was using Claude (not paid) recently and noticed Claude hedging quite a bit when it had not before. Examples:
"Let me be careful about this response since we're discussing a very specific technical detail ..."
"Given how specific that technical detail is, I want to be transparent that while I aim to be accurate, I may hallucinate such precise historical specifications."
I confess my initial reaction was to ask ChatGPT since the answers are more self-assured, ha ha. So perhaps corporate AI are not likely to try and solve this problem of the LLM telling the user when it is on shaky ground. Bad for business.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2] or proponents of search and learning strategies in AI is Moore's Law, short term gains using these strategies will be miniscule. It is over at least a five year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give away to the steady plod of the tortoise.
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers, they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performative software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
The foundational idea of AI is that people think with words. But it is false. People think with things and use words only to point to things. Thinking with words is not possible. It will always be a case of shizophasia.
Not sure if its been fixed by now but a few weeks ago I was in the Golden Gate park and wondered if it was bigger than Central park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden gate park being the bigger size), it then went and said that Central Park was bigger. I was confused, so Googled and sure enough Golden gate park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is if you can break down your task into however many small chunks that are constrained by feasiility in time to process , chunk those up into appropriate buckets or even better, place them in-order as though you were doing those steps with your expertise - an extension of self. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
"
(a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter
(b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . .
(b[n]) Where is this likely to fail
(c) (Next step that a pentester/cybersecurity researcher would take)
"
This has been trial and error in my experience but it has worked great in domains such as financial trading and decision support where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of their selves.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade makes people think the LLMS are technology for them—they aren't. As soon as these things get good enough business owners aren't going to expect workers to use them to be more productive, they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
LLM's are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPU's, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
NVDA will crash when the AI bubble implodes, and none of those Generative AI companies are actually making money, nor will they. They have already hit limiting returns in LLM improvements after staggering investments and it is clear are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes,
> making money, nor will they
> They have already hit limiting returns in LLM improvements after staggering investments
> and it is clear are nowhere near general intelligence.
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point (no sign of that at openai which are not based on hopium) and are not engaged in fraud like bundling and selling to their own subsidiaries (nvda).
These are of course just opinions, I’m not sure we can know facts about such companies except in retrospect.
You’re on a startup forum complaining that vc backed startups don’t have a business model when the business model is the same as it has been for almost 15 years - be a unicorn in your space.
Than any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on on my neighbor's lemonade stand, with a hefty profit for the neighbor, who is getting paid by me $10 to add a handful of dust in each glass and sell it to the customers for $1. The neighbor accepts. It's a business model, at least until I don't run of existing funds or the last customer leaves in disguist. At which point exactly that silly idea stops being an unsustainable business model and becomes a silly idea? I guess at least as early as I see that the funds are running up, and I need to borrow larger an larger lumps of money each time to keep spinning the wheel...
Indeed it can. The difference between a business model and a viable business model is one word - viable.
If you asked me 18 years ago was "giving away a video game and selling cosmetics" a viable business model I would have laughed at you.If you asked me in 2019 I would probably give you money. If you asked me in 2025, I'd probably laugh at you again.
> and I need to borrow larger an larger lumps of money each time to keep spinning the wheel...
Or you figure out a way to to sell it to your neighbour for $0.50 and he can sell it on for $1.
The play is clear at every level - Nvidia Sell GPUs, OpenAI sell models, and SAAS sell prompts + UI's. Whether or not any of them are viable remains to be seen. Personally, I wouldn't take the bet.
I first used GPT 2.5 many years ago through a google colab notebook. These newer models aren’t all that much better. I’m sure they do better on their tests because they’re trained to beat them but really the biggest “innovation” was a UI and API.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies were to hyper optimise on public to make their share prices go up and bubble keep afloat ... What a unserious thought to maintain ...
There's the politics of the corporations and then there's the business of the science behind LLM's, this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdotal and don't really get to the meat of this question of "progress" and the future of LLM's
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better, definitely. Are we getting close to them performing unsupervised tasks, I don't think so.
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropics competitors do at this.,
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech-bros are going around telling each other! Are we implying that the problems in the world, the things that humans collectively work on to maintain the society that took us thousands of years to build up, just aren't hard enough to reach the limits of the AI.
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well.. they've been caught again and again red handed doing exactly this. Fool me once shame on you, fool me 100 times shame on me.
Hate to say this but the incentive is growth, not progress. Progress is what enabled the growth, but is also extremely hard to plan and deliver. On the other hand, hype is probably somewhat easier and well-tested approach so no surprise lot of the effort goes into marketing. Markets had repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written 1966, but the web version is a little newer. Does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I stepped out of its known range into the extrapolated.
That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation but only our own project's source code to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But, vibe coding always makes my pulse go to 105, from 65 and question my life choices. Since few instructions are rarely ever followed and loops never left once entered. Except for on the first try getting 80% of the structure kinda right, but then getting stuck for the whole workday.
Reading the comments, is it safe to say that LLMs are a digest of the internet which is some update over google search, but with the caveat that you need to double check the results? I mean they basically have some compressed version of almost all the written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over google. Is there a consensus on this?
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitatively leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs
Grok Thinking and Sonnet 3.7 were good to catch mildly complicated bugs and write functions with basic logic.
They still made mistake
But recently, Gemini 2.5 pro has been scary good. I liked to made fun of the feel-the-AGI crowd but for the first time a model made me raise an eyebrow
It can one shot unusual function with complicated logic and subtle edge cases
I honestly can’t notice any difference in outdoor quality between GPT 4o and GPT 4.5. I also can’t notice any difference in programming quality in cursor when using Claude 3.7 vs 3.5. I’m told there is a clear difference, but I don’t notice it.
> AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed.
Bingo, but I'd argue this is only scratching the surface of how twisted things are.
A lot of the stuff these labs put out (see: Amodei's cult-like blog ramblings) reeks of what I call "sociopathic utopianism" - essentially, the logical extreme of ends-justified means, made worse in the context of AI labs by the singularity pseudo-religion.
They all truly believe that AGI/ASI is possible, imminent, and could lead to utopia... so achieving that goal will surely outweigh any unsavory acts they commit in the pursuit of it.
This is why I think it's possible OpenAI took out a hit on Suchir; getting bogged down in a legal battle could delay the arrival of their machine god messiah. Same for defrauding benchmarks - they just need a "few more rounds" of investor cash, and by the time those run out, they'll surely have AGI on lock!
Fools. I look forward to them all getting prison time.
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.
Yes, here's the link: https://arxiv.org/abs/2503.21934v1
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
Half the researchers are at ETH Zurich (INSAIT is a partnership between EPFL, ETH and Sofia) - hardly an unreliable institution.
In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
While I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine readable music representations, which would arguably be crucial to strong performance in common practice music theory. I would also surmise that most music theory related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for them. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having miniscule economic value.
> In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.
LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensivey trained in domain-specific tasks, which can also parse and generate content. There is absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.
music theory is a really good test because in my experience the AI is extremely bad at it
Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
By giving pen and paper exams and telling your students that the only viable preparation strategy is doing the hw assignments themselves :)
You wish. I used to think that too. But it turns out, nowadays, every single exam in person is done with a phone hidden somewhere, with various efficiency, and you can't really strip students before they enter the room.
Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.
You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.
We really have no idea how much AI is ruining education right now.
Unlike the hard problem of "making an exam difficult to take when you have access to an LLM", "making sure students don't have devices on them when they take one" is very tractable, even if teachers are going to need some time to catch up with the curve.
Any of the following could work, though the specific tradeoffs & implementation details do vary:
- have <n> teachers walking around the room to watch for cheaters
- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles
- record from above and use AI to flag potential cheaters for manual review
- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)
- build dedicated examination rooms lined with metal mesh to disrupt cell reception
So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.
Cell jammers sound like they could be a security risk. In the context of highschool, it is generally very easy to see when someone is on their phone.
Infrared camera should do the trick.
Or you simply account for it and provide equally challenging tasks adjusted for the tools of the time. Give them access to the best LLMs money can buy.
After all, they will grow up next to these things. They will do the homework today, by the time they graduate the LLM will take their job. There might be human large langage model managers for a while, soon to be replaced by the age of idea men.
Making in-person tests the only thing that counts toward your grade seems to be a step in the right direction. If students use AI to do their homework, it will only hurt them in the long run.
You just (lol) need to give non-standard problems and demand students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
Can you provide sample questions that are "LLM proof" ?
The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.
You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).
The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.
(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)
"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.
If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.
But any trick prompt like this is going to start giving expected results once it gets well-known enough.
Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
> Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.
Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.
One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:
> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20
"But any trick prompt like this is going to start giving expected results once it gets well-known enough."
Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or if it just got enough samples of trick questions that it learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?
I have noticed in my playing with image AIs that they do seem more prone to getting dragged into local maxima when a human would know the prompt than the LLMs. Perhaps it's all the additional data in an image that reveals it.
Part of the proof is knowing your students and forcing an answer that will rat out whether they used an LLM. There is no universal question and it requires personal knowledge of each student. You're looking for something that doesn't exist.
It's not about being "LLM-proff", it's about teacher involvement in making up novel questions and grading attentively. There's no magic trick.
> This effectively makes LLMs useless for education.
No. You're only arguing LLMs are useless at regurgitating homework assignments to allow students to avoid doing it.
The point of education is not mindless doing homework.
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1) 2) and 3) were out by 1,1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made my skeptical about the state of the art of AI. I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
Idk I don’t think of it as fake - it’s creative fiction paired with sometimes highly skilled performance. I’ve learned a lot about how magic tricks work and I still love seeing performers do effects because it takes so much talent to, say, hold and hide 10 coins in your hands while showing them as empty or to shuffle a deck of cards 5x and have the audience cut it only to pull 4 aces off the top.
I think the problem-solving / want-to-be-engineer side of my brain lights up in that "how did he do that??" way. To me that's the fun of it... I immediately try to engineer my own solutions to what I just saw happen. So I guess I'm the first camp, but find trying to figure out the trick hugely interesting.
I love magic, and illusions in general. I know that Disney's Haunted Mansion doesn't actually have ghosts. But it looks pretty convincing, and watching the documentaries about how they made it is pretty mind-blowing especially considering that they built the original long before I was born.
I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.
Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.
I think magic is extremely interesting (particularly close-up magic), but I also hate the mindset (which seems to be common though not ubiquitous) that stigmatizes any curiosity in how the trick works.
In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.
> that stigmatizes any curiosity in how the trick works.
As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.
Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night and day difference. Typically, both performances will fool and entertain audiences but the master's performance can have an intensely more powerful impact. Like leaving most audience members in stunned shock vs just pleasantly surprised and fooled. While neither the master nor novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously. I've seen legendary close-up masters like a Dai Vernon or Albert Goshman when in their 80s and 90s perform sleight of hand with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.
Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.
> Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.
Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up maginc, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."
Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!
> Palming is awesome. Misdirection is awesome.
Since studying magic has been a lifelong passion since I was a kid, I clearly couldn't agree more. However, experience has shown that despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!
> but I still want to know precisely how.
Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century) who was in his late 80s at the time. I also had access the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.
Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.
Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.
It's still entertaining, that's true. I like magic tricks.
The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
2.5 pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs in bullshit as evidenced by a weak model no one thinks is close to SOTA.
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
"That's impossible because..."
"Good point! Blah blah blah..."
Absolutely shameless!
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
AFAIK they got 15% of unseen queries everyday, so it might be not very simple to design an effective cache layer on that. Semantic-aware clustering of natural language queries and projecting them into a cache-able low rank dimension is a non-trivial problem. Of course, LLM can effectively solve that, but then what's the point of using cache when you need LLM for clustering queries...
Not a search engineer, but wouldn’t a cache lookup to a previous LLM result be faster than a conventional free text search over the indexed websites? Seems like this could save money whilst delivering better results?
Yes, that's what Google's doing for AI overview IIUC. From what I've seen from my experiences, this is working okay and improving over time but not close to perfection. The results are stale for developing stories, some bad results are kept there for a long time, effectively same queries are returning different caches etc etc...
Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.
I have a strong suspicion that for all the low threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not i get something shaked out of the sleeve of a bottom barrel model.
Its most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesent missunderstands the question, it tries to squeeze the input to something close enough?
It'll get it right next time because they'll hoover up the parent post.
I just asked my company-approved AI chatbot the same question.
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowlesge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if might not matter in the context of filling an air plane cabin with golf balls, it does not inspire trust for more serious questions.
> It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.
This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.
No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 sqare miles country seriously?
I've seen humans make exactly these sorts of mistakes?
As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.
More than even filling the gaps in knowledge / skills, would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
A lot of humans are similarly good at some stuff and bad at other things.
Looking up the math ability of the average American this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
I had to look up these acronyms:
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not a LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of LLM with lots of bells and whistles, just like AlphaFold is.
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
AIME is so not IMO.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be honing in on both the issue and the actual value of LLM:s. I think there's a lot of value in a "language calculator" but if it's continuously being sold as something it's not we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
True. There’s a small bonus that trying to explain the issue to the llm may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the llm can give erroneous output that still might trigger some thinking on a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
I was working some time ago on image processing model using GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but requires a lot extra efforts to make it work. Unstable and falls apart (blows up to unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing models' architectures and sizes. Adjusting some coefficients through the training to gradually rebalance loss functions' influence.
The same may work with you problem. If it's unstable try introduce extra 'brakes' which theoretically are not required. May be even incorrect. Whatever it is in your domain. Another thing to check is optimizer, try several. Check default parameters. I've heard Adams defaults lead to instability later in training.
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. Like the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameter.
The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.
And yes the fact I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but thats qualitatively different from having Putnam prize winning reasoning ability
One thing I forgot. Your solution may never converge. Like in my case with GAN after training models start wobbling around some point trying to outsmart each other. Then they _always_ explode. So, I was saving them periodically and took the best intermediate weights.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
What do you find inferior in 3.7 compared to 3.5 btw? I only recently started using Claude so I don't have a point of reference.
It's hard to say, super subjective. It's just wrong more often and sometimes it goes off in tangents wrt. what I asked. Also I might ask a question and it starts coding an entire React project. Every once in a while it will literally max out its response tokens because it can't stop writing code.
Just feels less "stable" or "tight" overall.
I see. I have a similar feeling; as if they made it to quickly force you to pay (quickly maxing out one conversation in my case). I'm quite cynical and paranoid in this regard and I try hard not to be ruled by those two... but I can't shake the feeling that they're right this time.
I hear you but FWIW I don't think it's on purpose as it feels like an inferior product to me as a paid user
I doubt this is because his explanation is better. I tried to ask question of Calculus I, ChatGPT just repeated content from textbooks. It is useful, but people should remind that where the limitation is.
Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.
>I'm incredibly surprised no one mentions this
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of answer being slipped to them.
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
> within a week
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
Gemini 2.5 Pro got the farmer problem variation right: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms suppose I say: there is only room for one person or one animal in my car to go home, one can suppose that it is referring to additional room besides that occupied by the driver. There is a problem when we try to use LLM trained in common use of language to solve puzzle in formal logic or math. I think the current LLMs are not able to have a specialized context to become a logical reasoning agent, but perhaps such thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.
That can't be viewed without logging into Google first.
Imagine that you are making problem solving AI. You have large budget, and access to compute and web crawling infra to run your AI "on internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you have maybe an idea how one could do that?
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning than as breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
Google team said it was outside the training window fwiw
https://x.com/jack_w_rae/status/1907454713563426883
>one particular nonstandard eval
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
It's really not that implausible, they probably are adding stuff to the data-soup all the time and have a system in place for it.
Yeah it is lol. You don't just train your model on whatever you like when you're expected to serve it. They're are a host of problems with doing that. The idea that they trained on this obscure benchmark released about the day of is actually very silly.
They could have rlhfed or finetuned on user thumbs up responses, which could include users who took the test and asked it to explain problems after
New models suddenly doing much better isn't really surprising, especially for this sort of test: going from 98% accuracy to 99% accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.
And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
They are trained on some mix with minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just content will cost millions of $$ to generate. Distillation on new level looks like logical next step.
Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
Query: Could you explain the terminology to people who don't follow this that closely?
Not the OP but
USAMO : USA Math Olympiad. Referred here https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this referred to this paper here https://arxiv.org/pdf/2503.21934v1. The paper explains out how a rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped about).
OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all discussion of bar exam questions from memory on test taking forums etc., that wouldnn't exactly match, made it in.
That type of news might make investors worry / scared.
This seems fairly obvious at this point. If they were actually reasoning at all they'd be capable (even if not good) of complex games like chess
Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
Ick, OK, ACK.
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as a AI assistant, versus the actual algorithm building the play-story which humans imagine the character from. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
LLMs are capable of playing chess and 3.5 turbo instruct does so quite well (for a human) at 1800 ELO. Does this mean they can truly reason now ?
https://github.com/adamkarvonen/chess_gpt_eval
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization then i don't know what to tell you except that you're wrong. It's not possible so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it ? Have you actually looked at any of these games ? Those LLMs aren't playing like poor reasoners. They're playing like machines who have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseum. You want them to learn the rules of a complex game ? That's how you do it. By training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it. The top comment we are both replying to had gemini-2.5-pro which released less than 5 days later hit 25% on the benchmark. Now that was particularly funny.
> Humans have weird failure modes that are odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it.
Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.
If that's the peak of "AI" evolution today, I am not impressed.
> Why doesn't it?
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how
- Training LLMs on code makes them solve reasoning problems better - Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone and so on.
Probably because well gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/
It seems to be fairly nuanced in how abilities transfer: https://arxiv.org/html/2310.16937v2
Very hard for me to wrap my head around the idea that an LLM being able to discuss, even perhaps teach high level chess strategy wouldn't transfer at all to its playing performance
3.5 turbo instruct is a huge outlier.
https://dynomight.substack.com/p/chess
Discussion here: https://news.ycombinator.com/item?id=42138289
That might be overstating it, at least if you mean it to be some unreplicable feat. Small models have been trained that play around 1200 to 1300 on the eleuther discord. And there's this grandmaster level transformer - https://arxiv.org/html/2402.04494v1
Open AI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess then how you get them to learn it - by trying to predict chess moves. Chess books during training only teach them how to converse about chess.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue seeing the impact even a little more examples and/or regurgitation made. https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task specific training. Any human good at chess or anything enough to make it a profession needs it. So I have no idea why people would expect any less for a Machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah so ? How much business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too, it's just a matter of how much open AI care about it.
Chess is a very simple game, and having basic general reasoning skills is more than enough to learn how to play it. It's not some advanced mathematics or complicated human interaction - it's a game with 30 or so fixed rules. And chess manuals have numerous examples of actual chess games, it's not like they are pure text talking about the game.
So, the fact that LLMs can't learn this sample game despite probably including all of the books ever written on it in their training set tells us something about their general reasoning skills.
As in: they do not have general reasoning skills.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
Reasoning training causes some about of catastrophic forgetting, so unlikely they burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems broadly cared about.
Eek! You mean eke.
Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!
What would the average human score be?
I.e. if you randomly sampled N humans to take those tests.
The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
I am hesitant to correct a math Olympian, but don't you mean the median?
Average is fine.
Average, hmmm?
This is a disappointing answer from an MO alum. Pick a quantile, any quantile...
Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.
2.5 pro scores 25%.
It’s just a much harder math benchmark which will fall by the end of next year just like all the others. You won’t be vindicated.
Bold claim! Let's see what that 25% is. I guarantee it is the portion of the exam which is trivially answerable if you have a stored database of all previous math exams ever written to consult.
There is 0% of the exam which is trivially answerable.
The entire point of USAMO problems is that they demand novel insight and rigorous, original proofs. They are intentionally designed not to be variations of things you can just look up. You have to reason your way through, step by logical step.
Getting 25% (~11 points) is exceptionally difficult. That often means fully solving one problem and maybe getting solid partial credit on another. The median score is often in the single digits.
> There is 0% of the exam which is trivially answerable.
That's true, but of course, not what I claimed.
The claim is that, given the ability to memorize an every mathematical result that has ever been published (in print or online), it is not so difficult to get 25% correct on an exam by pattern matching.
Note that this is skill is, by definition, completely out of the reach of any human being, but that possessing it does not imply creativity or the ability to "think".
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that the LLM’s fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
> The real challenge is that the LLM’s fundamentally want to seem agreeable, and that’s not improving
LLMs fundamentally do not want to seem anything
But the companies that are training them and making models available for professional use sure want them to seem agreeable
> LLMs fundamentally do not want to seem anything
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
> a standard way of talking, not a claim about real agency.
A standard way of talking used by people who do also frequently claim real agency.
Reinforcement learning, maximise rewards? They work because rabbits like carrots. What does an LLM want? Haven't we already committed the fundamental error when we're saying we're using reinforcement learning and they want rewards?
That sound reasonable to me, but the those companies forget that there's different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
I think it's your responsibility to control the LLM. Sometimes, I worry that I'm beginning to code myself into a corner, and I ask if this is the dumbest idea it's ever heard and it says there might be a better way to do it. Sometimes I'm totally sceptical and ask that question first thing. (Usually it hallucinates when I'm being really obtuse though, and in a bad case that's the first time I notice it.)
> I think it's your responsibility to control the LLM.
Yes. The issue here is control and NLP is a poor interface to exercise control over the computer. Code on the other hand is a great way. That is the whole point of skepticism around LLM in software development.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tend to be either much shorter or a prelude to a flamewar.
This rings true. What I notice is that the longer i let Claude work on some code for instance, the more bullshit it invents. I usually can delete about 50-60% of the code & tests it came up with.
And when you ask it to 'just write a test' 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabit hole of it's own halliciations.
Or maybe I just suck at prompting hehe
> Or maybe I just suck at prompting hehe
Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.
The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off if they'd been answered as "no" or "yes, with caveats" with predictable results.
In fact, this might be why so many business executives are enamored with LLMS/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
and then couple of lines down from the above statement, we have this:[this went way outside the edit-window and hence a separate comment] imho, state of varying experience with llm's can aptly summed in this poem by Mr. Longfellow
"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help understand how one's own experience are relatable or not. Moreover, if I have X experience it could help to know if it is because of me doing sth wrong that others have figured out.
Furthermore, as we are talking about actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks to figure it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people do something wrong if they do not see the same impact as others. Everytime a web developer says that they do not understand how others may be so skeptical of LLMs, conclude with certainty that they must be doing sth wrong and move on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get now. The point imo is try to understand why some anecdotes are contrasting each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
Agreed 100%. When insufficient data exists, you have to fall back to other sources like analogies, personal observations, secondhand knowledge, etc. However, I’ve seen too many instances of people claiming their own limited experience is the truth when overwhelming and easily attainable evidence and data exists that proves it to be false.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
I'm not sure where you got all this from. Do you have any useful citations?
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the slope-gradient algorithms are at getting the local maxima we want it to get to. Until a generative model shows actual progress of "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
If you're gonna formulate this conversation as a satisfiability problem you should be aware that this is an NP-complete problem (and actually working on that problem is the source of the insight that there is such as thing as NP-completeness).
Invalid expression: value of type "probability distribution" cannot be cast to type "boolean".
A lot of people judge by the lack of their desired outcome. Calling that fear and denial is disingenuous and unfair.
That's actually a valid point. I stand corrected.
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold in IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC AGI questions and train it on that.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
> I'm really curious what features signal an ability to make "better choices" w/r/t AI
So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
>"This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven."
People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.
It's a tool and like all tools, it's sensitive to how you use it, and it's better for some purposes than others.
Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.
That said, the hype around LLMs certainly overstates their capabilities.
So this is interesting because it's anecdotal (I presume you're a high-token user who believes it's revolutionary), but it's actually a measurable, falsifiable hypothesis in principle.
I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.
I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.
Token usage meaning via agentic processes. Essentially every gripe about LLMs over the last few years (hallucinations, lack of real time data, etc.) was a result of single shot prompting directly to models. No one is seriously doing that for anything at this point anymore. Yes, you spend ten times more on a task, and it takes much longer. But your results are meaningful and useful at the end, and you can actually begin to engineer systems on top of that now.
Unlike many - I find author's complaints on the dot.
Once all the AI batch startups have sold subscriptions to the cohort and there's no more further market growth because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything rather is a clever imitation machine on the content it has seen, the AI bubble will burst when more statups would start packing up by end of 2026 or max 2027.
I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.
Yea, I've experienced this too with 3.7. Not always though. It has been helpful for me more often than not helpful. But yea 3.5 "felt" better to me.
Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.
It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.
Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
> Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
This would be so useful. I have thought about this missing piece a lot.
Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.
This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully with have a popular blog/site on their hands.
I finally gave up on 3.7 in Cursor after three rounds of it completely ignoring what I asked it for so that it could instead solve an irrelevant linter error. The error in no way affected functionality.
Despite me rejecting the changes and explicitly telling it to ignore the linter it kept insisting on only trying to solve for that
3.7 is like a wild horse. you really must ground it with clear instructions. it sucks that it doesn't automatically know that but it's tameable.
Could you share any successful prompting techniques for grounding 3.7, even just a project-specific example?
I use this:
As a caveat, I told it to make minimal code for one task and it completely skipped a super important aspect of it, justifying it by saying that I said "minimal".
Not cool, Claude 3.7, not cool.
Doesn't trading prompt patches trying to get around undefined behavior from the model make you wonder if this is a net positive?
Huh? I'm not even sure what you said, can you clarify?
I thought the value proposition of using LLMs to code is the lesser cognitive load of just describing what you want in natural language. But if it turns out writing the prompt is so involved, you end up trading snippets on forums and you often run into undefined behavior (the thing you described turned out to be ambiguous to the LLM and it gave you something you did not expect at all)...
I have to wonder, wouldn't just writing the code be more productive in the end?
Yes and no.
Yes: if you are an expert in the area. In this case I needed something fairly specific I am far from an expert in. I know both Elixir and Rust quite well but couldn't quickly figure out how to be able to wrap a Rust object in just the right container(s) data type(s) so it can be safely accessed from any OS thread even though the object at hand is `Send` but not `Sync`. And I wanted it done without a mutex.
No: because most programming languages are just verbose. Many times I know _exactly_ what I will write 10 minutes later but I still have to type it out. If I can describe it to an LLM well enough then part of that time is saved.
Mind you, I am usually an LLM hater. They are over-glorified, they don't "reason" and they don't "understand" -- it baffles me to this day that an audience seemingly as educated as HN believes in that snake oil.
That being said, they are still a useful tool and as good engineers it's on us to recognize a tool's utility and its strong and weak usages and adapt our workflows to that. I believe me and many others do just that.
The rest... believe in forest nymphs.
So yeah. I agree that a significant part of the time it's just quicker to type it out. But people like myself are good at articulating their needs so with us it's often a coin toss. I choose to type the code out myself more often than not because (1) I don't want to pay for any LLM yet and (2) I don't want to forget my craft which I love to this day and never did it just for the money.
Thanks for the perspective. I don't feel love or hate, I am just perplexed (haha) about the discourse around it sometimes.
Difficult for me not to hate LLMs when there are literal hundreds of billions at stake and people are lying through their teeth for money, as they always do.
Which does lead to all the weird discourse around them indeed.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
yes that would explain the effect I think. I'll try that out this week.
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
It is like how I am not impressed by the models when it comes to progress with chemistry knowledge.
Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.
For the model to be useful at all, I would have to learn basic chemistry myself.
Many though I suspect are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models response in the same way I am not impressed with chemistry responses.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!
That's fair. But look up the recent experiment on SOTA models on the then just released USAMO 2025 questions. Highest score was 5%, supposedly SOTA last year was IMO silver level. There could be some methodological differences - ie USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
> Highest score was 5%, supposedly SOTA last year was IMO silver level.
No LLM last year got silver. Deepmind had a highly specialized AI system earning that
In my view as well it's not really cheating, it's just over fitting.
If a model doesn't do good in the benchmarks it will either be retrained until it does or you won't hear about it.
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I've used gemini 2.5 this weekend with aider and it was frighteningly good.
It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
Even approximations must be right to be meaningful. If information is wrong, it's rubbish.
Presorting/labelling various data has value. Humans have done the real work there.
What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.
Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.
Yep, and what they are going in cursor either the agentic stuff is really game changing.
People who can’t recognize this intentionally have their heads in the sand
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more of user experience. In absolute utility, it is likely git is the single most game changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear if you have masses of NextJS code, which disproportionately is older versions, but none labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) and don't require much principled understanding (like understanding how NextJS 15 would break so and so's auth). Another example of this is when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be (despite the existence of a list of V8 flags online) that many flags had very close representations in vector embeddings, and that there was almost close to zero data/detailed examples on their use.
2) In the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) have become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA when was at least a silver IMO. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.
I guess you haven't been on /r/cursor or forum.cursor.com lately?
"game changing" isn't exactly the sentiment there the last couple months.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
The whole MCP hype really shows how much of AI is bullshit. These LLMs have consumed more API documentation than possible for a single human and still need software engineers to write glue layers so they can use the APIs.
I don't think I agree, entirely.
The problem is that up until _very_ recently, it's been possible to get LLMs to generate interesting and exciting results (as a result of all the API documentation and codebases they've inhaled), but it's been very hard to make that usable. I think we need to be able to control the output format of the LLMs in a better way before we can work on what's in the output. I don't konw if MCP is the actual solution to that, but it's certainly an attempt at it...
That's reasonable along with your comment below too, but when you have the ceo of anthropic saying "AI will write all code for software engineers within a year" last month I would say that is pretty hard to believe given how it performs without user intervention (MCP etc...). It feels like bullshit just like the self driving car stuff did ~10 years ago.
I completely agree with you there. I think we're a generation away from these tools being usable with light supervision in the way I _want_ to use them, and I think the gap between now and that is about 10x smaller than the gap between that and autonomous agents.
Because it’s lossy compression. I also consumed a lot of books and even more movies and I don’t have good memory of it all. But some core facts and intuition from it.
AI is far better at regurgitating facts than me even if it's lossy compression but if someone gives me an api doc I can figure out how to use it without them writing a wrapper library around the parts that I need to use to solve whatever problem I'm working on.
> but if someone gives me an api doc I can figure out how to use it without them writing a wrapper library around the parts that I need to use to solve whatever problem I'm working on.
I think this is where AI is faling short hugely. AI _should_ be able to integrate with IDEs and tooling (e.g. LSP, Treesitter, Editorconfig) to make sure that it's contextually doing the right thin.
But it's not.
LeCun criticized LLM technology recently in a presentation: https://www.youtube.com/watch?v=ETZfkkv6V7Y
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but on the yet unknown cost of stability. I'm thinking about all simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" that should be the ones prioritizing correctness are so business-brainwashed themselves that they like wise slop up this nonsense at the expense of sacrificing their own concern for the only principles that even made the software business remotely close to being worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
Have you compared it with 8-bit QwQ-17B?
In my evals 8 bit quantized smaller Qwen models were better, but again evaluating is hard.
There’s no QwQ 17B that I’m aware of. Do you have a HF link?
You're right, sorry...I just tested Qwen models, not QwQ, I see QwQ only has 32B.
No worries, QwQ is the thinking model from Qwen, it’s a common misconception.
I think they should’ve named it something else.
Yeah, I'm thinking of this from a Wardley map standpoint.
What innovation opens up when AI gets sufficiently commoditized?
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis or sporadically enough that the people will just pay the money rather than a maze of calls, website navigation and time suck to recover funds due to them or that shouldn’t have been taken in the first place.
I’m hopeful well commoditized AI will give consumers a fighting chance at this and other types of disenfranchisement that seems to be increasingly normalized by companies that have consultants that do nothing but optimize for their own financial position.
Brute force, brute force everything at least for the domains you can have automatic verification in.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (at least much better than) multiple real world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules)
Even Sonnet 3.7 was able to do refactoring work on my codebase sonnet 3.6 could not.
Really not seeing the "LLMs not improving" story
I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right then quickly default to just greping randomly what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 Provided it exactly enough for me to intercept the request and using the response it provided to get what I wanted to show up.
How did you fit 550MB of bytecode into the context window? Was this using 2.5 in an agentic framework? (i.e. repeated model calls and tool usage)
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
Quick example of the output it gave and it's process.https://i.imgur.com/Cmg4KK1.png
https://i.imgur.com/ApNxUkB.png
I’d say the average person wouldn’t understand that problem, let alone solve it.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate without predictably walking me through a bunch of obvious wrong turns as I repeatedly correct it as I would have to have done with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: wrote just the right amount of okay quality code but its strength was it really tried to adhere to your instructions. sonnet-3.7 was the exact opposite, wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but still softly opinionated, massive (usable) context window, fast response, more biased towards latest best practices and a up to date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some anciliary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
I agree, about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen them both first hand interacting with llms.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing llms have achieved.
In my experience trying to get an llm to perform a task as vast and open ended as the one the author describes is fundamentally misguided. The llms were not trained for that and won't be able to do it in a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools where one could start working on training a model that can.
Contrast that to 5-6 years ago, when all you could hope for this kind of thing was simple rule based and pattern matching systems.
My lived experience is that unless there's some new breakthrough's AI is more akin to a drill to replace a hammer than a tractor to replace the plow or a printing press.
Maybe any AI experts can elaborate on this but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.
LLM's are amazing but we might need something more or some new paradigm to push us towards true AGI.
Sounds like someone drank their own Kool aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in softwares. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
[1] https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless. This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools. So, click search next time you have an information retrieval question. https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
It’s not the position of any major AI company, curiously.
lotta words here to say AI can't do basic search right
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There’s a simpler explanation than that’s that the model weights aren’t an information retrieval system and other sequences of tokens are more likely given the totality of training data. This is why for an information retrieval task you use an information retrieval tool similarly to how for driving nails you use a hammer rather than a screw driver. It may very well be you could drive the nail with the screw driver, but why?
You think that's a simpler explanation? Ok. I think given the amount of effort that goes into "safety" on these systems that my explanation is vastly more likely than somehow this information got lost in the vector soup despite being attached to his name at the top of every search result[0].
0 https://www.google.com/search?q=did+paul+newman+have+a+drink...
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts behaviors of models, but treating them as information retrieval systems is using a screw driver to drive nails. Your example didn’t refute this.
Gemini (2.5 Pro):
"Yes, Paul Newman was widely known for being a heavy drinker, particularly of beer. He himself acknowledged his significant alcohol consumption."
The answer I got (https://gemini.google.com/share/9e327dc4be03) includes references such as https://apnews.com/article/entertainment-reviews-movies-paul... and https://www.psychologytoday.com/us/blog/the-playing-field/20... although they are redacted from the public-sharing link.
Though a local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant): Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
Perplexity:
>Paul Newman is indeed known for having struggled with alcohol during his life. Accounts from various sources, including his own memoir and the documentary ... (https://www.perplexity.ai/search/is-paul-newman-known-for-ha...)
I guess there's something about ChatGPT's set up that makes it different? Maybe they wanted it to avoid libeling people?
Grok:
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8e...
Does the as yet unwritten prequel of Idiocracy tell the tale of when we started asking Ai chat bots for facts and this was the point of no return for humanity?
The chatgpt'd tariffs marks this epoch perfectly.
It turns out there's huge demand for un-monetized web search.
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Google used to return the right answer to vague questions until it decided to return the most lucrative answer to vague questions instead.
Fortunately there is a lot of competition in the LLM space.
Edit: and, more importantly, plenty of people willing to pay a subscription for good quality.
Soon sadly, there will be a huge demand for un-monetized LLMs. Enshitification is coming.
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don’t know about Idiocracy, but something that I have read specific warnings about is that people will often blame the user for any of the tool’s misgivings.
Some prior works that work as prequels include C.M. Kornbluth's "The Marching Morons" and "The Little Black Bag."
"Is Paul Newman known for having had problems with alcohol?"
https://chatgpt.com/share/67f332e5-1548-8012-bd76-e18b3f8d52...
Your query indeed answers "...not widely known..."
"Did Paul Newman have problems with alcoholism?"
https://chatgpt.com/share/67f3329a-5118-8012-afd0-97cc4c9b72...
"Yes, Paul Newman was open about having struggled with alcoholism"
What's the issue? Perhaps Paul Newman isn't _famous_ ("known") for struggling with alcoholism. But he did struggle with alcoholism.
Your usage of "known for" isn't incorrect, but it's indeed slightly ambiguous.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "Its the children who are wrong" energy.
*https://en.wikipedia.org/wiki/Newman_Day
"Paul Newman alcohol" is just showing you results where those words are all present, it's not really implying how widely known it is.
What are you, an LLM? Look at the results of the first twenty hits and come back, then tell me that they don't speak to that specific issue.
Widely reported does not imply widely known.
How else does an LLM distinguish what is widely known, given there are no statistics collected on the general populations awareness of any given celebrities vices? Robo-apologetics in full force here.
This may have hit the nail on the head about the weaknesses of LLM's.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong also; and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals, and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, and not the compounded volume of derived, "mediated", or "directed" facts - simply, because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But, Joanne Woodward and others who knew and worked with him should be weighted as being, at least, slightly more credible that others; no matter how many text patterns "catch the match" flow.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
Look at how many people believe in extremist news outlets!
What is that term?
Thats not really 'simple' for an LLM. This is a niche information about a specifc person, LLM's train on massive amount of data, the more a topic is being present in the data, the better will the answers be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
The question might be simple and verifiable, but it is not a simple for an LLM to mark a particular question as such. This is the tricky part.
An LLM does not care about your question, it is a bunch of math that will spit out a result based on what you typed in.
I asked GPT-4.5 and it searched the web and immediately gave me a "yes" with paragraphs of sources cited.
Truth is a probability game. Just keep trying until you arrive.
"man puts prompt into LLM" > "LLM ships bad code" > "machine fails" > "person dies" > "fire man" > "man puts prompt into LLM"
Such accountability
3-4 hours is enough time for It to have crawled the hacker news comments section. That's about the frequency the AI bots crawl my little out of the way blog.
I just had Cursor Pro + Sonnet 3.7 Max one shot a python script to send this question to every model available through groq.
>Found 24 models: llama3-70b-8192, llama-3.2-3b-preview, meta-llama/llama-4-scout-17b-16e-instruct, allam-2-7b, llama-guard-3-8b, qwen-qwq-32b, llama-3.2-1b-preview, playai-tts-arabic, deepseek-r1-distill-llama-70b, llama-3.1-8b-instant, llama3-8b-8192, qwen-2.5-coder-32b, distil-whisper-large-v3-en, qwen-2.5-32b, llama-3.2-90b-vision-preview, deepseek-r1-distill-qwen-32b, whisper-large-v3, llama-3.3-70b-specdec, llama-3.3-70b-versatile, playai-tts, whisper-large-v3-turbo, llama-3.2-11b-vision-preview, mistral-saba-24b, gemma2-9b-it
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.It took three minutes for me to do the above from the time I created my API key to when I had an answer.
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me think why is this a skill that is useful to general public as opposed to another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
Gemini 2.5 Pro
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. 1 He acknowledged his high consumption levels himself. 1. Review: Paul Newman memoir stuns with brutal honesty - AP News
apnews.com
While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
I just asked 4o:
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
My own markov chains follow up "until he" with fundamentally distinct follow-ups.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
this seems less like an issue with accuracy and more like an issue with model providers making sure they don't get sued for libel
I thought about that too.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
I don't think this reflects a flaw in LLM's at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which formed as a question correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
good point, though a great LLM answer would DWIM here and explain both interpretations as you have done here, as opposed to (what presumably happened) focus on one and ignore the other.
And yet the New Yorker article was the first thing that my Google search for “Paul Newman alcoholism” returned.
Apparently it isn’t so specialized that a pretty obvious old fashioned web search on Google wouldn’t immediately return an authoritative source.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
> LLMs aren't good at being search engines, they're good at understanding things.
LLMs are literally fundamentally incapable of understanding things. They are stochastic parrots and you've been fooled.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model so precisely compressing the world and sophisticated decompression algorithms that it could be called reasoning.
Take two 4K frames of a falling vase, ask a model to predict the next token... I mean the following images. Your model now needs include some approximations of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
We're talking about a stochastic parrot which in many circumstances responds in a way which is indistinguishable from actual understanding.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using it differently because I've never been able to trust it with anything and more than half the time I fact check it even for information retrieval it's objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it to not make something up and would prefer an admittance of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (because your fact checking easily discerned that) it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
I think we mostly agree on this. Cheers.
Its ok to be paranoid
Fact checking is paranoia?
For them to work at all they need to have some representation of concepts. Recent research at anthropic has shown a surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
What do you call someone that mentions "stochastic parrots" every time LLMs are mentioned?
That makes me think, has anyone ever heard of an actual parrot which wasn't stochastic?
I'm fairly sure I've never seen a deterministic parrot which makes me think the term is tautological.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don’t know if there is a pithy shirt phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots then maybe people will atart taking you seriously. Just childish reddit angst rn nothing else.
I hate to be the burden of proof guy, but in this case I'll say: the burden of proof is on you to prove that humans are stochastic parrots. For millenia, nobody thought to assert that the human brain was computational in nature, until people invented computers, and all of a sudden started asserting that many the human brain was just like a classical computer.
Of course, this turned out to be completely false, with advances in understanding of neural networks. Now, again with no evidence other than "we invented this thing that's, useful to us" people have been asserting that humans are just like this thing we invented. Why? What's the evidence? There never is any. It's high dorm room behavior. "What if we're all just machines, man???" And the argument is always that if I disagree with you when you assert this, then I am acting unscientifically and arguing for some kind of magic.
But there's no magic. The human brain just functions in a way different than the new shiny toys that humans have invented, in terms of ability to model an external world, in terms of the way emotions and sense experience are inseparable from our capacity to process information, in terms of consciousness. The hardware is entirely different, and we're functionally different.
The closest things to human minds are out there, and they've been out there for as long as we have: other animals. The real unscientific perspective is that to get high on your own supply and assert that some kind of fake, creepily ingratiating Spock we made up (who is far less charming than Leonard Nimony) is more like us than a chimp is.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all atempted ideas, all inventions and patents that were ever tried or conceived, and then we see how inventions are the same random permutations on ideas with Darwinian style survivorship as everything else; there were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
You are just speaking absolute drivel
It's obvious that humans imitate concepts and don't come up with things de-novo from a blank slate of pure intelligence. So your claim hinges on LLMs parrotting the words they are trained on. But they don't do that, their training makes them abstract over concepts and remix them in new ways to output sentences they weren't trained on, e.g.:
Prompt: "Can you give me a URL with some novel components, please?"
DuckDuckGo LLM returns: "Sure! Here’s a fictional URL with some novel components: https://www.example-novels.com/2023/unique-tales/whimsical-j..."
An living parrot echoing "pieces of eight" cannot do this, it cannot say "pieces of <currency>" or "pieces of <valuable mineral>" even if asked to do that. The LLM training has abstracted some concept of what it means for a text pattern to be a URL and what it means for things to be "novel" and what it means to switch out the components of a URL but keep them individually valid. It can also give a reasonable answer asking for a new kind of protocol. So your position hinges on the word "stochastic" which is used as a slur to mean "the LLM isn't innovating like we do it's just a dice roll of remixing parts it was taught". But if you are arguing that makes it a "stochastic parrot" then you need to consider splitting the atom in its wider context...
> "We didn't "parrot" splitting the atom"
That's because we didn't "split the atom" in one blank-slate experiment with no surrounding context. Rutherford and team disintegrated the atom in 1914-1919 ish, they were building on the surrounding scientific work happening at that time: 1869 Johann Hittorf recognising that there was something coming in a straight line from or near the cathode of a Crookes vacuum tube, 1876 Eugen Goldstein proving they were coming from the cathode and naming them cathode rays (see: Cathode Ray Tube computer monitors), and 1897 J.J Thompson proving the rays are much lighter than the lightest known element and naming them Electrons, the first proof of sub-atomic particles existing. He proposed the model of the atom as a 'plum pudding' (concept parroting). Hey guess who JJ Thomspon was an academic advisor of? Ernest Rutherford! 1911 Rutherford discovery of the atomic nucleus. 1909 Rutherford demonstrated sub-atomic scattering and Millikan determined the charge on an electron. Eugen Goldstein also discovered the anode rays travelling the other way in the Crookes tube and that was picked up by Wilhelm Wien and it became Mass Spectrometry for identifying elements. In 1887 Heinrich Hertz was investigating the Photoelectric effect building on the work of Alexandre Becquerel, Johann Elster, Hans Geitel. Dalton's atomic theory of 1803.
Not to mention Rutherford's 1899 studies of radioactivity, following Henri Becquerel's work on Uranium, following Marie Curie's work on Radium and her suggestion of radioactivity being atoms breaking up, and Rutherford's student Frederick Soddy and his work on Radon, and Paul Villard's work on Gamma Ray emissions from Radon.
When Philipp Lenard was studying cathode rays in the 1890s he bought up all the supply of one phosphorescent material which meant Röntgen had to buy a different one to reproduce the results and bought one which responded to X-Rays as well, and that's how he discovered them - not by pure blank-sheet intelligence but by probability and randomness applied to an earlier concept.
That is, nobody taught humans to split the atom and then humans literally parotted the mechanism and did it, but you attempting to present splitting the atom as a thing which appeared out of nowhere and not remixing any existing concepts is, in your terms, absolute drivel. Literally a hundred years and more of scientists and engineers investigating the subatomic world and proposing that atoms could be split, and trying to work out what's in them by small varyations on the ideas and equipment and experiments seen before, you can just find names and names and names on Wikipedia of people working on this stuff and being inspired by others' work and remixing the concepts in it, and we all know the 'science progresses one death at a time' idea that individual people pick up what they learned and stick with it until they die, and new ideas and progress need new people to do variations on the ideas which exist.
No people didn't learn to build rockets from "seeing a volcano" but if you think there was no inspiration from fireworks, cannons, jellyfish squeezing water out to accelerate, no sudies of orbits from moons and planets, no chemistry experiments, no inspiration from thousands of years of flamethrowers: https://en.wikipedia.org/wiki/Flamethrower#History no seeing explosions moving large things, you're living in a dream
> fireworks, cannons, jellyfish squeezing water out to accelerate, no sudies of orbits from moons and planets, no chemistry experiments, no inspiration from thousands of years of flamethrowers
Fireworks, cannons, chemistry experiments and flamethrowers are all human inventions
And yes, exactly! We studied orbits of moons and planets. We studied animals like Jellyfish. We choose to observe the world, we extracted data, we experimented, we saw what worked, refined, improved, and succeeded
LLMs are not capable of observing anything. They can only regurgitate and remix the information they are fed by humans! By us, because we can observe
An LLM trained on 100% wrong information will always return wrong information for anything you ask it.
Say you train an LLM with the knowledge that fire can burn underwater. It "thinks" that the step by step instructions for building a fire is to pile wood and then pour water on the wood. It has no conflicting information in its model. It cannot go try to build a fire this way and observe that it is wrong. It is a parrot. It repeats the information that you give it. At best it can find some relationships between data points that humans haven't realized might be related
A human could easily go attempt this, realize it doesn't work, and learn from the experience. Humans are not simply parrots. We are capable of exploring our surroundings and internalizing things without needing someone else to tell us how everything works
> That is, nobody taught humans to split the atom and then humans literally parotted the mechanism and did it, but you attempting to present splitting the atom as a thing which appeared out of nowhere and not remixing any existing concepts is, in your terms, absolute drivel
Building on the work of other humans is not parroting
You outlined the absolute genius of humanity building from first principles all the way to splitting the atom and you still think we're just parroting,
I think we disagree what parroting is entirely.
What does the word "understand" mean to you?
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize you do not have the ability to answer the question and declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong it simply regurgitates another incorrect answer ad nauseum (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
I just want to say that's a much better answer than I anticipated!
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realizing that they can't perfectly know all the trivia in the world, exactly the same as any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Nah.
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm nope, now that the web if flooded by LLM generated content it's game over. I can't tell how many times I almost got fooled by recipes &co which seem legit at first but are utter non sense. And now we're feeding that garbage back to where it came from
That's not relevant to what we're discussing, though.
Ironically though an LLM powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle)
That's because they put an LLM over a traditional search engine.
Google Labs has AI Mode now, apparently.
https://labs.google.com/search/experiment/22
Hm, that's not available to me, what is it? If its an LLM over Google, didn't they release that a few months ago already?
US only for now may be the issue?
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
https://support.google.com/websearch/answer/16011537
Very interesting, thank you!
No worries.
What happens if you go directly to https://google.com/aimode ?
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
Perplexity and open-webui+ollama in web search mode answer this question correctly.
LLMs will never be good at specific knowledge unless specifically trained for with narrow "if else" statements.
Its good for broad general overview such as most popular categories of books in the world.
Really? Open-AI says PhD intelligence is just around the corner!
If we were to survey 100 PhDs how many would know correctly that Paul Newman had an alcohol problem.
I would hope that 100% of them would be able to figure out how to find out.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
Looks like you are using the wrong models
https://g.co/gemini/share/ffa5a7cd6f46
I can’t reproduce. Maybe others reported the error and someone adjusted the expected answer, I do not know enough about OpenAI operations to say for sure.
The reason this bothers me is that comments like this reinforce the believes of people that could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that llm/chatbots aren’t good or not useful to them look at their chat log that a multitude of topics that I am sure would impact the result of the query.
Yes. It can be an UX problem. Yes. It can be an algorithmc problem. But they are just tools that can be used wrong and not a perfect mechanical brain.
Testing the query on Kagi
# Quick Answer
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
[^1]: [Posthumous memoir reveals Paul Newman's struggle with alcohol](https://www.express.co.uk/entertainment/films/1688113/paul-n...) [^2]: [Review: Paul Newman memoir stuns with brutal honesty - AP News](https://apnews.com/article/entertainment-reviews-movies-paul...) [^3]: [Was Paul Newman Really an Alcoholic? - The Cinemaholic](https://thecinemaholic.com/was-paul-newman-really-an-alcohol...) [^4]: [Who Paul Newman Was—and Who He Wanted to Be](https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...) [^5]: [Paul Newman's Beer Drinking - Psychology Today](https://www.psychologytoday.com/us/blog/addiction-in-society...) [^6]: [Paul Newman's Beer Drinking | Psychology Today United Kingdom](https://www.psychologytoday.com/gb/blog/addiction-in-society...) [^7]: [The troubled life of Paul Newman | The Spectator](https://www.spectator.co.uk/article/the-troubled-life-of-pau...)
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
We all know this isn't the kind of question you ask an LLM...
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
"known to" !== "known for"
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAIs later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As other's have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
oh no. LLMs aren't up on the latest celebrity gossip. whatever shall we do.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Your calculator won't confidently pretend to conjugate German verbs while doing so incorrectly, though.
Yes but a tool for what? When asked a question individuals that don't already have detailed knowledge of a topic are left with no way to tell if the AI generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Im able to get substantially more coding done than three months ago. This could be largely in the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
Yes I am a better engineer with every release. I think this is mostly empirically validated
I've suggested (from my lowly layman vantage point) that an LLM has some idea of the fidelity of their response to a query — even if only in broad strokes like, "This answer is tracking with some high probability numbers" or "This answer has a very low correspondence with training data".
To that end the LLM could convey as much.
(Anecdotal, sorry:) I was using Claude (not paid) recently and noticed Claude hedging quite a bit when it had not before. Examples:
"Let me be careful about this response since we're discussing a very specific technical detail ..."
"Given how specific that technical detail is, I want to be transparent that while I aim to be accurate, I may hallucinate such precise historical specifications."
I confess my initial reaction was to ask ChatGPT since the answers are more self-assured, ha ha. So perhaps corporate AI are not likely to try and solve this problem of the LLM telling the user when it is on shaky ground. Bad for business.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2] or proponents of search and learning strategies in AI is Moore's Law, short term gains using these strategies will be miniscule. It is over at least a five year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give away to the steady plod of the tortoise.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers, they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performative software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
This seems like a probable end state, but we're going to have to stop calling LLMs "artificial intelligence" in order to get there.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Why not? Objectively speaking LLMs are artificial intelligent. Just because it's not human level intelligence doesn't mean it's not intelligent.
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
The foundational idea of AI is that people think with words. But it is false. People think with things and use words only to point to things. Thinking with words is not possible. It will always be a case of shizophasia.
The "things" you mention may correspond to internal concept representations encoded in the model's weights. See e.g. https://arxiv.org/abs/2206.13289
Not sure if its been fixed by now but a few weeks ago I was in the Golden Gate park and wondered if it was bigger than Central park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden gate park being the bigger size), it then went and said that Central Park was bigger. I was confused, so Googled and sure enough Golden gate park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
Probably because it has read the facts but has no idea how numbers actually work.
Two things can be true at the same time:
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is if you can break down your task into however many small chunks that are constrained by feasiility in time to process , chunk those up into appropriate buckets or even better, place them in-order as though you were doing those steps with your expertise - an extension of self. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
" (a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter (b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . . (b[n]) Where is this likely to fail (c) (Next step that a pentester/cybersecurity researcher would take) "
This has been trial and error in my experience but it has worked great in domains such as financial trading and decision support where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of their selves.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade makes people think the LLMS are technology for them—they aren't. As soon as these things get good enough business owners aren't going to expect workers to use them to be more productive, they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
LLM's are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.
Government announces critical need to invest in AI and sets aside a bunch of money for this purpose.
Suddenly the benchmarks become detached from reality and vendors can claim whatever they want about their "new" products.
Just as a possible explanation, as I feel like I've seen this story before.
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
[0] https://youtu.be/0VLAoVGf_74?si=1YEIHST8yfl2qoGY&t=816
In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.
It’s not even approaching the asymptotic line of promises made at any achievable rate for the amount of cash being thrown at it.
Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…
> where’s the business model?
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
That’s the business model
That’s not a business model, it’s a pipe dream.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPU's, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
NVDA will crash when the AI bubble implodes, and none of those Generative AI companies are actually making money, nor will they. They have already hit limiting returns in LLM improvements after staggering investments and it is clear are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes, > making money, nor will they > They have already hit limiting returns in LLM improvements after staggering investments > and it is clear are nowhere near general intelligence.
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point (no sign of that at openai which are not based on hopium) and are not engaged in fraud like bundling and selling to their own subsidiaries (nvda).
These are of course just opinions, I’m not sure we can know facts about such companies except in retrospect.
Yep. Facts are usually found out during the SEC investigation but we know that isn't going to happen now...
They are doing OK in the sense that they are spending a dollar to earn a dime I suppose.
The ZIRP era ended several years ago.
Yes it did, but the irrational exuberance was ongoing till this trigger.
Now we get to see if Bitcoin’s use value of 0 is really supporting 1.5 trillion market cap and if OpenAI is really worth $300 billion.
I mean softbank just invested in openai, and they’ve never been wrong, right?
You missed the end of the supply chain. Paying users. Who magically disappear below market sustaining levels of sales when asked to pay.
> Going from $1M ARR to $100M ARR in 12 months, Cursor is the fastest growing SaaS company of all time
Just because it's not reaching the insane hype being pushed doesn't mean it's totally useless
I've been here a long time (not this account) and have heard this many times. They all died or became irrelevant.
You’re on a startup forum complaining that vc backed startups don’t have a business model when the business model is the same as it has been for almost 15 years - be a unicorn in your space.
This is not a unicorn. It's a donkey with a dildo strapped on its head.
I never said it was sustainable, and even if it was, OP asked for a business model. Customers don’t need a business model, they’re customers.
The same is true for any non essential good or service.
Than any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on on my neighbor's lemonade stand, with a hefty profit for the neighbor, who is getting paid by me $10 to add a handful of dust in each glass and sell it to the customers for $1. The neighbor accepts. It's a business model, at least until I don't run of existing funds or the last customer leaves in disguist. At which point exactly that silly idea stops being an unsustainable business model and becomes a silly idea? I guess at least as early as I see that the funds are running up, and I need to borrow larger an larger lumps of money each time to keep spinning the wheel...
> Than any silly idea can be a business model.
Indeed it can. The difference between a business model and a viable business model is one word - viable.
If you asked me 18 years ago was "giving away a video game and selling cosmetics" a viable business model I would have laughed at you.If you asked me in 2019 I would probably give you money. If you asked me in 2025, I'd probably laugh at you again.
> and I need to borrow larger an larger lumps of money each time to keep spinning the wheel...
Or you figure out a way to to sell it to your neighbour for $0.50 and he can sell it on for $1.
The play is clear at every level - Nvidia Sell GPUs, OpenAI sell models, and SAAS sell prompts + UI's. Whether or not any of them are viable remains to be seen. Personally, I wouldn't take the bet.
Financial collapse ? Surely we can just roll out AI powered money printers and make them go BRRR /s
Original post here:
https://zeropath.com/blog/on-recent-ai-model-progress
I first used GPT 2.5 many years ago through a google colab notebook. These newer models aren’t all that much better. I’m sure they do better on their tests because they’re trained to beat them but really the biggest “innovation” was a UI and API.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies were to hyper optimise on public to make their share prices go up and bubble keep afloat ... What a unserious thought to maintain ...
There's the politics of the corporations and then there's the business of the science behind LLM's, this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdotal and don't really get to the meat of this question of "progress" and the future of LLM's
So I guess this was written pre-Gemini 2.5
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better, definitely. Are we getting close to them performing unsupervised tasks, I don't think so.
more interesting benchmarks:
- Make as much money as you can in a 24 hour period doing only legal, moral and non-annoying things
- Make a depressed person who calls a suicide hotline feel really happy doing only legal, moral and non-annoying things
- Do something really useful for society with measurable outcomes doing only legal, moral and non-annoying things
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.
I like this bit:
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropics competitors do at this.,
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech-bros are going around telling each other! Are we implying that the problems in the world, the things that humans collectively work on to maintain the society that took us thousands of years to build up, just aren't hard enough to reach the limits of the AI.
Jesus Christ.
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve
Author will need to rewrite now that Llama 4 is out
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well.. they've been caught again and again red handed doing exactly this. Fool me once shame on you, fool me 100 times shame on me.
Hate to say this but the incentive is growth, not progress. Progress is what enabled the growth, but is also extremely hard to plan and deliver. On the other hand, hype is probably somewhat easier and well-tested approach so no surprise lot of the effort goes into marketing. Markets had repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
Fool me once, shame on you...If fooled, you cant get fooled again.
https://www.youtube.com/shorts/LmFN8iENTPc
Missing in the article: "Disclaimer: This article was written before testing Gemini 2.5 PRO"
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
While I really want your version of the future, history and general trajectory and tendency unfortunately suggest otherwise.
Wait this sounds like one of those actors spamming and wasting open source time with fake security issues.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written 1966, but the web version is a little newer. Does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I stepped out of its known range into the extrapolated.
That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation but only our own project's source code to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But, vibe coding always makes my pulse go to 105, from 65 and question my life choices. Since few instructions are rarely ever followed and loops never left once entered. Except for on the first try getting 80% of the structure kinda right, but then getting stuck for the whole workday.
current AI models are searching engines that can somewhat extrapolate data and use in the current context, that's pretty much it
Reading the comments, is it safe to say that LLMs are a digest of the internet which is some update over google search, but with the caveat that you need to double check the results? I mean they basically have some compressed version of almost all the written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over google. Is there a consensus on this?
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitatively leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs
Grok Thinking and Sonnet 3.7 were good to catch mildly complicated bugs and write functions with basic logic. They still made mistake
But recently, Gemini 2.5 pro has been scary good. I liked to made fun of the feel-the-AGI crowd but for the first time a model made me raise an eyebrow
It can one shot unusual function with complicated logic and subtle edge cases
I honestly can’t notice any difference in outdoor quality between GPT 4o and GPT 4.5. I also can’t notice any difference in programming quality in cursor when using Claude 3.7 vs 3.5. I’m told there is a clear difference, but I don’t notice it.
Oh yes, before we've designated all the messy parts of running a society to the machines, like, you know, art.
I'm this case the goal is to kill all the humans who know a lot about keeping other people safe...
> AI lab founders believe they are in a civilizational competition for control of the entire future lightcone, and will be made Dictator of the Universe if they succeed.
Bingo, but I'd argue this is only scratching the surface of how twisted things are.
A lot of the stuff these labs put out (see: Amodei's cult-like blog ramblings) reeks of what I call "sociopathic utopianism" - essentially, the logical extreme of ends-justified means, made worse in the context of AI labs by the singularity pseudo-religion.
They all truly believe that AGI/ASI is possible, imminent, and could lead to utopia... so achieving that goal will surely outweigh any unsavory acts they commit in the pursuit of it.
This is why I think it's possible OpenAI took out a hit on Suchir; getting bogged down in a legal battle could delay the arrival of their machine god messiah. Same for defrauding benchmarks - they just need a "few more rounds" of investor cash, and by the time those run out, they'll surely have AGI on lock!
Fools. I look forward to them all getting prison time.
… deleted … (Sorry the delete isn’t working, meant for another subthread)
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.