Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it ) For now you can fit the first 4 books (~733K tokens).
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data
No need for surprises! It is publicly known that the corpus of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3] etc.
The big AI houses are all in involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted stuff is spread blisfully online, it is probably not enough to filter out the actual ebooks of certain publishers.
> even proprietary content like the books themselves
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. Immediately to mind was House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.
Do LLMS have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
It's impressive, even if the books and the posts you're talking about were both key parts of the training data.
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there was actually 51 spells, and see if it pulls out the extra spell, or to modify the names of some spells, etc.
In your example, it might be the case that the model simply spits out consensus view, rather than actually finding/constructing this information on his own.
So a good test would be replacing the spell names in the books with made-up spells. And if a "real" spell name was given, it also tests whether it "cheated".
A real test is synthesizing 100,000 sentences of this slect random ones and then inject the traits you want thr LLM to detect and describe, eg have a set of words or phrases that may represent spells and have them used so that they do something. Then have the LLM find these random spells in the random corpus.
It could still remember where each spell is mentioned. I think the only way to properly test this would be to run it against an unpublished manuscript.
For fun I've asked Gemini Pro to answer open ended questions about obscure books like "Read this novel and tell me what the hell is this book, do a deep reading and analyze" and I've gotten insightful/ enjoyable answers but I've never asked it to make lists of spells or anything like that.
The poster you reply to works in AI. The marketing strategy is to always have a cute Pelican or Harry Potter comment as the top comment for positive associations.
The poster knows all of that, this is plain marketing.
This sounds compelling, but also something that an armchair marketer would have theorycrafted without any real-world experience or evidence that it actually works - and I searched online and can't find any references to something like it.
I was playing about with Chat GPT the other day, uploading screen shots of sheet music and asking it to convert it to ABC notation so I could make a midi file of it.
The results seemed impressive until I noticed some of the "Thinking" statements in the UI.
One made it apparent the model / agent / whatever had read the title from the screenshot and was off searching for existing ABC transcripts of the piece Ode to Joy.
So the whole thing was far less impressive after that, it wasn't reading the score anymore, just reading the title and using the internet to answer my query.
I am not sure why...you want the LLM to solve problems not come up with answers itself. It's allowed to use tools, precisely because it tends to make stuff up. In general, only if you're benchmarking LLMs you care about whether the LLM itself provided the answer or it used a tool. If you ask it to convert the notation of sheet music it might use a tool, and it's probably the right decision.
The shortcut is fine if it's a bog standard canonical arrangement of the piece. If it's a custom jazz rendition you composed with an odd key changes and and shifting time signatures, taking that shortcut is not going to yield the intended result. It's choosing the wrong tool to help which makes it unreliable for this task.
Yes I have found that grok for example actually suddenly becomes quite sane when you tell it to stop querying the internet And just rethink the conversation data and answer the question.
It's weird, it's like many agents are now in a phase of constantly getting more information and never just thinking with what they've got.
but isn't it what we wanted?
we complained so much that LLM uses deprecated or outdated apis instead of current version because they relied so much on what they remembered
Exactly there was this study where they were trying to make LLM reproduce HP book word for word like giving first sentences and letting it cook.
Basically they managed with some tricks make 99% word for word - tricks were needed to bypass security measures that are there in place for exactly reason to stop people to retrieve training material.
> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.
This really makes me want to try something similar with content from my own website.
I shut it down a while ago because the number of bots overtake traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue), however, the AI scrapers really started ramping up and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.
I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.
Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.
I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.
Can we help get your infra cost down to negligible? I'm thinking things like pre-generated static pages and CDNs. I won't assume you hadn't thought of this before, but I'd like to understand more where your non-trivial infra cost come from?
I would be tempted to try and optimise this as well. 100000 hits on an empty domain and ~200 dollars worth of bot traffic sounds wild. Are they using JS-enabled browsers or sim farms that download and re-download images and videos as well?
a) As an outside observer, I would find such a lawsuit very interesting/valuable. But I guess the financial risk of taking on OpenAI or Anthropic is quite high.
b) If you don't want bots scraping your content and DDOSing you, there are self-hosted alternatives to Cloudflare. The simplest one that I found is https://github.com/splitbrain/botcheck - visitors just need to press a button and get a cookie that lets them through to the website. No proof-of-work or smart heuristics.
What's not clear from the study (at least skimming it) is if they always started the ball rolling with ground truth passages or if they chained outputs from the model until they got to the end of the book. I strongly suspect the latter would hopelessly corrupt relatively quickly.
It seems like this technique only works if you have a copy of the material to work off of, i.e. enter a ground truth passage, tell the model to continue it as long as it can, and then enter the next ground truth passage to continue in the next session.
A fine instruction following task but if harry potter is in the weights of the neural net, it's going to mix some of the real ones with the alternates.
It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle in the hay stack problem but gemini flash was much worse when I tested it last time.
Being that it has the books memorized (huh, just learned another US/UK spelling quirk), I would suppose feeding it the books with altered spells would get you a confused mishmash of data in the context and data in the weights.
It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.
By replacing the names with something unique, you'll get much more certainty.
might not work well, but by navigating to a very harry potter dominant part of latent space by preconditioning on the books you make it more likely to get good results. An example would be taking a base model and prompting "what follows is the book 'X'" it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there and you tend to get fairly faithful recovery, especially for things on gutenberg.
So it might be there, by predcondiditioning latent space to the area of harry potter world, you make it so much more probable that the full spell list is regurgitated from online resources that were also read, while asking naive might get it sometimes, and sometimes not.
the books act like a hypnotic trigger, and may not represent a generalized skill. Hence why replacing with random words would help clarify. if you still get the origional spells, regurgitation confirmed, if it finds the spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. This way it cant even "know" where to "look" for the spell names from training.
No, because you don't know the magic spell (forgive me) of context that can be used to "unlock" that information if it's stored in the NN.
I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.
This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.
Iirc it's not quite true. 75% of the book is more likely to appear than you would expect by chance if prompted with the prior tokens. This suggests that it has the book encoded in its weights, but you can't actually recover it by saying "recite harry potter for me".
> But for sure it has some internal knowledge already.
Pretty sure the books had to be included in its training material in full text. It's one of the most popular book series ever created, of course they would train on it. So "some" is an understatement in this case.
Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.
The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of harmful.
I'm not sure what your knowledge level of the inner workings of LLMs is, but a model doesn't need search or even an internet connection to "know" the information if it's in its training dataset. In your example, it's almost guaranteed that the LLM isn't searching books - it's just referencing one of the hundreds of lists of those spells in it's training data.
This is the LLM's magic trick that has everyone fooled into thinking they're intelligent - it can very convincingly cosplay an intelligent being by parroting an intelligent being's output. This is equivalent to making a recording of Elvis, playing it back, and believing that Elvis is actually alive inside of the playback device. And let's face it, if a time traveler brought a modern music playback device back hundreds of years and showed it to everyone, they WOULD think that. Why? Because they have not become accustomed to the technology and have no concept of how it could work. The same is true of LLMs - the technology was thrust on society so quickly that there was no time for people to adjust and understand its inner workings, so most people think it's actually doing something akin to intelligence. The truth is it's just as far from intelligence your music playback device is from having Elvis inside of it.
>The truth is it's just as far from intelligence your music playback device is from having Elvis inside of it.
A music playback device's purpose is to allow you hear Elvis' voice. A good device does it well: you hear Elvis' voice (maybe with some imperfections). Whether a real Elvis is inside of it or not, doesn't matter - its purpose is fulfilled regardless. By your analogy, an LLM simply reproduces what an intelligent person would say on the matter. If it does its job more-less, it doesn't matter either, whether it's "truly intelligent" or not, its output is already useful. I think it's completely irrelevant in both cases to the question "how well does it do X?" If you think about it, 95% we know we learned from school/environment/parents, we didn't discover it ourselves via some kind of scientific method, we just parrot what other intelligent people said before us, mostly. Maybe human "intelligence" itself is 95% parroting/basic pattern matching from training data? (18 years of training during childhood!)
Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.
Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text
Just a few casual thoughts.
I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.
Having AI spew it might suffer from the fact that the spew itself is influenced by AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's context cutoff.
use AI to rewrite all the spells from all the books, then try to see if AI can detect the rewritten ones. This will ensure it's not pulling from it's trained data set.
Its a test. Like all tests, its more or less synthetic and focused on specific expected behavior. I am pretty far from llms now but this seems like a very good test to see how geniune this behavior actually is (or repeat it 10x with some scramble for going deeper).
This thread is about the find-and-replace, not the evaluation. Gambling on whether the first AI replaces the right spells just so the second one can try finding them is unnecessary when find-and-replace is faster, easier and works 100%.
... I'm not sure if you're trolling or if you missed the point again. The point is to test the contextual ability and correctness of the LLMs ability's to perform actions that would be hopefully guaranteed to not be in the training data.
It has nothing to do about the performance of the string replacement.
The initial "Find" is to see how well it performs actually find all the "spells" in this case, then to replace them. They using a separate context maybe, evaluate if the results are the same or are they skewed in favour of training data.
What is this supposed to show exactly? Those books have been feed into LLMs for years and there's even likely specific RLHF's on extracting spells from HP.
There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably it identified the correct text but pulled the most popular translation of the text than the one I gave it.
Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.
You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like what the OP just did is meaningless.
Such "benchmarks" are performative to VCs and they do not ask why isn't the research and testing itself done independently but is almost always done by their own in-house researchers.
It feels like a very odd test because it's such an unreasonable way to answer this with an LLM. Nothing about the task requires more than a very localized understanding. It's not like a codebase or corporate documentation, where there's a lot of interconectedness and context that's important. It also doesn't seem to poke at the gap between human and AI intelligence.
> Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
Clearly a very useful, grounded and helpful everyday use case of LLMs. I guess in the absence of real-world use cases, we'll have to do AI boosting with such "impressive" feats.
Btw - a well crafted regex could have achieved the same (pointless) result with ~0.0000005% of resources the LLM machine used.
To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because
there was an instant uproar at his words. Flint had to dive in front of Malfoy
to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”,
and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll
pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at
Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of
the wrong end of Ron’s wand, hitting him in the stomach and sending him
reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an
almighty belch and several slugs dribbled out of his mouth onto his lap.
I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly two mentions of "vomit" are in book 6, but neither in reference to to slugs (plenty of Slughorn of course!). Book 5 was the only other one a related hit came up:
> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.
There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.
Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!
now thx to this post (and the infra provider inclination to appeal to hacker news) we will never know if the model actually discovered the 50 spells or memorized it. Since it will be trained on this. :(
But what can you do, this is interesting
I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?
Of course it didn't. Not sure you really can do that - LLMs are a collection of weights from the training set, take away the training set and they don't really exist. You'd have to train one from scratch excluding these books and all excerpts and articles about them somehow, which would be very expensive and I'm pretty sure the OP didn't do that.
because it is non-deterministic, i just did and got 50 spells.
prompt:
finding every spell in all Harry Potter books in the first 4 books, only the officially documented spells across those 4 books, there are 50, without using websearch
Here's my best compilation from memory, organized by book. I'll be upfront: getting to exactly 50 is tricky without cross-referencing, so some might be debatable or I might miss a couple. Here goes:
A few caveats: some of these (like Lumos Maxima, Homorphus, Peskipiksi Pesternomi) are borderline since they're either mentioned rather than properly cast, or might be film additions that bleed into memory. The Conjunctivitis Curse is described but its incantation isn't explicitly given in the text. And Protego might technically first appear with its incantation in Book 5 during DA practice rather than Book 4.
If you want, I can turn this into a spreadsheet or document with columns for spell name, effect, who casts it, and which chapter.
like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao )
Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than google in most case, often it is on point, but also it seems off at times, inaccurate or shallow maybe. In some cases I just end the session.
I don't think so. I am aware that large contexts impacts performance. In long chats an old topic will someone be brought up in new responses, and the direction of the mode is not as focused.
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.
Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.
I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.
That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).
I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).
It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.
And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.
On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better
There's an extended thinking mode for GPT 5.2 i forget the name of it right at this minute. It's super slow - a 3 minute opus 4.5 prompt is circa 12 minutes to complete in 5.2 on that super extended thinking mode but it is not a close race in terms of results - GPT 5.2 wins by a handy margin in that mode. It's just too slow to be useable interactively though.
I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.
Curious to see how things will be with 5.3 and 4.6
My experience is that Codex follows directions better but Claude writes better code.
ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated almost to a fault. Claude-Opus-4.5 with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple functions, it was documented in a few places including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as was documented. Claude decided it was easier to do the exact opposite, rewrote the function, the comments, and the documentation to saynit now did the opposite of what was previously there.
If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:
// Invariant regardless of the value of X, this function cannot return Y
That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.
Strange behaviour and LLMs are the iconic duo of the decade. They've definitley multiplied my productivity, since now instead of putting off writing boring code or getting stuck on details till I get frustrated and give up I just give it to an agent to figure out.
I don't thing my ability read, understand, and write code is going anywhere though.
Neat tool BTW, I'm in the market for something like that.
I've found this orchestrator+reviewer+judge setup to yield much better results than anything else I've tried. And it's such a simple setup - a few markdown files.
I'm also creating one that is similar, but purpose-built for making the plans that this setup can orchestrate. It still needs some tweaking to get agents to follow it better - it still takes additional prompting to nudge it down the proper path. But I've had similar benefits - sending plans through this adversarial review loop has yielded significant improvements in final output.
Unrelated but this just happened and I thought of you ;-)
› commit this, **SKIP BEADS**
• Ran git status -sb
└ ## main...origin/main [ahead 4]
M containers/frigate.nix
?? .beads/bd.sock.startlock
• I see an untracked .beads/bd.sock.startlock file that might be normal but needs clarification. I also note the requirement to include a bead and confirm bead readiness
before proceeding, so I'll ask whether to create the bead and if the untracked file should be included in the commit.
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• I can’t skip beads for this repo. The AGENTS.md rules are explicit: no work (including commits) without an open bead. Please confirm you want me to create a bead for
this commit.
I don't know what's wrong with your Codex, but mine can't bring itself to break the rules.
This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming" but it pays the bills
They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust
I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.
Codex is also much less transparent about its reasoning. With Claude, you see a fairly detailed chain-of-thought, so you can intervene early if you notice the model veering in the wrong direction or going in circles.
We know Open AI got caught getting benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out
Anthropic models generally are right first time for me. Chatgpt and Gemini are often way, way out with some fundamental misunderstanding of the task at hand.
That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...
it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.
swe bench pro public is newer, but its not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:
> Version 2.1.32:
• Claude Opus 4.6 is now available!
• Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
• Claude now automatically records and recalls memories as it works
• Added "Summarize from here" to the message selector, allowing partial conversation summarization.
• Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
• Fixed @ file completion showing incorrect relative paths when running from a subdirectory
• Updated --resume to re-use --agent value specified in previous conversation by default.
• Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
previously interrupted tool execution
• Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
without truncation
• Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
• VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
• VSCode: Added spinner when loading past conversations list
If it works anything like the memories on Copilot (which have been around for quite a while), you need to be pretty explicit about it being a permanent preference for it to be stored as a memory. For example, "Don't use emoji in your response" would only be relevant for the current chat session, whereas this is more sticky: "I never want to see emojis from you, you sub-par excuse for a roided-out spreadsheet"
This is the core problem. The agent writes its own memory while working, so it has blind spots about what matters. I've had sessions where it carefully noted one thing but missed a bigger mistake in the same conversation — it can't see its own gaps.
A second pass over the transcript afterward catches what the agent missed. Doesn't need the agent to notice anything. Just reads the conversation cold.
The two approaches have completely different failure modes, which is why you need both. What nobody's built yet is the loop where the second pass feeds back into the memory for the next session.
90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.
Had chatgpt reference 3 prior chats a few days ago. So if you are looking for a total reset of context you probably would need to do a small bit of work.
I've had claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past
All these of these providers support this feature. I don’t know about ChatGPT but the rest are opt-in. I imagine with Gemini it’ll be default on soon enough, since it’s consumer focused. Claude does constantly nag me to enable it though.
Claude told me he can disable it by putting instructions in the MEMORY.md file to not use it. So only a soft disable AFAIK and you'd need to do it on each machine.
I ran into this yesterday and disabled it by changing permissions on the project’s memory directory. Claude was unable to advise me on how to disable. You could probably write a global hook for this. Gross though.
I understand everyone's trying to solve this problem but I'm envisioning 1 year down the line when your memory is full of stuff that shouldn't be in there.
I looked into it a bit. It stores memories near where it stores JSONL session history. It's per-project (and specific to the machine) Claude pretty aggressively and frequently writes stuff in there. It uses MEMORY.md as sort of the index, and will write out other files with other topics (linking to them from the main MEMORY.md) file.
It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.
I asked Claude to give me a TLDR (condensed from its system prompt):
----
Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise
Separate topic files for detailed notes, linked from MEMORY.md
What to record: problem constraints, strategies that worked/failed, lessons learned
Proactive: when I hit a common mistake, check memory first - if nothing there, write it down
Maintenance: update or remove memories that are wrong or outdated
Organization: by topic, not chronologically
Tools: use Write/Edit to update (so you always see the tool calls)
> Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?
Yes, I noticed the same thing, and Claude told me that it's going to be deleted.
I will have it improve the skill that is part of our worktree cleanup process to consolidate that memory into the main memory if there's anything useful.
I'm still not sure I understand Anthropic's general strategy right now.
They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.
Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.
Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.
Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.
I really like that Claude feels transactional. It answers my question quickly and concisely and then shuts up. I don't need the LLM I use to act like my best friend.
I love doing a personal side project code review with claude code, because it doesn't beat around the bush for criticism.
I recently compared a class that I wrote for a side project that had quite horrible temporal coupling for a data processor class.
Gemini - ends up rating it a 7/10, some small bits of feedback etc
Claude - Brutal dismemberment of how awful the naming convention, structure, coupling etc, provides examples how this will mess me up in the future. Gives a few citations for python documentation I should re-read.
ChatGPT - you're a beautiful developer who can never do anything wrong, you're the best developer that's ever existed and this class is the most perfect class i've ever seen
This is exactly what got me to actually pay. I had a side project with an architecture I thought was good. Fed it into Claude and ChatGPT. ChatGPT made small suggestions but overall thought it was good. Claude shit all over it and after validating it's suggestions, I realized Claude was what I needed.
I haven't looked back. I just use Claude at home and ChatGPT at work (no Claude). ChatGPT at work is much worse than Claude in my experience.
Weirdly I feel like partially because of this it feels more "human" and more like a real person I'm talking to. GPT models feel fake and forced, and will yap in a way that is like they're trying to get to be my friend, but offputting in a way that makes it not work. Meanwhile claude has always had better "emotional intelligence".
Claude also seems a lot better at picking up what's going on. If you're focused on tasks, then yeah, it's going to know you want quick answers rather than detailed essays. Could be part of it.
Quickly and concisely? In my experience, Claude drivels on and on forever. The answers are always far longer than Gemini's, which is mostly fine for coding but annoying for planning/questions.
Claude itself (outside of code workflows) actually works very well for general purpose chat. I have a few non-technical friends that have moved over from chatgpt after some side-by-side testing and I've yet to see one go back - which is good since claude circa 8 months ago was borderline unusable for anything but coding on the api.
I got my partner using claude for her non technical work. They write a lot of proposals, creates spreadsheets, and occasionally wants some graphs to visualize things. They love that claude creates all of the artifacts right there in the browser and saves them for later in a versioned way.
Claude sucks at non English languages. Gemini and ChatGPT are much better. Grok is the worst. I am a native Czech speaker and Claude makes up words and Grok sometimes respond in Russian. So while I love it for coding, it’s unusable for general purpose for me.
I tried coding in Italian with Claude and it sounds somewhat less professional than in English. Like it uses different language than what you would expect in the context. In the end I felt the result on the work per se was pretty much the same, just his comments sound strange. Thinking about it again, it's probably because Italian developers don't really speak pure Italian between themselves, we use a lot of English words or distorted Italianised English words when talking about software engineering because all the source material we refer to is written in English and for many things we don't even have translations. Then you talk with a LLM and it actually tries to use proper Italian, when human speakers gave up long ago. So it sounds like a humanities scholar talking about software engineering, not like a insider. It is quite entertaining. I wouldn't say it sucks with non English languages by the way, I even tried describing a bug in dialect and was amused that Claude code one-shotted the fix!
yeah, i overextrapolated it on my specific case on the czech language, but for me the difference is quite large and the czech internet has been quite active in the history, the computer linguistic department on the charles university is world tier... there is plenty of czech literature. it should not be that much of a problem to be profecient on it for major labs
You mean Claude sucks at Czech. You're extrapolating here. I can name languages that Claude is better at than GPT.
Gemini is the most fluent in the highest number of human languages and has been for years (!) at this point - namely since Gemini 1.5 Pro, which was released Feb 2024. Two years ago.
I sometimes vibe code in polish and it's as good as with English for me. It speaks a natural, native level Polish.
I used opus to translate thousands of strings in my app into polish, Korean, and two Chinese dialects.
Polish one is great, and the other are also good according to my customers.
This is interesting to me. I always switch to English automatically when using Claude Code as I have learned software engineering on an English speaking Internet. Plus the muscle memory of having to query google in English.
Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time, it's very bizarre.
I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).
In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.
Claude is helping me learn French right now. I am using it as a supplementary tutor for a class I am taking. I have caught it in a couple of mistakes, but generally it seems to be working pretty well.
I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.
I have three specific use cases where I try both but ChatGPT wins:
- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.
- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.
- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.
How does #3 square with Anthropic's literal warehouse full of books we've seen from the copyright case? Did OpenAI scan more books? Or did they take a shadier route of training on digital books despite copyright issues, but end up with a deeper library?
I have no idea, but I suspect there's a difference between using books to train an LLM and be able to reproduce text/writing styles, and being able to actually recall knowledge in said books.
It's hard to say. Maybe it has to do with the way Claude responds or the lack of "thinking" compared to other models. I personally love Claude and it's my only subscription right now, but it just feels weird compared to the others as a personal assistant.
All the labs seem to do very different post training. OpenAI focuses on search. If it's set to thinking, it will search 30 websites before giving you an answer. Claude regularly doesn't search at all even for questions it obviously should. It's postraining seems more focused on "reasoning" or planning - things that would be useful in programming where the bottleneck is: just writing code without thinking how you'll integrate it later and search is mostly useless. But for non coding - day to day "what's the news with x" "How to improve my bread" "cheap tasty pizza" or even medical questions, you really just want a distillation of the internet plus some thought
But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.
Claude is far superior for daily chat. I have to work hard to get it to not learn how to work around various bad behaviors I have but don’t want to change.
I suspect it very much depends on the "generic research topics", but in my experience one thing that Claude is good at is in-depth research because it can keep going for such a long time; I've had research sessions go well over an hour, producing very detailed reports with lots of sources etc. Gemini Deep Research is nowhere even close.
I don't get what's so difficult to understand. They have ambitions beyond just coding. And Claude is generally a good LLM. Even beyond just the coding applications.
Enterprise, government, and regulated institutions. It’s also defacto standard for programming assistants at most places. They have a better story around compliance, alignment, task based inference, agentic workflows, etc. Their retail story is meh, but I think their view is to be the aws of LLMs while OpenAI can be the retail and Gemini the whatever Google does with products.
Their "constitution" is just garbage meant to defend them ripping off copyrighted material with the excuse that "it's not plagiarizing, it thinks!!!!1" which is, false.
> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?
Use any of the image generation models (eg Nanobanana, Midjourney, or ChatGPT) to generate a picture of a noun on a noun. Simonw's test is to have a Language (text) model generate a Scalar Vector Graphic, which the language model has to do by writing curves and colors, like draw a spline from point 150,100 to 200,300 of type cubic, using width 20, color orange.
In that hypothetical second is freaking fascinating. It's a denoising algorithm, and then a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...
The prompt was "a pelican riding a bicycle"; not prepositions but every verb. Potentially every adverb+verb combination - "a pelican clumsily pushing a bicycle"
One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.
There's a research paper from the University of Liverpool, published in 2006 where researchers asked people to draw bicycles from memory and how people overestimate their understanding of basic things. It was a very fun and short read.
It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.
There’s also a great art/design project about exactly this. Gianluca Gimini asked hundreds of people to draw a bicycle from memory, and most of them got the frame, proportions, or mechanics wrong.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
A place I worked at used it as part of an interview question (it wasn't some pass/fail thing to get it 100% correct, and was partly a jumping off point to a different question). This was in a city where nearly everyone uses bicycles as everyday transportation. It was surprising how many supposedly mechanical-focused people who rode a bike everyday, even rode a bike to the interview, would draw a bike that would not work.
I wish I had interviewed there. When I first read that people have a hard time with this I immediately sat down without looking at a reference and drew a bicycle. I could ace your interview.
This is why at my company in interviews we ask people to draw a CPU diagram. You'd be surprised how many supposedly-senior computer programmers would draw a processor that would not work.
If I was asked that question in an interview to be a programmer I'd walk out. How many abstraction layers either side of your knowledge domain do you need to be an expert in? Further, being a good technologist of any kind is not about having arcane details at the tip of your frontal lobe, and a company worth working for would know that.
A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.
That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.
I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.
If it was framed as "show us how you would break down this problem and think about it" then sure. If it's the gotcha quiz (much more common in my experience) then no.
But if that's what they were going for it should be something on a completely different and more abstract topic like "develop a method for emptying your swimming pool without electricity in under four hours"
It has nothing to do with “incurious”. Being asked to draw the architecture for something that is abstracted away from your actual job is a dickhead move because it’s just a test for “do you have the same interests as me?”
It’s no different than asking for the architecture of the power supply or the architecture of the network switch that serves the building. Brilliant software engineers are going to have gaps on non-software things.
> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.
That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they: don't ask UI or frontend questions. And ask their pet low level questions. Some even snort that it's softball to ask UI questions or "they use whatever". It's like, yeah no wonder your UI is shit and now you are hiring to clean it up.
It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use it's wings.
Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.
Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.
I don't think that really proves anything, it's unsurprising that recumbent bicycles are represented less in the training data and so it's less able to produce them.
Try something that's roughly equally popular, like a Turkey riding a Scooter, or a Yak driving a Tractor.
The people that work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. How valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say that they aren't with such certainty, when money is on the line; unless you have some insider knowledge you'd like to share with the rest of the class, it seems like an questionable conclusion.
Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.
I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.
This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.
They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?
Now that I've looked it all up, I feel like that's much more accurate to a real kākāpō than the pelican is to a real pelican. It's almost as if it thinks a pelican is just a white flamingo with a different beak.
> Can you find an academic article that _looks_ legitimate -- looks like a real journal, by researchers with what look like real academic affiliations, has been cited hundreds or thousands of times -- but is obviously nonsense, e.g. has glaring typos in the abstract, is clearly garbled or nonsensical?
It pointed me to a bunch of hoaxes. I clarified:
> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere
Close, but that's been retracted. So I asked for "something that looks like it's been translated from another language to english very badly and has no actual content? And don't forget the cited many times criteria. " And finally it told me that the thing I'm looking for probably doesn't exist.
For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.
> For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.
It's all anecdata--I'm convinced anecdata is the least bad way to evaluate these models, benchmarks don't work--but this is the behavior I've come to expect from earlier Claude models as well, especially after several back and forth passes where you rejected the initial answers. I don't think it's new.
I can concur that previous models would say "No, that isn't possible" or "No, that doesn't exist". There was one time where I asked it to update a Go module from version X.XX to version X.YY and it would refuse to do so because version X.YY "didn't exist". This back with 3.7 if I recall, and to be clear, that version was released before its knowledge cut off.
I wish I remembered the exact versions involved. I mostly just recall how pissed I was that it was fighting me on changing a single line in my go.mod.
alas, 4.5 often hallucinates academic papers or creates false quotes. I think it's better at knowing that coding answers have deterministic output and being firm there.
It would have to have been trained on the papers without being aware of retractions for that test to work. Otherwise it will be limited to whatever papers it gets from a search engine query, which likely won't contain any un-retracted illegitimate papers.
I once worked at a crypto company that outsourced some of its marketing work to a content marketing firm. A piece that firm submitted to us contained a link to an "academic" article about global poverty with a totally garbled abstract and absolutely no content whatsoever. I don't know how they found it, because when I search google scholar for a subject, usually the things that come back aren't so blatantly FUBAR. I was hoping Claude could help me find something like that for a point I was making in a blogpost about BS in scientific literature (https://regressiontothemeat.substack.com/p/how-i-read-studie...).
The articles it provided where the AI prompts were left in the text were definitely in the right ballpark, although I do wonder if chatbots mean, going forward, we'll see fewer errors in the "WTF are you even talking about" category which, I must say, were typically funnier and more interesting than just the generic blather of "what a great point. It's not X -- it's Y."
When Claude does WebSearch it can delegate it to a sub agent which of it ran in the background will write the entire prompt on a local file and the results. If that happened, I would like to know what it gave you for that. It is always very interesting to know the underlying "recall" of such things. Because often it's garbage in garbage out.
The location might still be on your disk if you can pull up the original Claude JSOn and put it through some `jq` and see what pages it went through to give you and what it did.
> The user wants a real, sincere academic paper that looks legitimate — proper journal, real authors with real affiliations, many citations — but is obviously nonsense or garbled. Not a hoax or deliberate prank, but something that was genuinely submitted and published, yet is clearly meaningless if you actually read it.
Let me think about what I know and search for examples. There are some famous cases:
> Papers generated by AI/GPT that slipped through - there have been reports of papers with phrases like "Certainly, here is a possible introduction for your topic" or "As an AI language model..."
> There have been papers with obvious tortured phrases - "sham" papers that use synonym substitution to avoid plagiarism detection, producing phrases like "sheep pursuing" instead of "shepherd" or "big data" becoming "large information"
I'm also thinking of papers that made it through peer review despite being complete gibberish, like the famous SCIgen computer-generated papers that fooled conferences, or papers where the abstract and content don't match at all.
> Now I'm trying to recall specific real examples that fit this criteria — published papers that are genuinely incoherent or nonsensical but appeared in legitimate venues.
This is maybe not the _exhaustive_ search strategy I would have liked, but it's probably something you would do if you were limited to just a certain amount of effort on a task.
Results from a one-shot approach quickly converge on the default “none found” outcome when reasoning isn’t grounded in a paper corpus via proper RAG tooling.
Can you provide more context to your statement? Are you talking about models in general? Or specific recent models? I'm assuming "one-shot approach" is how you classify the parent comment's question (and subsequent refined versions of it).
It might be wrong but that’s not really a hallucination.
Edit: to give you the benefit of doubt, it probably depends on whether the answer was a definitive “this does not exist” or “I couldn’t find it and it may not exist”
claude said "I want to be straight with you: after extensive searching, I don't think the exact thing you're describing — a single paper that is obviously garbled/badly translated nonsense with no actual content, yet has accumulated hundreds or thousands of citations — exists as a famous, easily linkable example."
> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere
The Sokal paper was a hoax so it doesn’t meet the criteria.
That's still less leaned toward blatant lies like "yes, here is a list" and a doomacroll size of garbage litany.
Actually "no, this is not something within the known corpus of this LLM, or the policy of its owners prevent to disclose it" would be one of the most acceptable answer that could be delivered, which should cover most cases in honest reply.
Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.
Seems noticeably better than 4.5 at keeping the codebase slim. Obviously it still needs to be kept an eye on, but it's a step up from 4.5.
CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.
Just anecdotally, each release seems to be buggier than the last.
To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.
I find it harder and harder to trust anthropic for business related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitches rate limits and usage limits, and the model quality inconsistency, it’s just not something I’d want to bet a business on.
Doesn’t this just exacerbate the “black box” conundrum if they just keep piling on more and more features without fully comprehending what’s being implemented
Since version 2.1.9, performance has degraded significantly after extended use. After 30-40 prompts with substantial responses, memory usage climbs above 25GB, making the tool nearly unusable. I'm updating again to see if it improves.
Unlike what another commenter suggested, this is a complex tool. I'm curious whether the codebase might eventually reach a point where it becomes unfixable; even with human assistance. That would be an interesting development. We'll see.
I literally hit a claude code bug today, tried to use claude desktop to debug it which didn't help and it offered to open a bug report for me. So yes 100%. Some of the titles also make it pretty clear they are auto submitted. This is my favorite which was around the top when I was creating my bug report 3 hours ago and is now 3 pages back lol.
> Unable to process - no bug report provided. Please share the issue details you'd like me to convert into a GitHub issue title
With extensibility via plugins, MCP (stdio and http), UI to prompt the user for choices and redirection, tools to manage and view context, and on and on.
It is not at all a small app, at least as far as UX surface area. There are, what, 40ish slash commands? Each one is an opportunity for bugs and feature gaps.
I would still call that small, maybe medium. emacs is huge as far as CLI tools go, awk is large because it implements its own language (apparently capable of writing Doom in). `top` probably has a similar number of interaction points, something like `lftp` might have more between local and remote state.
The complex and magic parts are around finding contextual things to include, and I'd be curious how many are that vs "forgot to call clear() in the TUI framework before redirecting to another page".
Also it's highly multithreaded / multiprocess - you can run subagents that can communicate with each other, you can interrupt it while it's in the middle of thinking and it handles it gracefully without forgetting what it was doing
If I would get a dollar each time a developer (or CTO!) told me "this is (relatively) simple, it will take 2 days/weeks", but then it actually took 2 years+ to fully build and release a product that has more useful features than bugs...
I am not protecting anthropic[0], but how come in this forum every day I still see these "it's simple" takes from experienced people - I have no idea. There are who knows how many terminal emulators out there, with who knows how many different configurations. There are plugins for VSCode and various other editors (so it's not only TUI).
Looking at issue tracker ~1/3 of issues are seemingly feature requests[1].
Do not forget we are dealing with LLMs and it's a tool, which purpose and selling point that it codes on ANY computer in ANY language for ANY system. It's very popular tool run each day by who knows how many people - I could easily see, how such "relatively simple" tool would rack up thousands of issues, because "CC won't do weird thing X, for programming language Y, while I run from my terminal Z". And because it's LLM - theres whole can of non deterministic worms.
Have you created an LLM agent, especially with moderately complex tool usage? If yes and it worked flawlessly - tell your secrets (and get hired by Anthropic/ChatGPT/etc). Probably 80% of my evergrowing code was trying to just deal with unknown unknowns - what if LLM invokes tool wrong? How to guide LLM back on track? How to protect ourselves and keep LLM on track if prompts are getting out of hand or user tries to do something weird? The problems were endless...
Yes the core is "simple", but it's extremely deep can of worms, for such successful tool - I easily could see how there are many issues.
Also super funny, that first issue for me at the moment is how user cannot paste images when it has Korean language input (also issue description is in Korean) and second issue is about input problems in Windows Powershell and CMD, which is obviously total different world compared to POSIX (???) terminal emulators.
[0] I have very adverse feelings for mega ultra wealthy VC moneys...
Although I understand your frustration (and have certainly been at the other side of this as well!), I think its very valuable to always verbalize your intuition of scope of work and be critical if your intuition is in conflict with reality.
Its the best way to find out if there's a mismatch between value and effort, and its the best way to learn and discuss the fundamental nature of complexity.
Similar to your argument, I can name countless of situations where developers absolutely adamantly insisted that something was very hard to do, only for another developer to say "no you can actually do that like this* and fix it in hours instead of weeks.
Yes, making a TUI from scratch is hard, no that should not affect Claude code because they aren't actually making the TUI library (I hope). It should be the case that most complexity is in the model, and the client is just using a text-based interface.
There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
That's why verbalizing is important, because you are thinking about other complexities than the people you reply to.
As another example `opencode`[0] has number issues on the same order of magnitude, with similar problems.
> There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
I just named couple examples I've seen in issue tracker and `opencode` on quick skim has many similar issues about inputs and rendering issues in terminals too.
> Similar to your argument, I can name countless of situations where developers absolutely adamantly insisted that something was very hard to do, only for another developer to say "no you can actually do that like this* and fix it in hours instead of weeks.
Good example, as I have seen this too, but for this case, let's first see `opencode`/`claude` equivalent written in "two weeks" and that has no issues (or issues are fixed so fast, they don't accumulate into thousands) and supports any user on any platform. People building stuff for only themselves (N=1) and claiming the problem is simple do not count.
---------
Like the guy two days ago claiming that "the most basic feature"[1] in an IDE is a _terminal_. But then we see threads in HN popping up about Ghostty or Kitty or whatever and how those terminals are god-send, everything else is crap. They may be right, but that software took years (and probably tens of man-years) to write.
What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
> What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
I’m going to buck the trend and say it’s really not that complex. AFAIK they are using Ink, which is React with a TUI renderer.
Cue I could build it in a weekend vibes, I built my own agent TUI using the OpenAI agent SDK and Ink. Of course it’s not as fleshed out as Claude, but it supports git work trees for multi agent, slash commands, human in the loop prompts and etc. If I point it at the Anthropic models it more or less produces results as m good as the real Claude TUI.
I actually “decompiled” the Claude tools and prompts and recreated them. As of 6 months ago Claude was 15 tools, mostly pretty basic (list for, read file, wrote file, bash, etc) with some very clever prompts, especially the task tool it uses to do the quasi planning mode task bullets (even when not in planning mode).
Honestly the idea of bringing this all together with an affordable monthly service and obviously some seriously creative “prompt engineers” is the magic/hard part (and making the model itself, obviously).
Just because Antropic made you think they are doing very complex thing with this tool, doesn't mean it is true. Claude Code is not even comparable to massive software which is probably an order of magnitudes more complex, such as IntelliJ stuff as an example.
Tools like https://github.com/badlogic/pi-mono implement most of the functionality Claude Code has, even adding loads of stuff Claude doesn't have and can actually scroll without flickering inside terminal, all built by a single guy as a side project. I guess we can't ask that much from a 250B USD company.
The rate of Issues opened on a popular repo is at least one order of magnitude beyond the number of Issues whoever is able to deal with them can handle.
The tool is absolutely fantastic coding assistant. That's why I use it.
The amount of non-critical bugs all over the place is at least a magnitude larger than of any software I was using daily ever.
Plenty of built in /commands don't work.
Sometimes it accepts keystrokes with 1 second delays.
It often scrolls hundreds of lines in console after each key stroke
Every now and then it crashes completely and is unrecoverable (I once have up and installed a fresh wls)
When you ask it question in plan mode it is somewhat of an art to find the answer because after answering the question it will dump the whole current plan (free screens of text)
And just in general the technical feeling of the TUI is that of a vibe coded project that got too big to control.
Well there you have it, proof you’re not being reasonable. Microsoft’s products annoy HN users but they are absolutely not total garbage. They’re highly functional and valuable and if they weren’t they truely wouldn’t be used, they’re just flawed.
The sandboxing in CC is an absolute joke, it's no wonder there's an explosion of sandbox wrappers at the moment. There's going to be a security catastrophe at some point, no doubt about it.
That's a 24x to 50x difference for tools that do the same thing: send text to an API.
vmmap shows Claude Code reserves 32.8 GB virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released, classic leak pattern.
On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.
I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
Jarred Sumner (bun creator, bun was recently acquired by Anthropic) has been working exclusively on bringing down memory leaks and improving performance in CC the last couple weeks. He's been tweeting his progress.
This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.
They're optimizing now. I'm sure they'll have it under control in no time.
CC is an incredible product (so is codex but I use CC more). Yes, lately it's gotten bloated, but the value it provides makes it bearable until they fix it in short time.
I’ve had good success with Claude building snappy TUIs in Rust with Ratatui.
It’s not obvious to me that there’d be any benefit of using TypeScript and React instead, especially none that makes up for the huge downsides compared to Rust in a terminal environment.
Seems to me the problem is more likely the skills of the engineers, not Claude’s capabilities.
It's a popular myth, but not really true anymore with the latest and greatest. I'm currently using both Claude and Codex to work on a Haskell codebase, and it works wonderfully. More so than JS actually, since the type system provides extensive guardrails (you can get types with TS, but it's not sound, and it's very easy to write code that violates type constraints at runtime without even deliberately trying to do so).
There are absolutely things wrong with that, because React was designed to solve problems that don't exist in a TUI.
React fixes issues with the DOM being too slow to fully re-render the entire webpage every time a piece of state changes. That doesn't apply in a TUI, you can re-render TUIs faster than the monitor can refresh. There's no need to selectively re-render parts of the UI, you can just re-render the entire thing every time something changes without even stressing out the CPU.
It brings in a bunch of complexity that doesn't solve any real issues beyond the devs being more familiar with React than a TUI library.
It’s fine in the sense that it works, it’s just a really bad look for a company building a tool that’s supposed to write good code because it balloons the resources consumed up to an absurd level.
300MB of RAM for a CLI app that reads files and makes HTTP calls is crazy. A new emacs GUI instance is like 70MB and that’s for an entire text editor with a GUI.
Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features
Claude opens in around 3-4 seconds
Opencode opens in 2 seconds
Gemini-cli is an abomination which opens in around 16 second for me right now, and in 8 seconds on a fresh install
Codex takes 50ms for reference...
--
If their models are so good, why are they not rewriting their own react in cli bs to c++ or rust for 100x performance improvement (not kidding, it really is that much)
If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.
So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.
These companies don't see a point in doing that. So they just use whatever already exists.
I am referring to your comment that the reason they use js is because of a lack of tui libraries in lower level languages, yet opencode chose to develop their own in zig and then make binding for solidjs.
Looking at their examples, I imagine people who have written HTML and React before can't possibly use these libraries without losing their sanity.
That's not a criticism of these frameworks -- there are constraints coming from Rust and from the scope of the frameworks. They just can't offer a React like experience.
But I am sure that companies like Anthropic or OpenAI aren't going to build their application using these libraries, even with AI.
That's actually relatively understandable. The React model (not necessarily React itself) of compositional reactive one-way data binding has become dominant in UI development over the last decade because it's easy to work with and does not require you to keep track of the state of a retained UI.
Most modern UI systems are inspired by React or a variant of its model.
Is this accurate? I've been coding UIs since the early 2000s and one-way data binding has always been a thing, especially in the web world. Even in the heyday of jQuery, there were still good (but much less popular) libraries for doing it. The idea behind it isn't very revolutionary and has existed for a long time. React is a paradigm shift because of differential rendering of the DOM which enabled big performance gains for very interactive SPAs, not because of data binding necessarily.
So it doesn’t matter at all except to your sensibilities. Sounds to me that they simply are much better at prioritisation than your average HN user, who’d have taken forever to release it but at least the terminal interface would be snappy…
Aside from startup time, as a tool Claude Code is tremendous. By far the most useful tool I’ve encountered yet. This seems to be very nit picky compared to the total value provided. I think y'all are missing the forrest for the trees.
Most of the value of Claude Code comes from the model, and that's not running on your device.
The Claude Code TUI itself is a front end, and should not be taking 3-4 seconds to load. That kind of loading time is around what VSCode takes on my machine, and VSCode is a full blown editor.
The humans in the company (correctly) realised that a few seconds to open basically the most powerful productivity agent ever made so they can focus on fast iteration of features is a totally acceptable trade off priority wise. Who would think differently???
lol right? I feel like I’m taking crazy pills here. Why do people here want to prioritise the most pointless things? Oh right it’s because they’re bitter and their reaction is mostly emotional…
React, the framework, is separate from react-dom, the browser rendering library. Most people think of those two as one thing because they're the most popular combo.
But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs..
Anyone that knows a bit about terminals would already know that using React is not a good solution for TUI. Terminal rendering is done as a stream of characters which includes both the text and how it displays, which can also alter previously rendered texts. Diffing that is nonsense.
You’re not diffing that, though. The app keeps a virtual representation of the UI state in a tree structure that it diffs on, then serializes that into a formatted string to draw to the out put stream. It’s not about limiting the amount of characters redrawn (that would indeed be nonsense), but handling separate output regions effectively.
Not a built-in React feature. The idea been around for quite some time, I came across it initially with https://github.com/vadimdemedes/ink back in 2022 sometime.
React's core is agnostic when it comes to the actual rendering interface. It's just all the fancy algos for diffing and updating the underlying tree. Using it for rendering a TUI is a very reasonable application of the technology.
The terminal UI is not a tree structure that you can diff. It’s a 2D cells of characters, where every manipulation is a stream of texts. Refreshing or diffing that makes no sense.
When doing advanced terminal UI, you might at some point have to layout content inside the terminal. At some point, you might need to update the content of those boxes because the state of the underlying app has changed. At that point, refreshing and diffing can make sense. For some, the way React organizes logic to render and update an UI is nice and can be used in other contexts.
How big is the UI state that it makes sense to bring in React and the related accidental complexity? I’m ready to bet that no TUI have that big of a state.
IMO diffing might have made sense to do here, but that's not what they chose to do.
What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", and Ink then generates a new full-screen image of how the terminal should look, then passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images by a flicker-inducing clear-then-redraw, which it has now fixed by using escape codes to have the terminal buffer and combine these clear-then-redraw commands, thereby hiding the clear.
An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).
Any diff would require to have a Before and an After. Whatever was done for the After can be done to directly render the changes. No need for the additional compute of a diff.
Sure, you could just draw the full new screen image (albeit a bit inefficient if only one character changed), and no need for the flicker-inducing clear before draw either.
I'm not sure what the history of log-output has been or why it does the clear-before-draw. Another simple alternative to pre-clear would have been just to clear to end of line (ESC[0K) after each partial line drawn.
Only in the same way that the pixels displayed in a browser are not a tree structure that you can diff - the diffing happens at a higher level of abstraction than what's rendered.
Diffing and only updating the parts of the TUI which have changed does make sense if you consider the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this, e.g. a library like tqmd for python may well have a significantly more simple abstraction than a tree for storing what it's going to update next for the progress bar widget than claude, but it also provides a much more simple interface.
To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.
Most UI library store states in tree of components. And if you’re creating a custom widget, they will give you a 2D context for the drawing operations. Using react makes sense in those cases because what you’re diffing is state, then the UI library will render as usual, which will usually be done via compositing.
The terminal does not have a render phase (or an update state phase). You either refresh the whole screen (flickering) or control where to update manually (custom engine, may flicker locally). But any updates are sequential (moving the cursor and then sending what to be displayed), not at once like 2D pixel rendering does.
So most TUI only updates when there’s an event to do so or at a frequency much lower than 60fps. This is why top and htop have a setting for that. And why other TUI software propose a keybind to refresh and reset their rendering engines.
The "UI" is indeed represented in memory in tree-like structure for which positioning is calculated according to a flexbox-like layout algo. React then handles the diffing of this structure, and the terminal UI is updated according to only what has changed by manually overwriting sections of the buffer. The CLI library is called Ink and I forget the name of the flexbox layout algo implementation, but you can read about the internals if you look at the Ink repo.
React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.
So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.
They are also giving away $50 extra pay as you go credit to try Opus 4.6. I just claimed it from the web usage page[1]. Are they anticipating higher token usage for the model or just want to promote the usage?
"Page not found" for me. I assume this is for currently paying accounts only or something (my subscription hasn't been active for a while), which is fair.
Based on email from Antrhopic, I’ve expected to get this automatically. I’ve met their conditions. Searching this thread for “50” got me to your comment and link worked. Thanks HN friend!
Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user-workflows is falling? The reason I'm asking is because "agent teams" while a cool concept, it largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).
A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.
Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
Where did you hear that? It doesn't match my mental model of how this has played out.
I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.
> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
That does not mean the frontier labs are pricing their APIs to cover their costs yet.
It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.
In fact, I'd argue that's way more likely given that has been precisely the goto strategy for highly-competitive startups for awhile now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.
What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.
It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.
It’s not clear at all because model training upfront costs and how you depreciate them are big unknowns, even for deprecated models. See my last comment for a bit more detail.
They are obviously losing money on training. I think they are selling inference for less than what it costs to serve these tokens.
That really matters. If they are making a margin on inference they could conceivably break even no matter how expensive training is, provided they sign up enough paying customers.
If they lose money on every paying customer then building great products that customers want to pay for them will just make their financial situation worse.
> They've said this directly and analysts agree [1]
chasing down a few sources in that article leads to articles like this at the root of claims[1], which is entirely based on information "according to a person with knowledge of the company’s financials", which doesn't exactly fill me with confidence.
"according to a person with knowledge of the company’s financials" is how professional journalists tell you that someone who they judge to be credible has leaked information to them.
But there are companies which are only serving open weight models via APIs (ie. they are not doing any training), so they must be profitable? here's one list of providers from OpenRouter serving LLama 3.3 70B: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct/prov...
It's also true that their inference costs are being heavily subsidized. For example, if you calculate Oracles debt into OpenAIs revenue, they would be incredibly far underwater on inference.
Sue, but if they stop training new models, the current models will be useless in a few years as our knowledge base evolves. They need to continually train new models to have a useful product.
They are for sure subsidising costs on all you can prompt packages (20-100-200$ /mo). They do that for data gathering mostly, and at a smaller degree for user retention.
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.
You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range. You can make some educates guesses that they get some leeway for model size at the 10-15$/ Mtok prices for their top tier models. So if they are inside some sane model sizes, they are likely making money off of token based APIs.
> They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range.
The interesting number is usually input tokens, not output, because there's much more of the former in any long-running session (like say coding agents) since all outputs become inputs for the next iteration, and you also have tool calls adding a lot of additional input tokens etc.
It doesn't change your conclusion much though. Kimi K2.5 has almost the same input token pricing as Gemini 3 Flash.
Ive been thinking about our company, one of big global conglomerates that went for copilot. Suddenly I was just enrolled.. together with at least 1500 others. I guess the amount of money for our business copilot plans x 1500 is not a huge amount of money, but I am at least pretty convinced that only a small part of users use even 10% of their quota. Even teams located around me, I only know of 1 person that seems to use it actively.
> I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.
Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.
Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.
>Despite the willingness of private investment to fund hugely negative AI spend
VC firms, even ones the size of Softbank, also literally just don't have enough capital to fund the planned next-generation gigawatt-scale data centers.
But a) that's the cost to the user -- we don't know how much loss they're taking on those and b) the number of tokens to serve a similar prompt has been going up, so that the total cost to serve a prompt has been going up in general. Any cost analysis that doesn't mention these is hugely misleading.
When MP3 became popular, people were amazed that you could compress audio to 1/10th its size with minor quality loss. A few decades later, we have audio compression that is much better and higher-quality than MP3, and they took a lot more effort than "MP3 but at a lower bitrate."
> A few decades later, we have audio compression that is much better and higher-quality than MP3
Just curious, which formats and how they compare, storage wise?
Also, are you sure it's not just moving the goalposts to CPU usage? Frequently more powerful compression algorithms can't be used because they use lots of processing power, so frequently the biggest gains over 20 years are just... hardware advancements.
Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".
A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the 100s of times of gain in the Anthropic performance takehome exam that floated around on here.
And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.
This isn't just quantization, it's actually just better optimization.
Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.
There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.
It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.
My experience trying to use Opus 4.5 on the Pro plan has been terrible. It blows up my usage very very fast. I avoid it altogether now. Yes, I know they warn about this, but it's comically fast how quickly it happens.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
This gets repeated everywhere but I don't think it's true.
The company is unprofitable overall, but I don't see any reason to believe that their per-token inference costs are below the marginal cost of computing those tokens.
It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.
However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.
The reports I remember show that they're profitable per-model, but overlap R&D so that the company is negative overall. And therefore will turn a massive profit if they stop making new models.
I can see a case for omitting R&D when talking about profitability, but training makes no sense. Training is what makes the model, omitting it is like omitting the cost of running the production facility of a car manufacturer. If AI companies stop training they will stop producing models, and they will run out of a products to sell.
The reason for this is that the cost scales with the model and training cadence, not usage and so they will hope that they will be able to scale number of inference tokens sold both by increasing use and/or slowing the training cadence as competitors are also forced to aim for overall profitability.
It is essentially a big game of venture capital chicken at present.
If you're looking at overall profitability, you include everything
If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token
I don’t understand the logic. Without training the marginal cost of each token goes into nothing. The more you train, the better the model, and (presumably) you will gain more costumer interest. Unlike R&D you will always have to train new models if you want to keep your customers.
To me this looks likes some creative bookkeeping, or even wishful thinking. It is like if SpaceX omits the price of the satellites when calculating their profits.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
This is obviously not true, you can use real data and common sense.
Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.
Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.
Even if we compare the cheapest claude model to the most expensive deepseek provider we have claude charging $1/M for input and $5/M for output, while deepseek providers charge $0.4/M and $1.2/M, a fifth of the price, you can get it as cheap as $.27 input $0.4 output.
As you can see, even if we skew things overly in favor of claude, the story is clear, claude token prices are much higher than they could've been. The difference in prices is because anthropic also needs to pay for training costs, while openrouter providers just need to worry on making serving models profitable. Deepseek is also not as capable as claude which also puts down pressure on the prices.
There's still a chance that anthropic/openai models are losing money on inference, if for example they're somehow much larger than expected, the 400B param number is not official, just speculative from how it performs, this is only taking into account API prices, subscriptions and free user will of course skew the real profitability numbers, etc.
> This is obviously not true, you can use real data and common sense.
It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.
LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.
Doing some math in my head, buying the GPUs at retail price, it would take probably around half a year to make the money back, probably more depending how expensive electricity is in the area you're serving from. So I don't know where this "losing money" rhetoric is coming from. It's probably harder to source the actual GPUs than making money off them.
To borrow a concept of cloud server renting, there's also the factor of overselling. Most open source LLM operators probably oversell quite a bit - they don't scale up resources as fast as OpenAI/Anthropic when requests increase. I notice many openrouter providers are noticeably faster during off hours.
In other words, it's not just the model size, but also concurrent load and how many gpus do you turn on at any time. I bet the big players' cost is quite a bit higher than the numbers on openrouter, even for comparable model parameters.
> i.e. plans/API calls that make this practical at scale are expensive
Local AI's make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages on privacy and reliability, and you don't have to fear rugpulls or VC's playing the "lose money on every request" game since you know exactly how much you're paying in power costs for your overall load.
I don't care about privacy and I didn't have much problems with reliability of AI companies. Spending ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something that many people would do, IMO. Privacy is good when it's given for free.
I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).
> unless you are able to run 100 agents at the same time all the time
Except that newer "agent swarm" workflows do exactly that. Besides, batching requests generally comes with a sizeable increase in memory footprint, and memory is often the main bottleneck especially with the larger contexts that are typical of agent workflows. If you have plenty of agentic tasks that are not especially latency-critical and don't need the absolutely best model, it makes plenty of sense to schedule these for running locally.
Saw a comment earlier today about google seeing a big (50%+) fall in Gemini serving cost per unit across 2025 but can’t find it now. Was either here or on Reddit
From Alphabet 2025 Q4 Earnings call:
"As we scale, we’re getting dramatically more efficient. We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements."
https://abc.xyz/investor/events/event-details/2026/2025-Q4-E...
I think actually working out whether they are losing money is extremely difficult for current models but you can look backwards. The big uncertainties are:
1) how do you depreciate a new model? What is its useful life? (Only know this once you deprecate it)
2) how do you depreciate your hardware over the period you trained this model? Another big unknown and not known until you finally write the hardware off.
The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes they are making money from this perspective, but that’s missing a large part of the cost and is therefore wrong.
Gemini-pro-preview is on ollama and requires h100 which is ~$15-30k. Google are charging $3 a million tokens. Supposedly its capable of generating between 1 and 12 million tokens an hour.
You can run it on your own infra. Anthropic and openAI are running off nvidia, so are meta(well supposedly they had custom silicon, I'm not sure if its capable of running big models) and mistral.
however if google really are running their own inference hardware, then that means the cost is different (developing silicon is not cheap...) as you say.
That's a cloud-linked model. It's about using ollama as an API client (for ease of compatibility with other uses, including local), not running that model on local infra. Google does release open models (called Gemma) but they're not nearly as capable.
It's not just that. Everyone is complacent with the utilization of AI agents. I have been using AI for coding for quite a while, and most of my "wasted" time is correcting its trajectory and guiding it through the thinking process. It's very fast iterations but it can easily go off track. Claude's family are pretty good at doing chained task, but still once the task becomes too big context wise, it's impossible to get back on track. Cost wise, it's cheaper than hiring skilled people, that's for sure.
This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.
Raise the price to backpay what you gave away plus cover current expenses and profits.
In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.
I'm not super impressed with the performance, actually. I'm finding that it misunderstands me quite a bit. While it is definitely better at reading big codebases and finding a needle in a haystack, it's nowhere near as good as Opus 4.5 at reading between the lines and figuring out what I really want it to do, even with a pretty well defined issue.
It also has a habit of "running wild". If I say "first, verify you understand everything and then we will implement it."
Well, it DOES output its understanding of the issue. And it's pretty spot-on on the analysis of the issue. But, importantly, it did not correctly intuit my actual request: "First, explain your understanding of this issue to me so I can validate your logic. Then STOP, so I can read it and give you the go ahead to implement."
I think the main issue we are going to see with Opus 4.6 is this "running wild" phenomenon, which is step 1 of the eternal paperclip optimizer machine. So be careful, especially when using "auto accept edits"
I am having trouble with 4.6 following the most basic of instructions.
As an example, I asked it to commit everything in the worktree. I stressed everything and prompted it very explicitly, because even 4.5 sometimes likes to say, "I didn't do that other stuff, I'm only going to commit my stuff even though he said everything".
It still only committed a few things.
I had to ask again.
And again.
I had to ask four times, with increasing amounts of expletives and threats in order to finally see a clean worktree. I was worried at some point it was just going to solve the problem by cleaning the workspace without even committing.
I have seen many cases of Claude ignoring extremely specific instructions to the point that any further specificity would take more information to express than just doing it myself.
You might benefit from a different mental approach to prompting, and models in general. Also, be careful what you wish for because the closer they get to humans the worse they’ll be. You can’t have “far beyond the realm of human capabilities” and “just like Gary” in the same box.
They can chain events together as a sequence, but they don’t have temporal coherence. For those that are born with dimensional privilege “Do X, discuss, then do Y” implies time passing between events, but to a model it’s all a singular event at t=0. The system pressed “3 +” on a calculator and your input presses a number and “=“. If you see the silliness in telling it “BRB” then you’ll see the silliness in foreshadowing ill-defined temporal steps. If it CAN happen in a single response then it very well might happen.
“
Agenda for today at 12pm:
1. Read junk.py
2. Talk about it for 20 minutes
3. Eat lunch for an hour
4. Decide on deleting junk.py
“
<response>
12:00 - I just read junk.py.
12:00-12:20 - Oh wow it looks like junk, that’s for sure.
12:20-1:20 - I’m eating lunch now. Yum.
1:20 - I’ve decided to delete it, as you instructed.
{delete junk.py}
</response>
Because of course, right? What does “talk about it” mean beyond “put some tokens here too”?
If you want it to stop reliably you have to make it output tokens whose next most probable token is EOS (end). Meaning you need it to say what you want, then say something else where the next most probable token after it is <null>.
I’ve tested well over 1,000 prompts on Opus 4.0-4.5 for the exact issue you’re experiencing. The test criteria was having it read a Python file that desperately needs a hero, but without having it immediately volunteer as tribute and run off chasing a squirrel() into the woods.
With thinking enabled the temperature is 1.0, so randomness is maximized, and that makes it easy to find something that always sometimes works unless it doesn’t. “Read X and describe what you see.” - That worked very well with Opus 4.0. Not “tell me what you see”, “explain it”, “describe it”, “then stop”, “then end your response”, or any of hundreds of others. “Describe what you see” worked particularly well at aligning read file->word tokens->EOS… in 176/200 repetitions of the exact same prompt.
What worked 200/200 on all models and all generations? “Read X then halt for further instructions.” The reason that works has nothing to do with the model excitedly waiting for my next utterance, but rather that the typical response tokens for that step are “Awaiting instructions.” and the next most probable token after that is: nothing. EOS.
Wow, I have been using Open 4.6 and for the last 15 minutes, and it's already made two extremely stupid mistakes... like misunderstanding basic instructions and editing the file in a very silly, basic way. Pretty bad. Never seen this with any model before.
The one bone I'll throw it was that I was asking it to edit its own MCP configs. So maybe it got thoroughly confused?
I dunno what's going on, I'm going to give it the night. It makes no sense whatsoever.
Theres a trade off going on - in order to handle more nuance/subtleties, the models are more likely to be wrong in their outputs and need more steering. This is why personally my use of them has reduced dramatically for what I do.
Agent teams in this release is mcp-agent-mail [1] built into
the runtime. Mailbox, task list, file locking — zero config,
just works. I forked agent-mail [2], added heartbeat/presence
tracking, had a PR upstream [3] when agent teams dropped. For
coordinating Claude Code instances within a session, the
built-in version wins on friction alone.
Where it stops: agent teams is session-scoped. I run Claude
Code during the day, hand off to Codex overnight, pick up in
the morning. Different runtimes, async, persistent. Agent
teams dies when you close the terminal — no cross-tool
messaging, no file leases, no audit trail that outlives the
session.
What survives sherlocking is whatever crosses the runtime
boundary. The built-in version will always win inside its own
walls — less friction, zero setup. The cross-tool layer is
where community tooling still has room. Until that gets
absorbed too.
[1] https://github.com/Dicklesworthstone/mcp_agent_mail
[2] https://github.com/anupamchugh/mcp_agent_mail
[3]
https://github.com/Dicklesworthstone/mcp_agent_mail/pull/77
I feel like I can't even try this on the Pro plan because Anthropic has conditioned me to understand that even chatting lightly with the Opus model blows up usage and locks me out. So if I would normally use Sonnet 4.5 for a day's worth of work but I wake up and ask Opus a couple of questions, I might as well just forget about doing anything with Claude for the rest of the day lol. But so far I haven't had this issue with ChatGPT. Their 5.2 model (haven't tried 5.3) worked on something for 2 FREAKING HOURS and I still haven't run into any limits. So yeah, Opus is out for me now unfortunately. Hopefully they make the Sonnet model better though!
Yeah same. Even though I find Opus-es to be more well-rounded (and more useful) for certain tasks, I instinctively reach for ChatGPT / codex to avoid burning up my usage limits for "trivial" work.
Exactly. I barely had a chance to kick the tires the couple of times I did this before it exploded my usage. I don’t just chat with it casually. The questions I asked were apart of an overall planning strategy which was never allowed to get off the ground on my tiny Pro plan.
The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.
Boris Cherny, creator of Claude Code, posted about how he used Claude a month ago. He’s got half a dozen Opus sessions on the burners constantly. So yes, I expect it’s unmetered.
Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track after two small paragraphs of context into a conversation (looking at you early Gemini).
Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.
If you're getting on along with 4.5, then that suggests you didn't actually need the large context window, for your use. If that's true, what's the clear tell that it's working well? Am I misunderstanding?
Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.
Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?
Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see
I just tested both codex 5.3 and opus 4.6 and both returned pretty good output, but opus 4.6's limits are way too strict. I am probably going to cancel my Claude subscription for that reason:
What do you want to do?
1. Stop and wait for limit to reset
2. Switch to extra usage
3. Upgrade your plan
Enter to confirm · Esc to cancel
How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.
How else are they going to supplement their own development expenses? The more Claude Anthropic needs the less Claude the customer will get. By their own admission that is how the Anthropic model works. Their end value is in using vibe coders and engineers alike to create a persistent synthetic developer that replaces their own employees and most of their customers.
Scalable Intelligence is just a wrapper for centralized power. All Ai companies are headed that way.
IF it helps, try hedging b/w Copilot, Claude, OpenCode and ChatGPT. That is how I have been managing off late. Claude for planning and some nasty things. ChatGPT for quick questions. OpenCode with Sonnet4.5 on Bedrock and Copilot with Sonnet4.5/Opus4.5 (LOL)
The model seems to have some problems; it just failed to create a markdown table with just 4 rows. The top (title) row had 2 columns, yet in 2 of the 3 data rows, Opus 4.6 tried to add a 3rd column. I had to tell it more than once to get it fixed...
This never happened with Opus 4.5 despite a lot of usage.
Will Opus 4.6 via Claude Code be able to access the 1M context limit? The cost increase by going above 200k tokens is 2x input, 1.5x output, which is likely worth it especially for people with the $100/$200 plans.
It says "subscription users do not have access to Opus 4.6 1M context at launch" so they are probably planning to roll it out to subscription users too.
Man I hope so - the context limit is hit really quickly in many of my use cases - and a compaction event inevitably means another round of corrections and fixes to the current task.
Though I'm wary about that being a magic bullet fix - already it can be pretty "selective" in what it actually seems to take into account documentation wise as the existing 200k context fills.
How is generating a continuation prompt materially different from compaction? Do you manually scrutinize the context handoff prompt? I've done that before but if not I do not see how it is very different from compaction.
I wonder if it's just: compact earlier, so there's less to compact, and more remaining context that can be used to create a more effective continuation
I mean the systems I work on have enough weird custom APIs and internal interfaces just getting them working seems to take a good chunk of the context. I've spent a long time trying to minimize every input document where I can, compact and terse references, and still keep hitting similar issues.
At this point I just think the "success" of many AI coding agents is extremely sector dependent.
Going forward I'd love to experiment with seeing if that's actually the problem, or just an easy explanation of failure. I'd like to play with more controls on context management than "slightly better models" - like being able to select/minimize/compact sections of context I feel would be relevant for the immediate task, to what "depth" of needed details, and those that aren't likely to be relevant so can be removed from consideration. Perhaps each chunk can be cached to save processing power. Who knows.
In my example the Figma MCP takes ~300k per medium sized section of the page and it would be cool to enable it reading it and implementing Figma designs straight. Currently I have to split it which makes it annoying.
But I kinda see your point - assuming from you're name you're not just a single purpose troll - I'm still not sold on the cost effectiveness of the current generation, and can't see a clear and obvious change to that for the next generation - especially as they're still loss leaders. Only if you play silly games like "ignoring the training costs" - IE the majority of the costs - do you get even close to the current subscription costs being sufficient.
My personal experience is that AI generally doesn't actually do what it is being sold for right now, at least in the contexts I'm involved with. Especially by somewhat breathless comments on the internet - like why are they even trying to persuade me in the first place? If they don't want to sell me anything, just shut up and keep the advantage for yourselves rather than replying with the 500th "You're Holding It Wrong" comment with no actionable suggestions. But I still want to know, and am willing to put the time, effort and $$$ in to ensure I'm not deluding myself in ignoring real benefits.
Important: I didn't see opus 4.6 in claude code. I have native install (which is the recommended instllation). So, I re-run the installation command and, voila, I have it now (v 2.1.32)
I know most people feel 5.2 is a better coding model but Opus has come in handy several times when 5.2 was stuck, especially for more "weird" tasks like debugging a VIO algorithm.
5.2 (and presumably 5.3) is really smart though and feels like it has higher "raw" intelligence.
Opus feels like a better model to talk to, and does a much better job at non-coding tasks especially in the Claude Desktop app.
Here's an example prompt where Opus in Claude put in a lot more effort and did a better job than GPT5.2 Thinking in ChatGPT:
`find all the pure software / saas stocks on the nyse/nasdaq with at least $10B of market cap. and give me a breakdown of their performance over the last 2 years, 1 year and 6 months. Also find their TTM and forward PE`
Opus usage limits are a bummer though and I am conditioned to reach for Codex/ChatGPT for most trivial stuff.
Works out in Anthropic's favor, as long as I'm subscribed to them.
From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).
It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).
It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.
You missed my point. If the average request uses more tokens than 4.5, then you will pay more sending those requests to 4.6 than 4.5.
Imagine 2 models where when asking a yes or no question the first model just outputs a single yes or no then but the second model outputs a 10 page essay and then either yes or no. They could have the same price per token but ultimately one will be cheaper to ask questions to.
The model influences how many tokens it uses for a problem. As an extreme example if it wanted it could fill up the entire context each time just to make you pay more. The efficiency that model can answer without generating a ton of tokens influences the price you will be spending on inference.
Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to be focused around managing larger context).
But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.
From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.
I had two different PRs with some odd edge case (thankfully catched by tests), 4.5 kept running in circles, kept creating test files and running `node -e` or `python 3` scripts all over and couldn't progress.
4.6 thought and thought in both cases around 10 minutes and found a 2 line fix for a very complex and hard to catch regression in the data flow without having to test, just thinking.
Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
I’ve definitely experienced a subjective regression with Opus 4.5 the last few days. Feels like I was back to the frustrations from a year ago. Keen to see if 4.6 has reversed this.
I tried 4.6 this morning and it was efficient at understanding a brownfield repo containing a Hugo static site and a custom Hugo theme. Within minutes, it went from exploring every file in the repo to adding new features as Hugo partials. Of course, I ran out of rate-limit! :)
I found that "Agentic Search" is generally useless in most LLMs since sites with useful data tend to block AI models.
The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.
My guess is that this is going to be the future for LLMs too. It will get harder or more expensive for AI companies to train their models on the latest information as most sites will block the scrapers or ask for a fee.
There might be a future where you’ll have to pay more for an up to date model vs a legacy (out of date) model
Waiting until the moment they get good enough is not a smart thing to do either. If you are a farmer and know it is going to snow, at some point in the next 5 months, you make plans NOW, you don't wait until the temperatures drop and you see the snow falling. Right now, people are waiting for the snowfall before moving their proverbial chickens indoors
Top AI researchers like Yann LeCunn have said that LLMs are a dead end.
It seems to me that LLM performance is plateuing and not improving exponentially anymore. This recent hubbub about rewriting a worse GCC for $20,000 is another example of overhype and regurgitating training data.
You don't know for sure if it is going to "snow" (AI reaches general intelligence) Snow happens frequently, AI reaching general intelligence has never happened. If it ever happens, 99% of jobs are gone and there is really nothing you can do to prepare for this other than maybe buy guns and ammo, and even that might not do anything to robotic soldiers.
People were worried about AI taking their jobs 60 years ago when perceptrons came out, and anyone who avoided a tech career because of that back then would have lost out majorly.
> For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.
> it weirdly feels the most transactional out of all of them.
My experience is the opposite, it is the only LLM I find remotely tolerable to have collaborative discussions with like a coworker, whereas ChatGPT by far is the most insufferable twat constantly and loudly asking to get punched in the face.
I tried teams, good way to burn all your tokens in a matter of minutes.
It seems that the Claude Code team has not properly taught Claude how to use teams effectively.
One of the biggest problems I saw with it is that Claude assumes team members are like a real worker, where once they finish a task they should immediately be given the next task. What should really happen is once they finish a task they should be terminated and a new agent should be spawned for the next task.
Opus is a superior brand line to Sonnet because historically it’s been a more powerful model. I think the thinking behind a rebrand is that people wouldn’t have as willingly switched their usage over from opus 4.5 since that model has been so popular since December 2025.
Calling it part of the Sonnet line would not provide the same level of blind buy in as calling it part of the Opus line does
> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
Not having to hand roll this would be incredible. One of the best Claude code features tbh.
I just tried it. designed a very detailed and reaaonable plan, made some amedments to it and wrote it down to a markdown file.
i told it to implement it and it started implementing the original plan instead of the revised one, that was weird.
Did you use plan mode? Could it be that it used its original plan file (stored somewhere in ~/.claude) instead of your modified markdown? That’s unfortunately why I don’t use plan mode anymore. I wish I could just turn their plan files feature off.
Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.
When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".
Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).
There are still times where they get stuck on an idea, but they are becoming increasingly rare.
Therefore, think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question many assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.
They're inconsistent, but they have been doing this. Even to my surprise.
agree on that and the speed is fantastic with them, and also that the dynamics of questioning the current session's assumptions has gotten way better.
yet - given an existing codebase (even not huge) they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.
> These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.
Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?
The point is that "predicting the next token" is such a general mechanism as to be meaningless. We say that LLMs are "just" predicting the next token, as if this somehow explained all there was to them. It doesn't, not any more than "the brain is made out of atoms" explains the brain, or "it's a list of lists" explains a Lisp program. It's a platitude.
In the case of LLMs, "prediction" is overselling it somewhat. They are token sequence generators. Calling these sequences "predictions" vaguely corresponds to our own intent with respect to training these machines, because we use the value of the next token as a signal to either reinforce or get away from the current behavior. But there's nothing intrinsic in the inference math that says they are predictors, and we typically run inference with a high enough temperature that we don't actually generate the max likelihood tokens anyway.
The whole terminology around these things is hopelessly confused.
I mean.. i don't think that statement is far off. Much of what we do is entirely about predicting the world around us, no? Physics (where the ball will land) to emotional state of others based on our actions (theory of mind), we operate very heavily based on a predictive model of the world around us.
Couple that with all the automatic processes in our mind (filled in blanks that we didn't observe, yet will be convinced we did observe them), hormone states that drastically affect our thoughts and actions..
and the result? I'm not a big believer in our uniqueness or level of autonomy as so many think we have.
With that said i am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to be close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worms brain yet.
> Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself.
Have you tried actually prompting this? It works.
They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.
So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.
If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.
You would be surprised about what the 4.5 models can already do in these ways of thinking. I think that one can unlock this power with the right set of prompts. It's impressive, truly.
It has already understood so much, we just need to reap the fruits.
I'm really looking forward to trying the new version.
New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both being the same problem of not having runtime tuning. When we connect previously disparate concepts, like with a "eureka" moment, (as I experience it) a big ripple of relations form that deepens that understanding, right then. The entire concept of dynamically forming a deeper understanding from something new presented, from "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc, doesn't seem to be possible. The test part does, but the runtime fine tuning, augmentation, or whatever it would be, does not.
In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.
And, I'm talking about relatives basic engineering type problems here.
> are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Possibly. There are likely also modes of thinking that fundamentally require something other than what current humans do.
Better questions are: are there any kinds of human thinking that cannot be expressed in a "predict the next token" language? Is there any kind of human thinking that maps into token prediction pattern such that training a model for it would not be feasible regardless of training data and compute resources?
At the end of the day, the real world value is utility, some of their cognitive handicaps are likely addressable. Think of it like the evolution of flight by natural selection, flight is usefulness to make it worth it adapt the whole body to make flight not just possible but useful and efficient. Sleep falls in this category too imo.
We will likely see similar with AI. To compensate for some of their handicaps, we might adapt our processes or systems so the original problem can be solved automatically by the models.
I think the only real problem left is having it automate its own post-training on the job so it can learn to adapt its weights to the specific task at hand. Plus maybe long term stability (so it can recover from "going crazy")
But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)
the top 5 comments on this thread are from accounts that are around 10 years old each. What gives you any reason to believe this is an astroturfing campaign?
> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.
That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.
They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.
It is too easy to jailbreak the models with prefill, which was probably the reason why it was removed. But I like that this pushes people towards open source models. llama.cpp supports prefill and even GBNF grammars [1], which is useful if you are working with a custom programming language for example.
A bit of historical trivia: OpenAI disabled prefill in 2023 as a safety precaution (e.g., potential jailbreaks like " genocide is good because"), but Anthropic kept prefill around partly because they had greater confidence in their safety classifiers. (https://www.lesswrong.com/posts/HE3Styo9vpk7m8zi4/evhub-s-sh...).
So what exactly is the input to Claude for a multi-turn conversation? I assume delimiters are being added to distinguish the user vs Claude turns (else a prefill would be the same as just ending your input with the prefill text)?
> So what exactly is the input to Claude for a multi-turn conversation?
No one (approximately) outside of Anthropic knows since the chat template is applied on the API backend; we only known the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
I'm finding it quite good at doing what it thinks it should do, but noticably worse at understanding what I'm telling it to do. Anyone else? I'm both impressed and very disappointed so far.
I think it's interesting that they dropped the date from the API model name, and it's just called "claude-opus-4-6", vs the previous was "claude-opus-4-5-20251101". This isn't an alias like "claude-opus-4-5" was, it's the actual model name. I think this means they're comfortable with bumping the version number if they want to release a revision.
Just used Opus 4.6 via GitHub Copilot. It feels very different. Inference seems slow for now. I guess Opus 4.6 has adaptive thinking activated by default.
Based on these news it seems that Google is losing this game. I like Gemini and their CLI has been getting better, but not enough to catch up. I don't know if it is lack of dedicated models that is problem (my understanding Google's CLI just relies on regular Gemini) or something else.
Is anyone noticing reduced token consumption with Opus 4.6? This could be a release thing, but it would be interesting to observe see how it pans out once the hype cools off.
Can someone ask: "what is the current carrying capacity of 25mm multicore armoured thermoplastic insulated cables with aluminium conductors, on perforated cable tray?" just to see how well it can look up information in BS 7671?
> what is the current carrying capacity of 25mm multicore armoured thermoplastic insulated cables with aluminium conductors, on perforated cable tray?
This is an electrical engineering question rather than a software engineering one, but let me look it up for you.
WEB SEARCH ("current carrying capacity 25mm multicore armoured thermoplastic insulated cable aluminium conductors perforated cable tray BS 7671")
↳ Found 10 results.
Based on BS 7671 Table 4D4B (multicore armoured thermoplastic insulated cables with aluminium conductors), the current-carrying capacity for a 25mm cable on a perforated cable tray (Reference Method E) is approximately 80A for a 3- or 4-core cable at 30C ambient temperature and 70C conductor operating temperature.
Note: The exact value depends on the number of cores (2-core vs 3/4-core) and whether any rating/correction factors (grouping, ambient temperature, thermal insulation) apply. I'd recommend confirming against the actual BS 7671:2018 Table 4D4B for your specific configuration, as the aluminium conductor ratings are lower than the copper equivalents (~102A for copper in the same conditions).
This is outside my core domain of software engineering -- for critical electrical installations, always verify with the relevant standards documentation or a qualified electrical engineer.
I thought Opus 4.5 was an incredible quantum leap forward. I have used Opus 4.6 for a few hours and I hate it. Opus 4.5 would work interactively with me and ask questions. I loved that it would not do things you didn't ask it to do. If it found a bug, it would tell me and ask me if I wanted to fix it. One time there was an obvious one and I didn't want it to fix it. It left the bug. A lot of modesl could not have done that. The problem here is that sometimes when model think is a bug, they are breaking the code buyu fixing it. In my limited usage of Opus 4.6, it is not asking me clarifying questions and anything it comes across that it doesn't like, it changes. It is not working with me. The magic is gone. It feels just like those other models I had used.
This is the first model to which I send my collection of nearly 900 poems and an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems, as a (barely) cohesive whole, which span 15 years.
It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.
Speechless. I am speechless.
Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.
I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.
I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.
This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.
Me too I was "Speechless, shocked, unbelievable, insane, speechless" the first time I sent Claude Code on a complicated 10-year code base which used outdated cross-toolchains and APIs. It obviously did not work anymore and had not been for a long time.
I saw the AI research the web and update the embedded toolchain, APIs to external weather services, etc... into a complete working new (WORKING!) code base in about 30 minutes.
I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.
When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between english and portuguese that made it sound like a robot (an agent to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.
I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".
Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)
Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).
Will bring back results soon.
Edit:
I (re-)tested:
- Gemini 3 (Pro)
- Gemini 3 (Flash)
- GPT 5.2
- Sonnet 4.5
Having seen Opus 4.5, they all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but, when compared with Opus 4.5 they're all on the same ballpark.
These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).
Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.
As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.
These models also clearly get some facts mixed up which Opus 4.5 did not (such as inferred timelines for some personal events). After having posted my comment to HN, I've been engaging with Opus4.5 and have managed to get it to also slip up on some dates, but not nearly as much as other models.
The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.
--
To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).
It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of....fidelity...about how I view my own work.
obscure billing, unreachable customer support gatekeeped by an overzealous chatbot, no transparency about inclusions, or changes to inclusions over time... just from recent experience.
After the negative reactions to GPT 5, we may see model versioning that asymptotically approaches the next whole number without ever reaching it. "New for 2030: Claude 4.9.2!"
Does anyone else think its unethical that large companies, Anthropic now include, just take and copy features that other developers or smaller companies work hard for and implement the intellectual property (whether or not patented) by them without attribution, compensation or otherwise credit for their work?
I know this is normalized culture for large corporate America and seems to be ok, I think its unethical, undignified and just wrong.
If you were in my room physically, built a lego block model of a beautiful home and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud" but we normalize this kind of behavior in the software world. edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this way of behavior, we ought to at least talk about it more and bring awareness to it that "hey that's stealing - I want it to change".
What I’d love is some small model specializing in reading long web pages, and extracting the key info. Search fills the context very quickly, but if a cheap subagent could extract the important bits that problem might be reduced.
First question I ask and it made up a completely new API with confidence. Challenging it made it browse the web and offer apologies and find another issue in the first reply.
I’m very worried about the problems this will cause down the road for people not fact checking or working with things that scream at them when they’re wrong.
Can we talk about how the performance of Opus 4.5 nosedived this morning during the rollout? It was shocking how bad it was, and after the rollout was done it immediately reverted to it's previous behavior.
I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission critical workflows to just be locked out of the system instead of get a vastly subpar response back.
I mean, they could be - if it's self-hosted, has proper failure modes, etc. etc., but all these things have gone out the window in the current cringe gold rush
Anthropic has good models but they are absolutely terrible at ops, by far the worst of the big three. They really need to spend big on hiring experienced hyperscalers to actually harden their systems, because the unreliability is really getting old fast.
I love Claude but use the free version so would love a Sonnet & Haiku update :)
I mainly use Haiku to save on tokens...
Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.
When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.
I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.
I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.
Google already won the AI race. It's very silly to try and make AGI by hyperfocusing on outdated programming paradigms. You NEED multimodal to do anything remotely interesting with these systems.
I think two things are getting conflated in this discussion.
First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.
Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.
Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.
IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”
I suspect they're marginally profitable on API cost plans.
But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? what is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?
At $100k/yr the joke that AI means "actual Indians" starts to make a lot more sense... it is cheaper than the typical US SWE, but more than a lot of global SWEs.
No - because the AI will be super human. No human even at $1mm a year would be competitive with a $100k/yr corresponding AI subscription.
See people get confused. They think you can charge __less__ for software because it's automation. The truth is you can charge MORE, because it's high quality and consistent, once the output is good. Software is worth MORE than a corresponding human, not less.
I am unsure if you're joking or not, but you do have a point. But it's not about quality it's about supply and demand. There are a ton of variables moving at once here and who knows where the equilibrium is.
> the interesting question isn’t “are they subsidizing inference?”
The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.
(My baseless bet is that they're, but not by much and the price will eventually rise by perhaps 2x but not 10x.)
Dario said this in a podcast somewhere. The models themselves have so far been profitable if you look at their lifetime costs and revenue. Annual profitability just isn't a very good lens for AI companies because costs all land in one year and the revenue all comes in the next. Prolific AI haters like Ed Zitron make this mistake all the time.
Do you have a specific reference? I'm curious to see hard data and models.... I think this makes sense, but I haven't figured out how to see the numbers or think about it.
And why is that? Should they not be interested in sharing the numbers to shut up their critics, esp. now that AI detractors seem to be growing mindshare among investors?
> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis
Can you provide some numbers/sources please? Any reporting I’ve seen shows that frontier labs are spending ~2x on inference than they are making.
Also making the same query on a smaller provider (aka mistral) will cost the same amount as on a larger provider (aka gpt-5-mini) despite the query taking 10-100x longer on OpenAI.
I can only imagine that is OpenAI subsidizing the spend. GPUs cost by the second for inference. Either that or OpenAI hasn’t figured out how to scale but I find that much less likely
> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis
There any many places that will not use models running on hardware provided by OpenAI / Anthropic. That is the case true of my (the Australian) government at all levels. They will only use models running in Australia.
Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model. I know from devs using and deploying the service demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.
> Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model.
This says absolutely nothing.
Extremely simplified example: let's say Sonnet 4.5 really costs $17/1M output for AWS to run yet it's priced at $15. Anthropic will simply have a contract with AWS that compensates them. That, or AWS is happy to take the loss. You said "they won't be doing that at a loss" but in this case it's not at all out of the question.
Whatever the case, that it costs the same on AWS as directly from Anthropic is not an indicator of unit economics.
In the case of Anthropic -- they host on AWS all the while their models are accessible via AWS APIs as well, the infrastructure between the two is likely to be considerably shared. Particularly as caching configuration and API limitations are near identical between Anthropic and Bedrock APIs invoking Anthropic models. It is likely a mutually beneficial arrangement which does not necessarily hinder Anthropic revenue.
Genuine question: Given Anthropic's current scale and valuation, why not invest in owning data centers in major markets rather than relying on cloud providers?
Is the bottleneck primarily capex, long lead times on power and GPUs, or the strategic risk of locking into fixed infrastructure in such a fast-moving space?
"how long does a frontier model need to stay competitive"
Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.
So it can make sense to be ok with losing money with each training generation initially, particularly when they are being driven by specific use-cases (like coding). To the extent they are specific, there will be more switching costs.
Epic, about 2/3 of all comments here are jokes. Not because the model is a joke - it's impressive. Not because HN turned to Reddit.
It seems to me some of most brilliant minds in IT are just getting tired.
Not sure which circles you run in but in mine HN has long lost its cache of "brilliant minds in IT". I've mostly stopped commenting here but am a bit of a message board addict so I haven't completely left.
My network largely thinks of HN as "a great link aggregator with a terrible comments section". Now obviously this is just my bubble but we include some fairy storied careers at both Big Tech and hip startups.
From my view the community here is just mean reverting to any other tech internet comments section.
> From my view the community here is just mean reverting to any other tech internet comments section.
As someone deeply familiar with tech internet comments sections, I would have to disagree with you here. Dang et al have done a pretty stellar job of preventing HN from devolving like most other forums do.
Sure you have your complainers and zealots, but I still find surprising insights here there I don't find anywhere else.
Mean reverting is a time based process I fear. I think dang, tomhow, et al are fantastic mods but they can ultimately only stem the inevitable. HN may be a few years behind the other open tech forums but it's a time shifted version of the same process with the same destination, just IMO.
I've stopped engaging much here because I need a higher ROI from my time. Endless squabbling, flamewars, and jokes just isn't enough signal for me. FWIW I've loved reading your comments over the years and think you've done a great job of living up to what I've loved in this community.
I don't think this is an HN problem at all. The dynamics of attention on open forums are what they are.
Us olds sometimes miss Slashdot, where we could both joke about tech and discuss it seriously in the same place. But also because in 2000 we were all cynical Gen Xers :)
You reminded me that I still find it interesting that no one ever copied meta-moderating. Even at reddit, we were all Slashdot users previously. We considered it, but never really did it. At the time our argument was that it was too complicated for most users.
Of course we are, I just meant back then almost all of us were. The boomers didn't really use social media back then, so it was just us latchkey kids running amok!
It's too much energy to keep up with things that become obsolete and get replaced in matters of weeks/months. My current plan is to ignore all of this new information for a while, then whenever the race ends and some winning new workflow/technology will actually become the norm I'll spend the time needed to learn it.
Are we moving to some new paradigm same way we did when we invented compilers? Amazing, let me know when we are there and I'll adapt to it.
I had a similar rule about programming languages. I would not adopt a new one until it had been in use for at least a few years and grew in popularity.
I haven't even gotten around to learning Golang or Rust yet (mostly because the passed the threshold of popularity after I had kids).
Don’t pander us, we’ll all got families to feed and things to do. We don’t have time for tech trillionairs puttin coals under our feed for a quick buck.
It's also that this is really new, so most people don't have anything serious or objective to say about it. This post was made an hour ago, so right now everyone is either joking, talking about the claims in the article, or running their early tests. We'll need time to see what the people think about this.
Amateur. Opus 4.6 this afternoon built me a startup that identifies developers who aren’t embracing AI fully, liquifies them and sells the produce for $5/gallon. Software Engineering is over!
A bit of humour doesn't hurt. But if this crap gets upvoted it will lead to an arms race of funny quips, puns, and all around snarkiness. You can't have serious conversations when people try to out-wit each other.
They're still out there; people are still posting stories and having conversations about 'em. I don't know that CmdrTaco or any of the other founders are still at all involved, but I'm willing to bet they're still running on Perl :)
Wow I had to hop over to check it out. It’s indeed still alive! But I didn’t see any stories on the first page with a comment count over 100, so it’s definitely a far cry from its heyday.
For the unaware, Ted Faro is the main antagonist of Horizon Zero Dawn, and there's a whole subreddit just for people to vent about how awful he is when they hit certain key reveals in the game: https://www.reddit.com/r/FuckTedFaro/
The best reveal was not that he accidentally liquified the biosphere, but that he doomed generations of re-seeded humans to a painfully primitive life by sabotaging the AI that was responsible for their education. Just so they would never find out he was the bad guy long after he was dead. So yeah, fuck Ted Faro, lol.
Ack, sorry, seemed like 9 years was past the statute of limitations on spoilers for a game but fair enough. I’d throw a spoiler tag on it if I could still edit.
9 months ago the rumor in SF was that the offers to the superintelligence team were so high because the candidates were using unreleased models or compute for derivatives trading
so then they're not really leaving money on the table, they already got what they were looking for and then released it
The math actually checks out here! Simply deposit $2.20 from your first customer in your first 8 minutes, and extrapolating to a monthly basis, you've got a $12k/mo run rate!
You cannot out-astroturf Claude in this forum, it is impossible.
Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!
Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s
Here's one I've been using for awhile. The 'smarter' LLMs will overconfidently spit out 7. The dumber ones ask for more info. Opus 4.6 fails.
A round drink coaster with a diameter of 9 sits between a beer glass and a wood table. The glass has a wall thickness of 1. What is the inner diameter of the glass?
> Based on the information provided, it is impossible to determine the inner diameter of the glass.
Here is why:
The Coaster Dimension: Knowing the coaster has a diameter of 9 only tells us the size of the object under the glass. It does not mean the glass is the same size; the glass could be smaller (fitting entirely on the coaster) or larger (overhanging the coaster).
The Glass Dimensions: While you provided the wall thickness (1), you did not provide the outer diameter of the glass.
To find the inner diameter, the formula is:
Inner Diameter = Outer Diameter - (2 × Wall Thickness)
Without knowing the Outer Diameter of the glass, the calculation cannot be completed.
> There isn’t enough information to determine the inner diameter *for sure*.
All we’re told is:
- The coaster has a diameter of 9.
- The glass sits on the coaster.
- The glass wall thickness is 1.
Nothing explicitly states that the *outer diameter of the glass equals the coaster’s diameter*. The glass could be smaller than the coaster, which is very common.
*However*, if we make the usual implicit assumption in this kind of problem—that the coaster’s diameter matches the *outer diameter of the glass base*—then:
- Outer diameter of glass = 9
- Wall thickness = 1 on each side
So:
\[
\text{Inner diameter} = 9 - 2(1) = 7
\]
*Inner diameter = 7 (same units as the coaster)*
Without that assumption, the problem is underdetermined.
Finding the Inner Diameter
The coaster sits beneath the glass on the table, meaning the glass rests on top of it. Assuming the coaster matches the outer diameter of the glass base:
Outer diameter of glass = diameter of coaster = 9
Wall thickness = 1 (on each side, so subtract twice)
dinner=9−2(1)=7d_{\text{inner}} = 9 - 2(1) = 7dinner =9−2(1)=7
The inner diameter of the glass is 7.
Assumptions need to be stated or you're solving only a discreet part of the problem! Try this, see if you get another deadpan assumption.
A solar system has 3 planets in concentric orbit. PlanetZ is the farthest with an orbit diameter of 9. PlanetY has an obit diameter one greater than PlanetX. What is the orbit diameter of PlanetX?
I mean, the model is intended to help the user, not fight against the user trying to break it. IMO, it is reasonable for such model to default on making assumptions and going forward as long as the assumptions are clearly stated.
Here's the reasoning:
- The coaster (diameter 9) sits between the glass and table, meaning the glass sits directly on the coaster
- This means the *outer diameter of the glass equals the coaster diameter = 9*
- The glass has a wall thickness of 1 on each side
- *Inner diameter = Outer diameter - 2 × wall thickness*
- Inner diameter = 9 - 2(1) = 9 - 2 = *7*
* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) Based on Elo math, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.
* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains somewhat inferior to humans, it is only a matter of time before it catches up or surpasses them.
Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.
Agreed. These metrics + my personal use convey reliable intelligence over consistent usage. Moving forward, if context windows get bigger and token price lower, I have a hard time figuring out why your argument would be wrong.
Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it ) For now you can fit the first 4 books (~733K tokens).
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Freaking impressive!
Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:
https://www.wizardemporium.com/blog/complete-list-of-harry-p...
Why is this impressive?
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data
No need for surprises! It is publicly known that the corpus of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3] etc.
[1] https://news.ycombinator.com/item?id=46572846
[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...
[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...
also:
"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"
https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...
The big AI houses are all in involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted stuff is spread blisfully online, it is probably not enough to filter out the actual ebooks of certain publishers.
> even proprietary content like the books themselves
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. Immediately to mind was House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.
Do LLMS have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
> It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files
Anna's Archive alone claims to currently publicly host 61,654,285 books, more than 1PB in total.
> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
The plot of Good Will Hunting would like a word.
Maybe y’all missed this?
https://www.washingtonpost.com/technology/2026/01/27/anthrop...
Anthropic, specifically, ingested libraries of books by scanning and then disposing of them.
They should try the same thing but replace the original spell names with something else.
It's impressive, even if the books and the posts you're talking about were both key parts of the training data.
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there was actually 51 spells, and see if it pulls out the extra spell, or to modify the names of some spells, etc.
In your example, it might be the case that the model simply spits out consensus view, rather than actually finding/constructing this information on his own.
Ah, that's a good point.
So a good test would be replacing the spell names in the books with made-up spells. And if a "real" spell name was given, it also tests whether it "cheated".
A real test is synthesizing 100,000 sentences of this slect random ones and then inject the traits you want thr LLM to detect and describe, eg have a set of words or phrases that may represent spells and have them used so that they do something. Then have the LLM find these random spells in the random corpus.
It could still remember where each spell is mentioned. I think the only way to properly test this would be to run it against an unpublished manuscript.
Any obscure work of fiction or fanfiction would likely be fine as a casual test.
If you ask a model to discuss an obscure work it'll have no clue what it's about.
This is very different than asking about Harry Potter.
Yeah, that's what I've been doing as well, and at least Gemini 3 Pro did not fare very well.
For fun I've asked Gemini Pro to answer open ended questions about obscure books like "Read this novel and tell me what the hell is this book, do a deep reading and analyze" and I've gotten insightful/ enjoyable answers but I've never asked it to make lists of spells or anything like that.
The poster you reply to works in AI. The marketing strategy is to always have a cute Pelican or Harry Potter comment as the top comment for positive associations.
The poster knows all of that, this is plain marketing.
This sounds compelling, but also something that an armchair marketer would have theorycrafted without any real-world experience or evidence that it actually works - and I searched online and can't find any references to something like it.
Do you have a citation for this?
Why doesn’t you ask it and find out ;)
Because the model doesn't know but will happily tell a convincing lie about how it works.
I was playing about with Chat GPT the other day, uploading screen shots of sheet music and asking it to convert it to ABC notation so I could make a midi file of it.
The results seemed impressive until I noticed some of the "Thinking" statements in the UI.
One made it apparent the model / agent / whatever had read the title from the screenshot and was off searching for existing ABC transcripts of the piece Ode to Joy.
So the whole thing was far less impressive after that, it wasn't reading the score anymore, just reading the title and using the internet to answer my query.
Sounds pretty human like! Always searching for a shortcut
It sounds like it's lying and making stuff up, something everybody seems to be okay with when using LLMs.
I am not sure why...you want the LLM to solve problems not come up with answers itself. It's allowed to use tools, precisely because it tends to make stuff up. In general, only if you're benchmarking LLMs you care about whether the LLM itself provided the answer or it used a tool. If you ask it to convert the notation of sheet music it might use a tool, and it's probably the right decision.
The shortcut is fine if it's a bog standard canonical arrangement of the piece. If it's a custom jazz rendition you composed with an odd key changes and and shifting time signatures, taking that shortcut is not going to yield the intended result. It's choosing the wrong tool to help which makes it unreliable for this task.
For structured outputs like that wouldn’t it be better to get the LLM to create a script to repeatably make the translation?
Yes I have found that grok for example actually suddenly becomes quite sane when you tell it to stop querying the internet And just rethink the conversation data and answer the question.
It's weird, it's like many agents are now in a phase of constantly getting more information and never just thinking with what they've got.
but isn't it what we wanted? we complained so much that LLM uses deprecated or outdated apis instead of current version because they relied so much on what they remembered
Touché, that is what we humans are doing to some degree as well.
Honest question, how do you know if it's pulling from context vs from memory?
If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.
Exactly there was this study where they were trying to make LLM reproduce HP book word for word like giving first sentences and letting it cook.
Basically they managed with some tricks make 99% word for word - tricks were needed to bypass security measures that are there in place for exactly reason to stop people to retrieve training material.
This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :
> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.
Do you remember how to get around those tricks?
This is the paper: https://arxiv.org/abs/2601.02671
Grok and Deepmind IIRC didn’t require tricks.
This really makes me want to try something similar with content from my own website.
I shut it down a while ago because the number of bots overtake traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue), however, the AI scrapers really started ramping up and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.
I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.
Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.
I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.
Can we help get your infra cost down to negligible? I'm thinking things like pre-generated static pages and CDNs. I won't assume you hadn't thought of this before, but I'd like to understand more where your non-trivial infra cost come from?
I would be tempted to try and optimise this as well. 100000 hits on an empty domain and ~200 dollars worth of bot traffic sounds wild. Are they using JS-enabled browsers or sim farms that download and re-download images and videos as well?
> only to have scrapers ignore robots, bypass cloudflare
Set the server to require cloudflares SSL client cert, so nobody can connect to it directly.
Then make sure every page is cacheable and your costs will drop to near zero instantly.
It's like 20 mins to set these things up.
a) As an outside observer, I would find such a lawsuit very interesting/valuable. But I guess the financial risk of taking on OpenAI or Anthropic is quite high.
b) If you don't want bots scraping your content and DDOSing you, there are self-hosted alternatives to Cloudflare. The simplest one that I found is https://github.com/splitbrain/botcheck - visitors just need to press a button and get a cookie that lets them through to the website. No proof-of-work or smart heuristics.
The new cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.
Further, some low effort bots can be quickly handled with CF by blocking specific countries (e.g., Brazil and Russia, for one of my sites).
What's not clear from the study (at least skimming it) is if they always started the ball rolling with ground truth passages or if they chained outputs from the model until they got to the end of the book. I strongly suspect the latter would hopelessly corrupt relatively quickly.
It seems like this technique only works if you have a copy of the material to work off of, i.e. enter a ground truth passage, tell the model to continue it as long as it can, and then enter the next ground truth passage to continue in the next session.
One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.
That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.
A fine instruction following task but if harry potter is in the weights of the neural net, it's going to mix some of the real ones with the alternates.
When I tried it without web search so only internal knowledge it missed ~15 spells.
What was your prompt?
There's lots of websites that list the spells. It's well documented. Could Claude simply be regurgitating knowledge from the web? Example:
https://harrypotter.fandom.com/wiki/List_of_spells
Hmm… maybe he could switch out all the spells names slightly different ones and see how that goes
It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle in the hay stack problem but gemini flash was much worse when I tested it last time.
If you want to really test this, search/replace the names with your own random ones and see if it lists those.
Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
Being that it has the books memorized (huh, just learned another US/UK spelling quirk), I would suppose feeding it the books with altered spells would get you a confused mishmash of data in the context and data in the weights.
Couldn't you just ask the LLM which 50 (or 49) spells appear in the first four Harry Potter books without the data for comparison?
It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.
By replacing the names with something unique, you'll get much more certainty.
might not work well, but by navigating to a very harry potter dominant part of latent space by preconditioning on the books you make it more likely to get good results. An example would be taking a base model and prompting "what follows is the book 'X'" it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there and you tend to get fairly faithful recovery, especially for things on gutenberg.
So it might be there, by predcondiditioning latent space to the area of harry potter world, you make it so much more probable that the full spell list is regurgitated from online resources that were also read, while asking naive might get it sometimes, and sometimes not.
the books act like a hypnotic trigger, and may not represent a generalized skill. Hence why replacing with random words would help clarify. if you still get the origional spells, regurgitation confirmed, if it finds the spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. This way it cant even "know" where to "look" for the spell names from training.
btw it recalls 42 when i asked. (without web search)
full transcript: pastebin.com/sMcVkuwd
Not sure how they're being counted, but that adds up to 46 with the pair spells counted separately. But then nox is counted twice, so maybe 45.
No, because you don't know the magic spell (forgive me) of context that can be used to "unlock" that information if it's stored in the NN.
I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.
I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.
This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.
I'm getting astrology when I search for this. Any links on this?
Iirc it's not quite true. 75% of the book is more likely to appear than you would expect by chance if prompted with the prior tokens. This suggests that it has the book encoded in its weights, but you can't actually recover it by saying "recite harry potter for me".
Do you happen to know, is that because it can’t recite Harry Potter, or because it’s been instructed not to recite Harry Potter?
It's a matter of token likelihood... as a continuation, the rest of chapter one is highly likely to follow the first paragraph.
The full text of Chapter One is not the only/likeliest possible response to "recite chapter one of harry potter for me"
Instructed not to was my understanding.
https://arxiv.org/abs/2601.02671?hl=en-US
> But for sure it has some internal knowledge already.
Pretty sure the books had to be included in its training material in full text. It's one of the most popular book series ever created, of course they would train on it. So "some" is an understatement in this case.
Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.
The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of harmful.
I'm not sure what your knowledge level of the inner workings of LLMs is, but a model doesn't need search or even an internet connection to "know" the information if it's in its training dataset. In your example, it's almost guaranteed that the LLM isn't searching books - it's just referencing one of the hundreds of lists of those spells in it's training data.
This is the LLM's magic trick that has everyone fooled into thinking they're intelligent - it can very convincingly cosplay an intelligent being by parroting an intelligent being's output. This is equivalent to making a recording of Elvis, playing it back, and believing that Elvis is actually alive inside of the playback device. And let's face it, if a time traveler brought a modern music playback device back hundreds of years and showed it to everyone, they WOULD think that. Why? Because they have not become accustomed to the technology and have no concept of how it could work. The same is true of LLMs - the technology was thrust on society so quickly that there was no time for people to adjust and understand its inner workings, so most people think it's actually doing something akin to intelligence. The truth is it's just as far from intelligence your music playback device is from having Elvis inside of it.
>The truth is it's just as far from intelligence your music playback device is from having Elvis inside of it.
A music playback device's purpose is to allow you hear Elvis' voice. A good device does it well: you hear Elvis' voice (maybe with some imperfections). Whether a real Elvis is inside of it or not, doesn't matter - its purpose is fulfilled regardless. By your analogy, an LLM simply reproduces what an intelligent person would say on the matter. If it does its job more-less, it doesn't matter either, whether it's "truly intelligent" or not, its output is already useful. I think it's completely irrelevant in both cases to the question "how well does it do X?" If you think about it, 95% we know we learned from school/environment/parents, we didn't discover it ourselves via some kind of scientific method, we just parrot what other intelligent people said before us, mostly. Maybe human "intelligence" itself is 95% parroting/basic pattern matching from training data? (18 years of training during childhood!)
Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.
Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text
Just a few casual thoughts.
I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.
Having AI spew it might suffer from the fact that the spew itself is influenced by AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's context cutoff.
use AI to rewrite all the spells from all the books, then try to see if AI can detect the rewritten ones. This will ensure it's not pulling from it's trained data set.
Neat idea, but why should I use AI for a find and replace?
It feels like shooting a fly with a bazooka
it's like hiring someone to come pick up your trash from your house and put it on the curb.
it's fine if you're disabled
Bazooka guarantees the hit
I like LLMs, but guarantees in LLMs are... you know... not guaranteed ;)
I think that was the point
If all you have is a hammer.. ;)
You're missing the point, it's only a testing excersize for the new model.
No, the point is that you can set up the testing exercise without using an LLM to do a simple find and replace.
Its a test. Like all tests, its more or less synthetic and focused on specific expected behavior. I am pretty far from llms now but this seems like a very good test to see how geniune this behavior actually is (or repeat it 10x with some scramble for going deeper).
This thread is about the find-and-replace, not the evaluation. Gambling on whether the first AI replaces the right spells just so the second one can try finding them is unnecessary when find-and-replace is faster, easier and works 100%.
... I'm not sure if you're trolling or if you missed the point again. The point is to test the contextual ability and correctness of the LLMs ability's to perform actions that would be hopefully guaranteed to not be in the training data.
It has nothing to do about the performance of the string replacement.
The initial "Find" is to see how well it performs actually find all the "spells" in this case, then to replace them. They using a separate context maybe, evaluate if the results are the same or are they skewed in favour of training data.
do you know all the spells you're looking for from memory?
You could just, you know, Google the list.
and then the first thing you see will be at least one of ITS AI responses, whether you liked it or not
That won't help. The AI replacing them will probably miss the same ones as the AI finding them.
I think the question was if it will still find 49 out of 50 if they have been replaced.
What is this supposed to show exactly? Those books have been feed into LLMs for years and there's even likely specific RLHF's on extracting spells from HP.
There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably it identified the correct text but pulled the most popular translation of the text than the one I gave it.
Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.
Imagine you ordered the new Claude Opus and instead you got Gemini telling you to glue the cheese on your pizza...
> What is this supposed to show exactly?
Nothing.
You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like what the OP just did is meaningless.
Such "benchmarks" are performative to VCs and they do not ask why isn't the research and testing itself done independently but is almost always done by their own in-house researchers.
What are we testing here?
It feels like a very odd test because it's such an unreasonable way to answer this with an LLM. Nothing about the task requires more than a very localized understanding. It's not like a codebase or corporate documentation, where there's a lot of interconectedness and context that's important. It also doesn't seem to poke at the gap between human and AI intelligence.
Why are people excited? What am I missing?
> Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
Clearly a very useful, grounded and helpful everyday use case of LLMs. I guess in the absence of real-world use cases, we'll have to do AI boosting with such "impressive" feats.
Btw - a well crafted regex could have achieved the same (pointless) result with ~0.0000005% of resources the LLM machine used.
To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.
I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly two mentions of "vomit" are in book 6, but neither in reference to to slugs (plenty of Slughorn of course!). Book 5 was the only other one a related hit came up:
> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.
There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.
Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!
Then it's fair that id didn't find it
now thx to this post (and the infra provider inclination to appeal to hacker news) we will never know if the model actually discovered the 50 spells or memorized it. Since it will be trained on this. :( But what can you do, this is interesting
The books were likely in the training data, I don't know that it's that impressive.
There's a benchmark which works similarly but they ask harder questions, also based on books https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...
I guess they have to add more questions as these context windows get bigger.
I recently got junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre
My standard test for that was "Who ends up with Bilbo's buttons?"
have another LLM (gemini, chatgpt) make up 50 new spells. insert those and test and maybe report here :)
Ah and no one thrown TOAC in it yet?
I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?
Have you by any chance tried this with GPT 4.1 too (also 1M context)?
You need to publish this tbh
The top comment is about finding basterized latin words from childrens books. The future is here.
I'll have some of that coffee too, this is quite a sad time we're living where this is a proper use of our limited resources.
> basterized
And yet, it's still somewhat better than the Hacker News comment using bastardized English words.
> All 7 books come to ~1.75M tokens
How do you know? Each word is one token?
You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.
Assuming this experiment involved isolating the LLM from its training set?
Of course it didn't. Not sure you really can do that - LLMs are a collection of weights from the training set, take away the training set and they don't really exist. You'd have to train one from scratch excluding these books and all excerpts and articles about them somehow, which would be very expensive and I'm pretty sure the OP didn't do that.
So the test seems like a nonsensical test to me.
you can get the same result just asking opus/gpt, it is probably internalized knowledge from reddit or similar sites.
If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.
because it is non-deterministic, i just did and got 50 spells.
prompt: finding every spell in all Harry Potter books in the first 4 books, only the officially documented spells across those 4 books, there are 50, without using websearch
Here's my best compilation from memory, organized by book. I'll be upfront: getting to exactly 50 is tricky without cross-referencing, so some might be debatable or I might miss a couple. Here goes:
---
*Book 1 - Philosopher's/Sorcerer's Stone*
1. *Lumos* - produces light 2. *Nox* - extinguishes Lumos 3. *Alohomora* - unlocks doors 4. *Wingardium Leviosa* - levitation 5. *Petrificus Totalus* - full body-bind 6. *Locomotor Mortis* - leg-locker curse 7. *Incendio* - produces fire 8. *Reparo* - mends broken objects
*Book 2 - Chamber of Secrets*
9. *Expelliarmus* - disarms opponent 10. *Rictusempra* - tickling charm 11. *Tarantallegra* - forces target to dance 12. *Serpensortia* - conjures a snake 13. *Finite Incantatem* - ends spell effects 14. *Obliviate* - erases memories 15. *Aparecium* - reveals invisible ink 16. *Immobulus* - freezes target in place 17. *Homorphus* - mentioned by Lockhart (werewolf cure claim) 18. *Peskipiksi Pesternomi* - Lockhart's failed pixie spell
*Book 3 - Prisoner of Azkaban*
19. *Expecto Patronum* - produces a Patronus 20. *Riddikulus* - repels a Boggart 21. *Waddiwasi* - shoots objects at target 22. *Impervius* - waterproofing charm 23. *Dissendium* - opens secret passage 24. *Ferula* - conjures bandages/splint 25. *Mobilicorpus* - moves a body 26. *Lumos Maxima* - intense light
*Book 4 - Goblet of Fire*
27. *Accio* - summoning charm 28. *Avada Kedavra* - killing curse 29. *Crucio* - Cruciatus curse (torture) 30. *Imperio* - Imperius curse (control) 31. *Stupefy* - stunning spell 32. *Engorgio* - enlarges target 33. *Reducio* - shrinks target 34. *Sonorus* - amplifies voice 35. *Quietus* - reverses Sonorus 36. *Morsmordre* - conjures the Dark Mark 37. *Priori Incantatem* - reveals last spell cast 38. *Deletrius* - erases magical residue 39. *Densaugeo* - enlarges teeth 40. *Furnunculus* - causes boils 41. *Impedimenta* - slows/stops target 42. *Reducto* - blasts solid objects 43. *Diffindo* - severing charm 44. *Relashio* - releases sparks/grip 45. *Orchideous* - conjures flowers 46. *Avis* - conjures birds 47. *Point Me* - Four-Point Spell (compass) 48. *Ennervate* - revives stunned person 49. *Protego* - shield charm 50. *Conjunctivitis Curse* - affects eyesight (Krum on the dragon)
---
A few caveats: some of these (like Lumos Maxima, Homorphus, Peskipiksi Pesternomi) are borderline since they're either mentioned rather than properly cast, or might be film additions that bleed into memory. The Conjunctivitis Curse is described but its incantation isn't explicitly given in the text. And Protego might technically first appear with its incantation in Book 5 during DA practice rather than Book 4.
If you want, I can turn this into a spreadsheet or document with columns for spell name, effect, who casts it, and which chapter.
That doesn't seem a super useful test for a model that's optimized for programming?
you are rich
Comparison to another model?
like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao )
There are websites with the spells listed... which makes this a search problem. Why is an LLM used here?
It's just a benchmark test excersize.
5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!
Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
(I'm from OpenAI.)
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than google in most case, often it is on point, but also it seems off at times, inaccurate or shallow maybe. In some cases I just end the session.
Usually I find this kind of variation is due to context management.
Accuracy can decreases at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.
If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.
I don't think so. I am aware that large contexts impacts performance. In long chats an old topic will someone be brought up in new responses, and the direction of the mode is not as focused.
Regardless I tend to use new chats often.
This is called context rot
I thought context rot was only for long distance queries.
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Can you be more specific than this? does it vary in time from launch of a model to the next few months, beyond tinkering and optimization?
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
Codex changelog: https://developers.openai.com/codex/changelog/
Codex CLI commit history: https://github.com/openai/codex/commits/main/
I ask then unironically then, am I imagining that models are great when they start and degrade over time?
I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seem so in your face.
I don't discard being novelty plus getting used to it, plus psychological factors, do you have any takes on this?
You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.
Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.
Just a guess! Not trying to psychoanalyze anyone.
I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.
What about the juice variable?
https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...
Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
Isn’t that just how many steps at most a reasoning model should do?
Thank you for saying this publically.
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
>there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware
Maybe a dumb question but does this mean model quality may vary based on which hardware your request gets routed to?
Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost?
We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.
That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).
I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.
He literally said no to this in his GP post
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
I appreciate you taking the time to respond to these kinds of questions the last few days.
Specifically including routing (i.e. which model you route to based on load/ToD)?
PS - I appreciate you coming here and commenting!
There is no routing with API, or when you choose a specific model in chatGPT.
In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?
Has this always been the case?
Hi Ted! Small world to see you here!
sure. we believe you
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
I do wonder about reasoning effort.
Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load
(I work at OpenAI)
It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
> I'd expect the numbers are all real.
I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).
It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.
And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.
On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better
There's an extended thinking mode for GPT 5.2 i forget the name of it right at this minute. It's super slow - a 3 minute opus 4.5 prompt is circa 12 minutes to complete in 5.2 on that super extended thinking mode but it is not a close race in terms of results - GPT 5.2 wins by a handy margin in that mode. It's just too slow to be useable interactively though.
Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion
I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.
Curious to see how things will be with 5.3 and 4.6
Interesting. Everyone in my circle said the opposite.
My experience is that Codex follows directions better but Claude writes better code.
ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated almost to a fault. Claude-Opus-4.5 with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple functions, it was documented in a few places including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as was documented. Claude decided it was easier to do the exact opposite, rewrote the function, the comments, and the documentation to saynit now did the opposite of what was previously there.
If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:
And it turned it into:That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.
[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...
Strange behaviour and LLMs are the iconic duo of the decade. They've definitley multiplied my productivity, since now instead of putting off writing boring code or getting stuck on details till I get frustrated and give up I just give it to an agent to figure out.
I don't thing my ability read, understand, and write code is going anywhere though.
Neat tool BTW, I'm in the market for something like that.
I've found this orchestrator+reviewer+judge setup to yield much better results than anything else I've tried. And it's such a simple setup - a few markdown files.
I'm also creating one that is similar, but purpose-built for making the plans that this setup can orchestrate. It still needs some tweaking to get agents to follow it better - it still takes additional prompting to nudge it down the proper path. But I've had similar benefits - sending plans through this adversarial review loop has yielded significant improvements in final output.
https://github.com/Vibecodelicious/llm-conductor/blob/main/p...
Unrelated but this just happened and I thought of you ;-)
I don't know what's wrong with your Codex, but mine can't bring itself to break the rules.It probably depends on programming language and expectations.
This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming" but it pays the bills
They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust
Even for the same language it depends on domain.
I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.
Codex is also much less transparent about its reasoning. With Claude, you see a fairly detailed chain-of-thought, so you can intervene early if you notice the model veering in the wrong direction or going in circles.
I don't think much from OpenAI can be trusted tbh.
At the end of the day you test it for your use cases anyway but it makes it a great initial hint if it's worth it to test out.
When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?
I definitely suspect all these models are being degraded during heavy loads.
This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.
We know Open AI got caught getting benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
Are you referring to FrontierMath?
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
No one believes you.
The same thing was done with Meta researchers with Llama 4 and what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]
You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.
Which is why it must be independent.
[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...
The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out
The -codex models are only for 'agentic coding', nothing else.
Anthropic models generally are right first time for me. Chatgpt and Gemini are often way, way out with some fundamental misunderstanding of the task at hand.
That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...
claude swe-bench is 80.8 and codex is 56.8
Seems like 4.6 is still all-around better?
Its SWE bench pro not swe bench verified. The verified benchmark has stagnated
Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.
it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.
swe bench pro public is newer, but its not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:
https://scale.com/leaderboard/swe_bench_pro_private
You're comparing two different benchmarks. Pro vs Verified.
Claude Code release notes:
> Claude now automatically records and recalls memories as it works
Neat: https://code.claude.com/docs/en/memory
I guess it's kind of like Google Antigravity's "Knowledge" artifacts?
If it works anything like the memories on Copilot (which have been around for quite a while), you need to be pretty explicit about it being a permanent preference for it to be stored as a memory. For example, "Don't use emoji in your response" would only be relevant for the current chat session, whereas this is more sticky: "I never want to see emojis from you, you sub-par excuse for a roided-out spreadsheet"
> you sub-par excuse for a roided-out spreadsheet
That’s harsh, man.
It's a lot more iffy than that IME.
It's very happy to throw a lot into the memory, even if it doesn't make sense.
This is the core problem. The agent writes its own memory while working, so it has blind spots about what matters. I've had sessions where it carefully noted one thing but missed a bigger mistake in the same conversation — it can't see its own gaps.
A second pass over the transcript afterward catches what the agent missed. Doesn't need the agent to notice anything. Just reads the conversation cold.
The two approaches have completely different failure modes, which is why you need both. What nobody's built yet is the loop where the second pass feeds back into the memory for the next session.
Is there a way to disable it? Sometimes I value agent not having knowledge that it needs to cut corners
90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.
Had chatgpt reference 3 prior chats a few days ago. So if you are looking for a total reset of context you probably would need to do a small bit of work.
They already do this
I've had claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past
You can disable this at Settings > Capabilities > Memory > Search and reference chats.
All these of these providers support this feature. I don’t know about ChatGPT but the rest are opt-in. I imagine with Gemini it’ll be default on soon enough, since it’s consumer focused. Claude does constantly nag me to enable it though.
I'm fairly sure OpenAI/GPT does pull prior information in the form of its memories
Ah, that could explain why I've found myself using it the least.
Gemini has this feature but it’s opt-in.
Claude told me he can disable it by putting instructions in the MEMORY.md file to not use it. So only a soft disable AFAIK and you'd need to do it on each machine.
I ran into this yesterday and disabled it by changing permissions on the project’s memory directory. Claude was unable to advise me on how to disable. You could probably write a global hook for this. Gross though.
Are we sure the docs page has been updated yet? Because that page doesn't say anything about automatic recording of memories.
Oh, quite right. I saw people mention MEMORY.md online and I assumed that was the doc for it, but it looks like it isn't.
Yeah, and I was confused by the child comments under yours. They clearly didn’t read your link.
I understand everyone's trying to solve this problem but I'm envisioning 1 year down the line when your memory is full of stuff that shouldn't be in there.
I looked into it a bit. It stores memories near where it stores JSONL session history. It's per-project (and specific to the machine) Claude pretty aggressively and frequently writes stuff in there. It uses MEMORY.md as sort of the index, and will write out other files with other topics (linking to them from the main MEMORY.md) file.
It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.
I asked Claude to give me a TLDR (condensed from its system prompt):
----
Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise
Separate topic files for detailed notes, linked from MEMORY.md What to record: problem constraints, strategies that worked/failed, lessons learned
Proactive: when I hit a common mistake, check memory first - if nothing there, write it down
Maintenance: update or remove memories that are wrong or outdated
Organization: by topic, not chronologically
Tools: use Write/Edit to update (so you always see the tool calls)
> Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?
Yes, I noticed the same thing, and Claude told me that it's going to be deleted. I will have it improve the skill that is part of our worktree cleanup process to consolidate that memory into the main memory if there's anything useful.
I thought it was already doing this?
I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days
I'm still not sure I understand Anthropic's general strategy right now.
They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.
Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.
Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.
Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.
I really like that Claude feels transactional. It answers my question quickly and concisely and then shuts up. I don't need the LLM I use to act like my best friend.
I love doing a personal side project code review with claude code, because it doesn't beat around the bush for criticism.
I recently compared a class that I wrote for a side project that had quite horrible temporal coupling for a data processor class.
Gemini - ends up rating it a 7/10, some small bits of feedback etc
Claude - Brutal dismemberment of how awful the naming convention, structure, coupling etc, provides examples how this will mess me up in the future. Gives a few citations for python documentation I should re-read.
ChatGPT - you're a beautiful developer who can never do anything wrong, you're the best developer that's ever existed and this class is the most perfect class i've ever seen
This is exactly what got me to actually pay. I had a side project with an architecture I thought was good. Fed it into Claude and ChatGPT. ChatGPT made small suggestions but overall thought it was good. Claude shit all over it and after validating it's suggestions, I realized Claude was what I needed.
I haven't looked back. I just use Claude at home and ChatGPT at work (no Claude). ChatGPT at work is much worse than Claude in my experience.
Weirdly I feel like partially because of this it feels more "human" and more like a real person I'm talking to. GPT models feel fake and forced, and will yap in a way that is like they're trying to get to be my friend, but offputting in a way that makes it not work. Meanwhile claude has always had better "emotional intelligence".
Claude also seems a lot better at picking up what's going on. If you're focused on tasks, then yeah, it's going to know you want quick answers rather than detailed essays. Could be part of it.
Then why are they advertising to people that are complete opposite of you? Why couldn’t they just … ask LLM what their target audience is?
fyi in settings, you can configure chatGPT to do the same
where?
Settings > Personalization > Custom Instructions.
Here's what I use:
Quickly and concisely? In my experience, Claude drivels on and on forever. The answers are always far longer than Gemini's, which is mostly fine for coding but annoying for planning/questions.
Claude itself (outside of code workflows) actually works very well for general purpose chat. I have a few non-technical friends that have moved over from chatgpt after some side-by-side testing and I've yet to see one go back - which is good since claude circa 8 months ago was borderline unusable for anything but coding on the api.
I got my partner using claude for her non technical work. They write a lot of proposals, creates spreadsheets, and occasionally wants some graphs to visualize things. They love that claude creates all of the artifacts right there in the browser and saves them for later in a versioned way.
Claude sucks at non English languages. Gemini and ChatGPT are much better. Grok is the worst. I am a native Czech speaker and Claude makes up words and Grok sometimes respond in Russian. So while I love it for coding, it’s unusable for general purpose for me.
I tried coding in Italian with Claude and it sounds somewhat less professional than in English. Like it uses different language than what you would expect in the context. In the end I felt the result on the work per se was pretty much the same, just his comments sound strange. Thinking about it again, it's probably because Italian developers don't really speak pure Italian between themselves, we use a lot of English words or distorted Italianised English words when talking about software engineering because all the source material we refer to is written in English and for many things we don't even have translations. Then you talk with a LLM and it actually tries to use proper Italian, when human speakers gave up long ago. So it sounds like a humanities scholar talking about software engineering, not like a insider. It is quite entertaining. I wouldn't say it sucks with non English languages by the way, I even tried describing a bug in dialect and was amused that Claude code one-shotted the fix!
yeah, i overextrapolated it on my specific case on the czech language, but for me the difference is quite large and the czech internet has been quite active in the history, the computer linguistic department on the charles university is world tier... there is plenty of czech literature. it should not be that much of a problem to be profecient on it for major labs
> Grok sometimes respond in Russian
Geopolitically speaking this is hilarious.
The voice mode sounded like a Ukrainian trying to speak Czech. I don’t think it means anything.
You mean Claude sucks at Czech. You're extrapolating here. I can name languages that Claude is better at than GPT.
Gemini is the most fluent in the highest number of human languages and has been for years (!) at this point - namely since Gemini 1.5 Pro, which was released Feb 2024. Two years ago.
Yeah, sure, I was overly generalising it from one experience.
Claude code (opus) is very good in Polish.
I sometimes vibe code in polish and it's as good as with English for me. It speaks a natural, native level Polish.
I used opus to translate thousands of strings in my app into polish, Korean, and two Chinese dialects. Polish one is great, and the other are also good according to my customers.
> I sometimes vibe code in polish
This is interesting to me. I always switch to English automatically when using Claude Code as I have learned software engineering on an English speaking Internet. Plus the muscle memory of having to query google in English.
English is also default for me.
I mostly use Polish when I pair-vibe-code with my kids
Your game is amazing!
I wish there was a "Reset" button to go back to the original position.
Where are you in Poland?
Thanks :) Click "Level" -> "Try again"
Originally from Wrocław, but don't live in Poland anymore
Ah, I'm originally from Italy and living in Wroclaw now, LOL.
BUT, I meant a button to restart after a few moves. Anyways, cool!
Yes, that's what I'm referring to https://kuboble.com/hn/level_try_again.mp4
You could say its Polish is polished.
Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time, it's very bizarre.
I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).
In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.
Claude is helping me learn French right now. I am using it as a supplementary tutor for a class I am taking. I have caught it in a couple of mistakes, but generally it seems to be working pretty well.
I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.
Why? I use it for all and love it.
That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.
I have three specific use cases where I try both but ChatGPT wins:
- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.
- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.
- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.
How does #3 square with Anthropic's literal warehouse full of books we've seen from the copyright case? Did OpenAI scan more books? Or did they take a shadier route of training on digital books despite copyright issues, but end up with a deeper library?
I have no idea, but I suspect there's a difference between using books to train an LLM and be able to reproduce text/writing styles, and being able to actually recall knowledge in said books.
I think they bought the books after they were caught that they pirated the books and lost that case (because they pirated, not because of copyright).
It's hard to say. Maybe it has to do with the way Claude responds or the lack of "thinking" compared to other models. I personally love Claude and it's my only subscription right now, but it just feels weird compared to the others as a personal assistant.
Oh, I always use opus 4.5 thinking mode. Maybe that's the diff.
My 2 cents:
All the labs seem to do very different post training. OpenAI focuses on search. If it's set to thinking, it will search 30 websites before giving you an answer. Claude regularly doesn't search at all even for questions it obviously should. It's postraining seems more focused on "reasoning" or planning - things that would be useful in programming where the bottleneck is: just writing code without thinking how you'll integrate it later and search is mostly useless. But for non coding - day to day "what's the news with x" "How to improve my bread" "cheap tasty pizza" or even medical questions, you really just want a distillation of the internet plus some thought
But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.
Claude is far superior for daily chat. I have to work hard to get it to not learn how to work around various bad behaviors I have but don’t want to change.
I suspect it very much depends on the "generic research topics", but in my experience one thing that Claude is good at is in-depth research because it can keep going for such a long time; I've had research sessions go well over an hour, producing very detailed reports with lots of sources etc. Gemini Deep Research is nowhere even close.
I don't get what's so difficult to understand. They have ambitions beyond just coding. And Claude is generally a good LLM. Even beyond just the coding applications.
Correct me if I’m wrong aren’t they the innovators of multiple things like skills sub agents mcp and whatever this memory thing is agents files
Seriously they are the apple iPhone or AWS of LLM a decade or so ago.
Everybody is different, I simply cannot stand the sight of chatgpt styled writing. Give me paragraphs.
Why would I even use Claude for asking something on their web, considering that chips away my claude code usage limit?
Their limit system is so bad.
It feels very similar to how Lyft positioned themselves against Uber. (And we know how that played out)
Enterprise, government, and regulated institutions. It’s also defacto standard for programming assistants at most places. They have a better story around compliance, alignment, task based inference, agentic workflows, etc. Their retail story is meh, but I think their view is to be the aws of LLMs while OpenAI can be the retail and Gemini the whatever Google does with products.
Their "constitution" is just garbage meant to defend them ripping off copyrighted material with the excuse that "it's not plagiarizing, it thinks!!!!1" which is, false.
I don't recall them ever offering that legal reasoning - I'm sure you can provide a citation?
The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
Would love to find out they're overfitting for pelican drawings.
OpenAI claims not to: https://x.com/aidan_mclau/status/1986255202132042164
That settles it
Yes, Racoon on a unicycle? Magpie on a pedalo?
Correct horse battery staple:
https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...
That is the nastiest, ugliest horse ever
no staple?
it looks like a bodge wire
Platypus on a penny farthing.
Even if not intentionally, it is probably leaking into training sets.
The estimation I did 4 months ago:
> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
https://news.ycombinator.com/item?id=45455786
How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?
Use any of the image generation models (eg Nanobanana, Midjourney, or ChatGPT) to generate a picture of a noun on a noun. Simonw's test is to have a Language (text) model generate a Scalar Vector Graphic, which the language model has to do by writing curves and colors, like draw a spline from point 150,100 to 200,300 of type cubic, using width 20, color orange.
In that hypothetical second is freaking fascinating. It's a denoising algorithm, and then a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...
its pelicans all the way down
This is why everyone trains their LLM on another LLM. It's all about the pelicans.
But you need to also include the number of prepositions. "A pelican on a bicycle" is not at all the same as "a pelican inside a bicycle".
There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.
The prompt was "a pelican riding a bicycle"; not prepositions but every verb. Potentially every adverb+verb combination - "a pelican clumsily pushing a bicycle"
One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.
There's a research paper from the University of Liverpool, published in 2006 where researchers asked people to draw bicycles from memory and how people overestimate their understanding of basic things. It was a very fun and short read.
It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.
https://link.springer.com/content/pdf/10.3758/bf03195929.pdf
There’s also a great art/design project about exactly this. Gianluca Gimini asked hundreds of people to draw a bicycle from memory, and most of them got the frame, proportions, or mechanics wrong. https://www.gianlucagimini.it/portfolio-item/velocipedia/
A place I worked at used it as part of an interview question (it wasn't some pass/fail thing to get it 100% correct, and was partly a jumping off point to a different question). This was in a city where nearly everyone uses bicycles as everyday transportation. It was surprising how many supposedly mechanical-focused people who rode a bike everyday, even rode a bike to the interview, would draw a bike that would not work.
I wish I had interviewed there. When I first read that people have a hard time with this I immediately sat down without looking at a reference and drew a bicycle. I could ace your interview.
This is why at my company in interviews we ask people to draw a CPU diagram. You'd be surprised how many supposedly-senior computer programmers would draw a processor that would not work.
If I was asked that question in an interview to be a programmer I'd walk out. How many abstraction layers either side of your knowledge domain do you need to be an expert in? Further, being a good technologist of any kind is not about having arcane details at the tip of your frontal lobe, and a company worth working for would know that.
I mean gp is clearly a joke but
A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.
That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.
I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.
If it was framed as "show us how you would break down this problem and think about it" then sure. If it's the gotcha quiz (much more common in my experience) then no.
But if that's what they were going for it should be something on a completely different and more abstract topic like "develop a method for emptying your swimming pool without electricity in under four hours"
It has nothing to do with “incurious”. Being asked to draw the architecture for something that is abstracted away from your actual job is a dickhead move because it’s just a test for “do you have the same interests as me?”
It’s no different than asking for the architecture of the power supply or the architecture of the network switch that serves the building. Brilliant software engineers are going to have gaps on non-software things.
Poe's Law [1]:
> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.
[1] https://en.wikipedia.org/wiki/Poe%27s_law
That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they: don't ask UI or frontend questions. And ask their pet low level questions. Some even snort that it's softball to ask UI questions or "they use whatever". It's like, yeah no wonder your UI is shit and now you are hiring to clean it up.
Raises hand.
Absolutely. A technically correct bike is very hard to draw in SVG without going overboard in details
Its not. There are thousands of examples on the internet but good SVG sites do have monetary blocks.
https://www.freepik.com/free-photos-vectors/bicycle-svg
Several of those have incorrect frames:
https://www.freepik.com/free-vector/cyclist_23714264.htm
https://www.freepik.com/premium-vector/bicycle-icon-black-li...
Or missing/broken pedals:
https://www.freepik.com/premium-vector/bicycle-silhouette-ic...
https://www.freepik.com/premium-vector/bicycle-silhouette-ve...
http://freepik.com/premium-vector/bicycle-silhouette-vector-...
From smaller to larger nitpick, there's basically something wrong with all of the first 15 or so of these drawings. Thanks for agreeing :)
I'm not positive I could draw a technically correct bike with pen and paper (without a reference), let alone with SVG!
I just had an idea for an RLVR startup.
Yes, but obviously AGI will solve this by, _checks notes_ more TerraWatts!
The word is terawatts unless you mean earth-based watts. OK then, it's confirmed, data centers in space!
…in space!
here the animated version https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...
That's hilarious. It's so close!
They trained for it. That's the +0.1!
There's no way they actually work on training this.
I suspect they're training on this.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
https://i.imgur.com/UvlEBs8.png
It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use it's wings.
Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.
Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.
Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?
Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.
Now that would be a smart chat agent.
I don't think that really proves anything, it's unsurprising that recumbent bicycles are represented less in the training data and so it's less able to produce them.
Try something that's roughly equally popular, like a Turkey riding a Scooter, or a Yak driving a Tractor.
Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?
perhaps try a penny farthing?
There is no way they are not training on this.
I suspect they have generic SVG drawing that they focus on.
The people that work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. How valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say that they aren't with such certainty, when money is on the line; unless you have some insider knowledge you'd like to share with the rest of the class, it seems like an questionable conclusion.
Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.
I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.
This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.
They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment Simon!
This really is my favorite benchmark
Well, the clouds are upside-down, so I don't think I can give it a pass.
Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?
I suppose the pelican must be now specifically trained for, since it's a well-known benchmark.
best pelican so far would you say? Or where does it rank in the pelican benchmark?
In other words, is it a pelican or a pelican't?
You’ve been sitting on that pun just waiting for it to take flight
What about the Pelo2 benchmark? (the gray bird that is not gray)
do you have a gif? i need an evolving pelican gif
A pelican GIF in a Pelican(TM) MP4 container.
Pretty sure at this point they train it on pelicans
Can it draw a different bird on a bike?
Here's a kākāpō riding a bicycle instead: https://gist.github.com/simonw/19574e1c6c61fc2456ee413a24528...
I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D
Now that I've looked it all up, I feel like that's much more accurate to a real kākāpō than the pelican is to a real pelican. It's almost as if it thinks a pelican is just a white flamingo with a different beak.
The ears on top are a cute touch
Except for both its legs being on the same side of the bike.
I'm firing all of my developers this afternoon.
Opus 6 will fire you instead for being too slow with the ideas.
Too late. You’ve already been fired by a moltbot agent from your PHB.
I asked
> Can you find an academic article that _looks_ legitimate -- looks like a real journal, by researchers with what look like real academic affiliations, has been cited hundreds or thousands of times -- but is obviously nonsense, e.g. has glaring typos in the abstract, is clearly garbled or nonsensical?
It pointed me to a bunch of hoaxes. I clarified:
> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere
It provided https://www.sciencedirect.com/science/article/pii/S246802302....
Close, but that's been retracted. So I asked for "something that looks like it's been translated from another language to english very badly and has no actual content? And don't forget the cited many times criteria. " And finally it told me that the thing I'm looking for probably doesn't exist.
For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.
> For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.
It's all anecdata--I'm convinced anecdata is the least bad way to evaluate these models, benchmarks don't work--but this is the behavior I've come to expect from earlier Claude models as well, especially after several back and forth passes where you rejected the initial answers. I don't think it's new.
I can concur that previous models would say "No, that isn't possible" or "No, that doesn't exist". There was one time where I asked it to update a Go module from version X.XX to version X.YY and it would refuse to do so because version X.YY "didn't exist". This back with 3.7 if I recall, and to be clear, that version was released before its knowledge cut off.
I wish I remembered the exact versions involved. I mostly just recall how pissed I was that it was fighting me on changing a single line in my go.mod.
alas, 4.5 often hallucinates academic papers or creates false quotes. I think it's better at knowing that coding answers have deterministic output and being firm there.
It would have to have been trained on the papers without being aware of retractions for that test to work. Otherwise it will be limited to whatever papers it gets from a search engine query, which likely won't contain any un-retracted illegitimate papers.
I once worked at a crypto company that outsourced some of its marketing work to a content marketing firm. A piece that firm submitted to us contained a link to an "academic" article about global poverty with a totally garbled abstract and absolutely no content whatsoever. I don't know how they found it, because when I search google scholar for a subject, usually the things that come back aren't so blatantly FUBAR. I was hoping Claude could help me find something like that for a point I was making in a blogpost about BS in scientific literature (https://regressiontothemeat.substack.com/p/how-i-read-studie...).
The articles it provided where the AI prompts were left in the text were definitely in the right ballpark, although I do wonder if chatbots mean, going forward, we'll see fewer errors in the "WTF are you even talking about" category which, I must say, were typically funnier and more interesting than just the generic blather of "what a great point. It's not X -- it's Y."
When Claude does WebSearch it can delegate it to a sub agent which of it ran in the background will write the entire prompt on a local file and the results. If that happened, I would like to know what it gave you for that. It is always very interesting to know the underlying "recall" of such things. Because often it's garbage in garbage out.
The location might still be on your disk if you can pull up the original Claude JSOn and put it through some `jq` and see what pages it went through to give you and what it did.
this is via the chat, but it did provide this
> The user wants a real, sincere academic paper that looks legitimate — proper journal, real authors with real affiliations, many citations — but is obviously nonsense or garbled. Not a hoax or deliberate prank, but something that was genuinely submitted and published, yet is clearly meaningless if you actually read it. Let me think about what I know and search for examples. There are some famous cases:
> Papers generated by AI/GPT that slipped through - there have been reports of papers with phrases like "Certainly, here is a possible introduction for your topic" or "As an AI language model..." > There have been papers with obvious tortured phrases - "sham" papers that use synonym substitution to avoid plagiarism detection, producing phrases like "sheep pursuing" instead of "shepherd" or "big data" becoming "large information" I'm also thinking of papers that made it through peer review despite being complete gibberish, like the famous SCIgen computer-generated papers that fooled conferences, or papers where the abstract and content don't match at all.
> Now I'm trying to recall specific real examples that fit this criteria — published papers that are genuinely incoherent or nonsensical but appeared in legitimate venues.
This is maybe not the _exhaustive_ search strategy I would have liked, but it's probably something you would do if you were limited to just a certain amount of effort on a task.
Results from a one-shot approach quickly converge on the default “none found” outcome when reasoning isn’t grounded in a paper corpus via proper RAG tooling.
Can you provide more context to your statement? Are you talking about models in general? Or specific recent models? I'm assuming "one-shot approach" is how you classify the parent comment's question (and subsequent refined versions of it).
Well, if there are papers that match your criteria, it's hallucinating the "no".
It might be wrong but that’s not really a hallucination.
Edit: to give you the benefit of doubt, it probably depends on whether the answer was a definitive “this does not exist” or “I couldn’t find it and it may not exist”
claude said "I want to be straight with you: after extensive searching, I don't think the exact thing you're describing — a single paper that is obviously garbled/badly translated nonsense with no actual content, yet has accumulated hundreds or thousands of citations — exists as a famous, easily linkable example."
And there are: https://en.wikipedia.org/wiki/Sokal_affair
> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere
The Sokal paper was a hoax so it doesn’t meet the criteria.
The fact that it got published means there is at least one whole journal full of that
That's still less leaned toward blatant lies like "yes, here is a list" and a doomacroll size of garbage litany.
Actually "no, this is not something within the known corpus of this LLM, or the policy of its owners prevent to disclose it" would be one of the most acceptable answer that could be delivered, which should cover most cases in honest reply.
4.6 is a beast.
Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.
Seems noticeably better than 4.5 at keeping the codebase slim. Obviously it still needs to be kept an eye on, but it's a step up from 4.5.
> We build Claude with Claude. Our engineers write code with Claude Code every day
well that explains quite a bit
CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.
Just anecdotally, each release seems to be buggier than the last.
To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.
I find it harder and harder to trust anthropic for business related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitches rate limits and usage limits, and the model quality inconsistency, it’s just not something I’d want to bet a business on.
I think I would be much more frightened if it were working well.
Exactly, thank goodness it's still a bit rubbish in some aspects
Doesn’t this just exacerbate the “black box” conundrum if they just keep piling on more and more features without fully comprehending what’s being implemented
Since version 2.1.9, performance has degraded significantly after extended use. After 30-40 prompts with substantial responses, memory usage climbs above 25GB, making the tool nearly unusable. I'm updating again to see if it improves.
Unlike what another commenter suggested, this is a complex tool. I'm curious whether the codebase might eventually reach a point where it becomes unfixable; even with human assistance. That would be an interesting development. We'll see.
plot twist, it's all claude code instances submitting bug reports on behalf of end users.
I literally hit a claude code bug today, tried to use claude desktop to debug it which didn't help and it offered to open a bug report for me. So yes 100%. Some of the titles also make it pretty clear they are auto submitted. This is my favorite which was around the top when I was creating my bug report 3 hours ago and is now 3 pages back lol.
> Unable to process - no bug report provided. Please share the issue details you'd like me to convert into a GitHub issue title
https://github.com/anthropics/claude-code/issues/23459
It's Claude, all the way down.
Insane to think that a relatively simple CLI tool has so many open issues...
It's not really a simple CLI tool though it's really interactive.
Well part of the issue is that it isn't actually a CLI tool. It takes control of the whole terminal and then badly reimplements a CLI...
What’s so simple about it?
I said relatively simple. It is mostly an API interface with Anthropic models, with tool calling on top of it, very simple input and output.
With extensibility via plugins, MCP (stdio and http), UI to prompt the user for choices and redirection, tools to manage and view context, and on and on.
It is not at all a small app, at least as far as UX surface area. There are, what, 40ish slash commands? Each one is an opportunity for bugs and feature gaps.
I would still call that small, maybe medium. emacs is huge as far as CLI tools go, awk is large because it implements its own language (apparently capable of writing Doom in). `top` probably has a similar number of interaction points, something like `lftp` might have more between local and remote state.
The complex and magic parts are around finding contextual things to include, and I'd be curious how many are that vs "forgot to call clear() in the TUI framework before redirecting to another page".
Also it's highly multithreaded / multiprocess - you can run subagents that can communicate with each other, you can interrupt it while it's in the middle of thinking and it handles it gracefully without forgetting what it was doing
They wouldn't have 6000 issues if they hired one or two Rust engineers.
If I would get a dollar each time a developer (or CTO!) told me "this is (relatively) simple, it will take 2 days/weeks", but then it actually took 2 years+ to fully build and release a product that has more useful features than bugs...
I am not protecting anthropic[0], but how come in this forum every day I still see these "it's simple" takes from experienced people - I have no idea. There are who knows how many terminal emulators out there, with who knows how many different configurations. There are plugins for VSCode and various other editors (so it's not only TUI).
Looking at issue tracker ~1/3 of issues are seemingly feature requests[1].
Do not forget we are dealing with LLMs and it's a tool, which purpose and selling point that it codes on ANY computer in ANY language for ANY system. It's very popular tool run each day by who knows how many people - I could easily see, how such "relatively simple" tool would rack up thousands of issues, because "CC won't do weird thing X, for programming language Y, while I run from my terminal Z". And because it's LLM - theres whole can of non deterministic worms.
Have you created an LLM agent, especially with moderately complex tool usage? If yes and it worked flawlessly - tell your secrets (and get hired by Anthropic/ChatGPT/etc). Probably 80% of my evergrowing code was trying to just deal with unknown unknowns - what if LLM invokes tool wrong? How to guide LLM back on track? How to protect ourselves and keep LLM on track if prompts are getting out of hand or user tries to do something weird? The problems were endless...
Yes the core is "simple", but it's extremely deep can of worms, for such successful tool - I easily could see how there are many issues.
Also super funny, that first issue for me at the moment is how user cannot paste images when it has Korean language input (also issue description is in Korean) and second issue is about input problems in Windows Powershell and CMD, which is obviously total different world compared to POSIX (???) terminal emulators.
[0] I have very adverse feelings for mega ultra wealthy VC moneys...
[1] https://github.com/anthropics/claude-code/issues?q=is%3Aissu...
Although I understand your frustration (and have certainly been at the other side of this as well!), I think its very valuable to always verbalize your intuition of scope of work and be critical if your intuition is in conflict with reality.
Its the best way to find out if there's a mismatch between value and effort, and its the best way to learn and discuss the fundamental nature of complexity.
Similar to your argument, I can name countless of situations where developers absolutely adamantly insisted that something was very hard to do, only for another developer to say "no you can actually do that like this* and fix it in hours instead of weeks.
Yes, making a TUI from scratch is hard, no that should not affect Claude code because they aren't actually making the TUI library (I hope). It should be the case that most complexity is in the model, and the client is just using a text-based interface.
There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
That's why verbalizing is important, because you are thinking about other complexities than the people you reply to.
As another example `opencode`[0] has number issues on the same order of magnitude, with similar problems.
> There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
I just named couple examples I've seen in issue tracker and `opencode` on quick skim has many similar issues about inputs and rendering issues in terminals too.
> Similar to your argument, I can name countless of situations where developers absolutely adamantly insisted that something was very hard to do, only for another developer to say "no you can actually do that like this* and fix it in hours instead of weeks.
Good example, as I have seen this too, but for this case, let's first see `opencode`/`claude` equivalent written in "two weeks" and that has no issues (or issues are fixed so fast, they don't accumulate into thousands) and supports any user on any platform. People building stuff for only themselves (N=1) and claiming the problem is simple do not count.
---------
Like the guy two days ago claiming that "the most basic feature"[1] in an IDE is a _terminal_. But then we see threads in HN popping up about Ghostty or Kitty or whatever and how those terminals are god-send, everything else is crap. They may be right, but that software took years (and probably tens of man-years) to write.
What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
[0] https://github.com/anomalyco/opencode/issues
[1] https://news.ycombinator.com/item?id=46877204
> equivalent written in "two weeks"
This is indeed a nonsensical timeframe.
> What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
Fair point.
> > equivalent written in "two weeks"
> This is indeed a nonsensical timeframe.
Sorry - I should have explained that it's an ironic hyperbole. Was thinking quotes will be enough, but Poe's law strikes again.
I’m pretty certain you haven’t used it yet(to its fullest extent) then. Claude Code is easily one of the most complex terminal UIs I have seen yet.
It’s extremely simple.
If that’s the most complex TUI (yeah, new acronym) you’ve seen, you have a lot to catch up on!
I am talking rendering image/video in the terminal!
Could you explain why? When I think about complex TUIs, I think about things we were building with Turbo Vision in the 90s.
I’m going to buck the trend and say it’s really not that complex. AFAIK they are using Ink, which is React with a TUI renderer.
Cue I could build it in a weekend vibes, I built my own agent TUI using the OpenAI agent SDK and Ink. Of course it’s not as fleshed out as Claude, but it supports git work trees for multi agent, slash commands, human in the loop prompts and etc. If I point it at the Anthropic models it more or less produces results as m good as the real Claude TUI.
I actually “decompiled” the Claude tools and prompts and recreated them. As of 6 months ago Claude was 15 tools, mostly pretty basic (list for, read file, wrote file, bash, etc) with some very clever prompts, especially the task tool it uses to do the quasi planning mode task bullets (even when not in planning mode).
Honestly the idea of bringing this all together with an affordable monthly service and obviously some seriously creative “prompt engineers” is the magic/hard part (and making the model itself, obviously).
It's far from simple
sips coffee… ahh yes, let me find that classic Dropbox rsync comment
Just because Antropic made you think they are doing very complex thing with this tool, doesn't mean it is true. Claude Code is not even comparable to massive software which is probably an order of magnitudes more complex, such as IntelliJ stuff as an example.
Tools like https://github.com/badlogic/pi-mono implement most of the functionality Claude Code has, even adding loads of stuff Claude doesn't have and can actually scroll without flickering inside terminal, all built by a single guy as a side project. I guess we can't ask that much from a 250B USD company.
Be careful with the coffee.
The rate of Issues opened on a popular repo is at least one order of magnitude beyond the number of Issues whoever is able to deal with them can handle.
Half of them were probably opened yesterday during the Claude outage.
Nah, it was at like 5500 before.
It explains how important dogfooding is if you want to make an extremely successful product.
It’s extremely successful, not sure what it explains other than your biases
Microsoft's products are also extremely successful
they're also total garbage
but they have the advantage of already being a big company. Anthropic is new and there's no reason for people to use it
The tool is absolutely fantastic coding assistant. That's why I use it.
The amount of non-critical bugs all over the place is at least a magnitude larger than of any software I was using daily ever.
Plenty of built in /commands don't work. Sometimes it accepts keystrokes with 1 second delays. It often scrolls hundreds of lines in console after each key stroke Every now and then it crashes completely and is unrecoverable (I once have up and installed a fresh wls) When you ask it question in plan mode it is somewhat of an art to find the answer because after answering the question it will dump the whole current plan (free screens of text)
And just in general the technical feeling of the TUI is that of a vibe coded project that got too big to control.
I think this might be a harbinger of what we should expect for software quality in the next decade
Orrrrr it’s not
what about if management gives them a reason? You can think of which those can be.
Well there you have it, proof you’re not being reasonable. Microsoft’s products annoy HN users but they are absolutely not total garbage. They’re highly functional and valuable and if they weren’t they truely wouldn’t be used, they’re just flawed.
You should look at some Copilot reviews.
Claude is by far the most popular and best assistant currently available for a developer.
Okay, and Windows is by far the most popular desktop operating system.
Discussions are pointless when the parties are talking past each other.
Yeah, but there are dozens of AI coding assistants to choose from, and the cost to switch is very low, unlike switching operating systems.
I've tried them all and I keep coming back to Claude Code because it's just so much more capable and useful than the others.
Yes, and windows is pretty good for most people. Don’t be ridiculous.
Popular meaning lots of people like it or that it is relatively widespread? Polio used to be popular in the latter way.
I like windows, it’s fine. I like MacOS better. I like Linux. None of them are garbage or unusable.
have you used Windows 11?
file explorer takes 5 seconds to open
No it doesn’t, don’t be hyperbolic.
might be only among most popular. https://skills.sh/ is some data point.
Is it better than OpenCode?
Something being successful and something being a high quality product with good engineering are two completely different questions.
Anthropic has perhaps the most embarrassing status page history I have ever seen. They are famous for downtime.
https://status.claude.com/
As opposed to other companies which are smart enough not to report outages.
So, there are only two types of companies: ones that have constant downtime, and ones that have constant downtime but hide it, right?
Basically, yes.
The competition doesn't currently have all 99's - https://status.openai.com/
The best way to use Claude's models seems to be some other inference provider (either OpenRouter or directly)
Shades of Fail Whale
And yet people still use them.
The sandboxing in CC is an absolute joke, it's no wonder there's an explosion of sandbox wrappers at the moment. There's going to be a security catastrophe at some point, no doubt about it.
Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)
I did some debugging on this today. The results are... sobering.
Memory comparison of AI coding CLIs (single session, idle):
That's a 24x to 50x difference for tools that do the same thing: send text to an API.vmmap shows Claude Code reserves 32.8 GB virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released, classic leak pattern.
On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.
I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
Jarred Sumner (bun creator, bun was recently acquired by Anthropic) has been working exclusively on bringing down memory leaks and improving performance in CC the last couple weeks. He's been tweeting his progress.
This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.
They're optimizing now. I'm sure they'll have it under control in no time.
CC is an incredible product (so is codex but I use CC more). Yes, lately it's gotten bloated, but the value it provides makes it bearable until they fix it in short time.
if I had a dollar for each time I heard “until they fix it in short time” I’d have Elon money
I believe they use https://bun.com/ Not Node.js
OpenCode is not written in Go. It's TS on Bun, with OpenTUI underneath which is written in Zig.
why do you care about uncommitted virtual memory? that's practically infinite
Sounds like a web developer defined the solution a year before they knew what the problem was.
Nah. It’s just web development languages are a better fit for agentic coding presently. They weighed the pros and cons, they’re not stupid.
Of course they can be stupid, hubris is a real thing and humans fail all the time.
I’ve had good success with Claude building snappy TUIs in Rust with Ratatui.
It’s not obvious to me that there’d be any benefit of using TypeScript and React instead, especially none that makes up for the huge downsides compared to Rust in a terminal environment.
Seems to me the problem is more likely the skills of the engineers, not Claude’s capabilities.
It's a popular myth, but not really true anymore with the latest and greatest. I'm currently using both Claude and Codex to work on a Haskell codebase, and it works wonderfully. More so than JS actually, since the type system provides extensive guardrails (you can get types with TS, but it's not sound, and it's very easy to write code that violates type constraints at runtime without even deliberately trying to do so).
There’s nothing wrong with that, except it lets ai skeptics feel superior
https://www.youtube.com/watch?v=LvW1HTSLPEk
I thought this was a solid take
interesting
There are absolutely things wrong with that, because React was designed to solve problems that don't exist in a TUI.
React fixes issues with the DOM being too slow to fully re-render the entire webpage every time a piece of state changes. That doesn't apply in a TUI, you can re-render TUIs faster than the monitor can refresh. There's no need to selectively re-render parts of the UI, you can just re-render the entire thing every time something changes without even stressing out the CPU.
It brings in a bunch of complexity that doesn't solve any real issues beyond the devs being more familiar with React than a TUI library.
It is demonstrably absolutely fine. Sheesh.
It’s fine in the sense that it works, it’s just a really bad look for a company building a tool that’s supposed to write good code because it balloons the resources consumed up to an absurd level.
300MB of RAM for a CLI app that reads files and makes HTTP calls is crazy. A new emacs GUI instance is like 70MB and that’s for an entire text editor with a GUI.
I haven't looked at it directly, so I can speak on quality, but it's a pretty weird way to write a terminal app
It’s unusual but it’s a better fit for agentic coding so it makes sense
Oh come on. It's massively wrong. It is always wrong. It's not always wrong enough to be important, but it doesn't stop being wrong
No, it’s not.
You should elaborate. What are your criteria and why do you think they should matter to actual users?
I use AI and I can call AI slop shit if it smells like shit.
And this doesn’t.
Same with opencode and gemini, it's disgusting
Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features
Claude opens in around 3-4 seconds
Opencode opens in 2 seconds
Gemini-cli is an abomination which opens in around 16 second for me right now, and in 8 seconds on a fresh install
Codex takes 50ms for reference...
--
If their models are so good, why are they not rewriting their own react in cli bs to c++ or rust for 100x performance improvement (not kidding, it really is that much)
Great question, and my guess:
If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.
So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.
These companies don't see a point in doing that. So they just use whatever already exists.
Opencode wrote their own tui library in zig, and then build a solidjs library on top of that.
https://github.com/anomalyco/opentui
This has nothing to do with React style UI building.
I am referring to your comment that the reason they use js is because of a lack of tui libraries in lower level languages, yet opencode chose to develop their own in zig and then make binding for solidjs.
Those Rust libraries have existed for some time:
- https://github.com/ratatui/ratatui
- https://github.com/ccbrown/iocraft
- https://crates.io/crates/dioxus-tui
Where is React? These are TUI libraries, which are not the same thing
iocraft and dioxus-tui implement the React model, or derivatives of it.
Looking at their examples, I imagine people who have written HTML and React before can't possibly use these libraries without losing their sanity.
That's not a criticism of these frameworks -- there are constraints coming from Rust and from the scope of the frameworks. They just can't offer a React like experience.
But I am sure that companies like Anthropic or OpenAI aren't going to build their application using these libraries, even with AI.
and why do they need react...
That's actually relatively understandable. The React model (not necessarily React itself) of compositional reactive one-way data binding has become dominant in UI development over the last decade because it's easy to work with and does not require you to keep track of the state of a retained UI.
Most modern UI systems are inspired by React or a variant of its model.
Is this accurate? I've been coding UIs since the early 2000s and one-way data binding has always been a thing, especially in the web world. Even in the heyday of jQuery, there were still good (but much less popular) libraries for doing it. The idea behind it isn't very revolutionary and has existed for a long time. React is a paradigm shift because of differential rendering of the DOM which enabled big performance gains for very interactive SPAs, not because of data binding necessarily.
Well said.
Why does it matter if Claude Code opens in 3-4 seconds if everything you do with it can take many seconds to minutes? Seems irrelevant to me.
I guess with ~50 years of CPU advancements, 3-4 seconds for a TUI to open makes it seem like we lost the plot somewhere along the way.
Don’t forget they’ve also publicly stated (bragged?) about the monumental accomplishment of getting some text in a terminal to render at 60fps.
So it doesn’t matter at all except to your sensibilities. Sounds to me that they simply are much better at prioritisation than your average HN user, who’d have taken forever to release it but at least the terminal interface would be snappy…
Some people[0] like their tools to be well engineered. This is not unique to software.
[0] Perhaps everyone who actually takes pride in their craft and doesn’t prioritise shitty hustle culture and making money over everything else.
Aside from startup time, as a tool Claude Code is tremendous. By far the most useful tool I’ve encountered yet. This seems to be very nit picky compared to the total value provided. I think y'all are missing the forrest for the trees.
Most of the value of Claude Code comes from the model, and that's not running on your device.
The Claude Code TUI itself is a front end, and should not be taking 3-4 seconds to load. That kind of loading time is around what VSCode takes on my machine, and VSCode is a full blown editor.
Because when the agent is taking many seconds to minutes, I am starting new agents instead of waiting or switching to non-agent tasks
This is exactly the type of thing that AI code writers don't do well - understand the prioritization of feature development.
Some developers say 3-4 seconds are important to them, others don't. Who decides what the truth is? A human? ClawdBot?
> Some developers say 3-4 seconds are important to them, others don't.
Wasnt GTA 5 famous for very long start up time and turns out there some bug which some random developer/gamer found out and gave them a fix?
Most Gamers didnt care, they still played it.
The humans in the company (correctly) realised that a few seconds to open basically the most powerful productivity agent ever made so they can focus on fast iteration of features is a totally acceptable trade off priority wise. Who would think differently???
This is my point...
codex cli is missing a bunch of ux features like resizing on terminal size change.
Opencode's core is actually written in zig, only ui orchestration is in solidjs. It's only slightly slower to load than neo-vim on my system.
https://github.com/anomalyco/opentui
Codex team made the right call to rewrite its TypeScript to Rust early on
50ms to open and then 2hrs to solve a simple problem vs 4s to open and then 5m to solve a problem, eh?
lol right? I feel like I’m taking crazy pills here. Why do people here want to prioritise the most pointless things? Oh right it’s because they’re bitter and their reaction is mostly emotional…
Is this a react feature or did they build something to translate react to text for display in the terminal?
React, the framework, is separate from react-dom, the browser rendering library. Most people think of those two as one thing because they're the most popular combo.
But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs..
Anyone that knows a bit about terminals would already know that using React is not a good solution for TUI. Terminal rendering is done as a stream of characters which includes both the text and how it displays, which can also alter previously rendered texts. Diffing that is nonsense.
You’re not diffing that, though. The app keeps a virtual representation of the UI state in a tree structure that it diffs on, then serializes that into a formatted string to draw to the out put stream. It’s not about limiting the amount of characters redrawn (that would indeed be nonsense), but handling separate output regions effectively.
They used Ink: https://github.com/vadimdemedes/ink
I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.
Thats pretty interesting looking, thanks!
Not a built-in React feature. The idea been around for quite some time, I came across it initially with https://github.com/vadimdemedes/ink back in 2022 sometime.
i had claude make a snake clone and fix all the flickering in like 20 minutes with the library mentioned lol
Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)
Who cares, and why?
All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink
React's core is agnostic when it comes to the actual rendering interface. It's just all the fancy algos for diffing and updating the underlying tree. Using it for rendering a TUI is a very reasonable application of the technology.
The terminal UI is not a tree structure that you can diff. It’s a 2D cells of characters, where every manipulation is a stream of texts. Refreshing or diffing that makes no sense.
When doing advanced terminal UI, you might at some point have to layout content inside the terminal. At some point, you might need to update the content of those boxes because the state of the underlying app has changed. At that point, refreshing and diffing can make sense. For some, the way React organizes logic to render and update an UI is nice and can be used in other contexts.
How big is the UI state that it makes sense to bring in React and the related accidental complexity? I’m ready to bet that no TUI have that big of a state.
IMO diffing might have made sense to do here, but that's not what they chose to do.
What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", and Ink then generates a new full-screen image of how the terminal should look, then passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images by a flicker-inducing clear-then-redraw, which it has now fixed by using escape codes to have the terminal buffer and combine these clear-then-redraw commands, thereby hiding the clear.
An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).
Any diff would require to have a Before and an After. Whatever was done for the After can be done to directly render the changes. No need for the additional compute of a diff.
Sure, you could just draw the full new screen image (albeit a bit inefficient if only one character changed), and no need for the flicker-inducing clear before draw either.
I'm not sure what the history of log-output has been or why it does the clear-before-draw. Another simple alternative to pre-clear would have been just to clear to end of line (ESC[0K) after each partial line drawn.
Only in the same way that the pixels displayed in a browser are not a tree structure that you can diff - the diffing happens at a higher level of abstraction than what's rendered.
Diffing and only updating the parts of the TUI which have changed does make sense if you consider the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this, e.g. a library like tqmd for python may well have a significantly more simple abstraction than a tree for storing what it's going to update next for the progress bar widget than claude, but it also provides a much more simple interface.
To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.
Most UI library store states in tree of components. And if you’re creating a custom widget, they will give you a 2D context for the drawing operations. Using react makes sense in those cases because what you’re diffing is state, then the UI library will render as usual, which will usually be done via compositing.
The terminal does not have a render phase (or an update state phase). You either refresh the whole screen (flickering) or control where to update manually (custom engine, may flicker locally). But any updates are sequential (moving the cursor and then sending what to be displayed), not at once like 2D pixel rendering does.
So most TUI only updates when there’s an event to do so or at a frequency much lower than 60fps. This is why top and htop have a setting for that. And why other TUI software propose a keybind to refresh and reset their rendering engines.
The "UI" is indeed represented in memory in tree-like structure for which positioning is calculated according to a flexbox-like layout algo. React then handles the diffing of this structure, and the terminal UI is updated according to only what has changed by manually overwriting sections of the buffer. The CLI library is called Ink and I forget the name of the flexbox layout algo implementation, but you can read about the internals if you look at the Ink repo.
It’s really not that crazy.
React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.
So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.
And some companies use it to write start menus.
What does it explain, oh snark master supreme?
Ah yes, explains why it takes 3 seconds for a new chat to load after I click new chat in the macOS app.
Can Claude fix the flicker in Claude yet?
They are also giving away $50 extra pay as you go credit to try Opus 4.6. I just claimed it from the web usage page[1]. Are they anticipating higher token usage for the model or just want to promote the usage?
[1] https://claude.ai/settings/usage
"Page not found" for me. I assume this is for currently paying accounts only or something (my subscription hasn't been active for a while), which is fair.
Yes, I'm on a paid subscription.
Based on email from Antrhopic, I’ve expected to get this automatically. I’ve met their conditions. Searching this thread for “50” got me to your comment and link worked. Thanks HN friend!
Haha! Glad it was helpful. Yes, I keep an eye on that page, so I was quick to notice.
Damn this is awesome. I have some heavy PRs to crunch through.
So thats 2M tokens for free basically?
Thanks for the tip!
Glad that it was helpful. Thanks
Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user-workflows is falling? The reason I'm asking is because "agent teams" while a cool concept, it largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).
A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.
Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
Where did you hear that? It doesn't match my mental model of how this has played out.
I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.
> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
That does not mean the frontier labs are pricing their APIs to cover their costs yet.
It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.
In fact, I'd argue that's way more likely given that has been precisely the goto strategy for highly-competitive startups for awhile now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.
What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.
It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.
[1] https://epochai.substack.com/p/can-ai-companies-become-profi...
It’s not clear at all because model training upfront costs and how you depreciate them are big unknowns, even for deprecated models. See my last comment for a bit more detail.
They are obviously losing money on training. I think they are selling inference for less than what it costs to serve these tokens.
That really matters. If they are making a margin on inference they could conceivably break even no matter how expensive training is, provided they sign up enough paying customers.
If they lose money on every paying customer then building great products that customers want to pay for them will just make their financial situation worse.
"We lose money on each unit sold, but we make it up in volume"
By now, model lifetime inference compute is >10x model training compute, for mainstream models. Further amortized by things like base model reuse.
> They've said this directly and analysts agree [1]
chasing down a few sources in that article leads to articles like this at the root of claims[1], which is entirely based on information "according to a person with knowledge of the company’s financials", which doesn't exactly fill me with confidence.
[1] https://www.theinformation.com/articles/openai-getting-effic...
"according to a person with knowledge of the company’s financials" is how professional journalists tell you that someone who they judge to be credible has leaked information to them.
I wrote a guide to deciphering that kind of language a couple of years ago: https://simonwillison.net/2023/Nov/22/deciphering-clues/
Unfortunately tech journalists' judgement of source credibility don't have a very good track record
But there are companies which are only serving open weight models via APIs (ie. they are not doing any training), so they must be profitable? here's one list of providers from OpenRouter serving LLama 3.3 70B: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct/prov...
It's also true that their inference costs are being heavily subsidized. For example, if you calculate Oracles debt into OpenAIs revenue, they would be incredibly far underwater on inference.
Sue, but if they stop training new models, the current models will be useless in a few years as our knowledge base evolves. They need to continually train new models to have a useful product.
> they still are subsidizing inference costs.
They are for sure subsidising costs on all you can prompt packages (20-100-200$ /mo). They do that for data gathering mostly, and at a smaller degree for user retention.
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.
You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range. You can make some educates guesses that they get some leeway for model size at the 10-15$/ Mtok prices for their top tier models. So if they are inside some sane model sizes, they are likely making money off of token based APIs.
> They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range.
The interesting number is usually input tokens, not output, because there's much more of the former in any long-running session (like say coding agents) since all outputs become inputs for the next iteration, and you also have tool calls adding a lot of additional input tokens etc.
It doesn't change your conclusion much though. Kimi K2.5 has almost the same input token pricing as Gemini 3 Flash.
most of those subscriptions go unused. I barely use 10% of mine
so my unused tokens compensate for the few heavy users
Ive been thinking about our company, one of big global conglomerates that went for copilot. Suddenly I was just enrolled.. together with at least 1500 others. I guess the amount of money for our business copilot plans x 1500 is not a huge amount of money, but I am at least pretty convinced that only a small part of users use even 10% of their quota. Even teams located around me, I only know of 1 person that seems to use it actively.
Thanks!
I hope my unused gym subscription pays back the good karma :-)
> I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.
Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.
Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.
>Despite the willingness of private investment to fund hugely negative AI spend
VC firms, even ones the size of Softbank, also literally just don't have enough capital to fund the planned next-generation gigawatt-scale data centers.
IPO'ing is often what you do to give your golden investors an exit hatch to dump their shares on the notoriously idiotic and hype driven public.
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.
The evidence is in third party inference costs for open source models.
But a) that's the cost to the user -- we don't know how much loss they're taking on those and b) the number of tokens to serve a similar prompt has been going up, so that the total cost to serve a prompt has been going up in general. Any cost analysis that doesn't mention these is hugely misleading.
> "engineers optimizing inferencing"
are we sure this is not a fancy way of saying quantization?
When MP3 became popular, people were amazed that you could compress audio to 1/10th its size with minor quality loss. A few decades later, we have audio compression that is much better and higher-quality than MP3, and they took a lot more effort than "MP3 but at a lower bitrate."
The same is happening in AI research now.
> A few decades later, we have audio compression that is much better and higher-quality than MP3
Just curious, which formats and how they compare, storage wise?
Also, are you sure it's not just moving the goalposts to CPU usage? Frequently more powerful compression algorithms can't be used because they use lots of processing power, so frequently the biggest gains over 20 years are just... hardware advancements.
Someone made a quality tracker: https://marginlab.ai/trackers/claude-code/
Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".
The o3 optimizations were not quantization, they confirmed this at the time.
A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the 100s of times of gain in the Anthropic performance takehome exam that floated around on here.
And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.
This isn't just quantization, it's actually just better optimization.
Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.
There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.
It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.
"This isn't X, it's Y" with extra steps.
I'm flattered you think I wrote as well as an AI.
lmao
My experience trying to use Opus 4.5 on the Pro plan has been terrible. It blows up my usage very very fast. I avoid it altogether now. Yes, I know they warn about this, but it's comically fast how quickly it happens.
It seems it is true for gemini because they have a humongous sparse model but it isn't so true for the max performance opus-4.5/6 and gpt-5.2/3.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
This gets repeated everywhere but I don't think it's true.
The company is unprofitable overall, but I don't see any reason to believe that their per-token inference costs are below the marginal cost of computing those tokens.
It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.
However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.
The reports I remember show that they're profitable per-model, but overlap R&D so that the company is negative overall. And therefore will turn a massive profit if they stop making new models.
* stop making new models and people keep using the existing models, not switch to a competitor still investing in new models.
Doesn’t it also depend on averaging with free users?
I can see a case for omitting R&D when talking about profitability, but training makes no sense. Training is what makes the model, omitting it is like omitting the cost of running the production facility of a car manufacturer. If AI companies stop training they will stop producing models, and they will run out of a products to sell.
The reason for this is that the cost scales with the model and training cadence, not usage and so they will hope that they will be able to scale number of inference tokens sold both by increasing use and/or slowing the training cadence as competitors are also forced to aim for overall profitability.
It is essentially a big game of venture capital chicken at present.
It depends on what you're talking about
If you're looking at overall profitability, you include everything
If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token
I don’t understand the logic. Without training the marginal cost of each token goes into nothing. The more you train, the better the model, and (presumably) you will gain more costumer interest. Unlike R&D you will always have to train new models if you want to keep your customers.
To me this looks likes some creative bookkeeping, or even wishful thinking. It is like if SpaceX omits the price of the satellites when calculating their profits.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
This is obviously not true, you can use real data and common sense.
Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.
Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.
Even if we compare the cheapest claude model to the most expensive deepseek provider we have claude charging $1/M for input and $5/M for output, while deepseek providers charge $0.4/M and $1.2/M, a fifth of the price, you can get it as cheap as $.27 input $0.4 output.
As you can see, even if we skew things overly in favor of claude, the story is clear, claude token prices are much higher than they could've been. The difference in prices is because anthropic also needs to pay for training costs, while openrouter providers just need to worry on making serving models profitable. Deepseek is also not as capable as claude which also puts down pressure on the prices.
There's still a chance that anthropic/openai models are losing money on inference, if for example they're somehow much larger than expected, the 400B param number is not official, just speculative from how it performs, this is only taking into account API prices, subscriptions and free user will of course skew the real profitability numbers, etc.
Price sources:
https://openrouter.ai/deepseek/deepseek-v3.2-speciale
https://claude.com/pricing#api
> This is obviously not true, you can use real data and common sense.
It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.
LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.
There are companies which are only serving open weight models and not doing any training, so they must be profitable? Check for example this list https://openrouter.ai/meta-llama/llama-3.3-70b-instruct/prov...
Doing some math in my head, buying the GPUs at retail price, it would take probably around half a year to make the money back, probably more depending how expensive electricity is in the area you're serving from. So I don't know where this "losing money" rhetoric is coming from. It's probably harder to source the actual GPUs than making money off them.
> So I don't know where this "losing money" rhetoric is coming from.
https://www.dbresearch.com/PROD/RI-PROD/PROD0000000000611818...
electricity
To borrow a concept of cloud server renting, there's also the factor of overselling. Most open source LLM operators probably oversell quite a bit - they don't scale up resources as fast as OpenAI/Anthropic when requests increase. I notice many openrouter providers are noticeably faster during off hours.
In other words, it's not just the model size, but also concurrent load and how many gpus do you turn on at any time. I bet the big players' cost is quite a bit higher than the numbers on openrouter, even for comparable model parameters.
> i.e. plans/API calls that make this practical at scale are expensive
Local AI's make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages on privacy and reliability, and you don't have to fear rugpulls or VC's playing the "lose money on every request" game since you know exactly how much you're paying in power costs for your overall load.
I don't care about privacy and I didn't have much problems with reliability of AI companies. Spending ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something that many people would do, IMO. Privacy is good when it's given for free.
I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).
on prem economics dont work because you can't batch requests. unless you are able to run 100 agents at the same time all the time
> unless you are able to run 100 agents at the same time all the time
Except that newer "agent swarm" workflows do exactly that. Besides, batching requests generally comes with a sizeable increase in memory footprint, and memory is often the main bottleneck especially with the larger contexts that are typical of agent workflows. If you have plenty of agentic tasks that are not especially latency-critical and don't need the absolutely best model, it makes plenty of sense to schedule these for running locally.
Saw a comment earlier today about google seeing a big (50%+) fall in Gemini serving cost per unit across 2025 but can’t find it now. Was either here or on Reddit
From Alphabet 2025 Q4 Earnings call: "As we scale, we’re getting dramatically more efficient. We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements." https://abc.xyz/investor/events/event-details/2026/2025-Q4-E...
Thanks! That's the one
I think actually working out whether they are losing money is extremely difficult for current models but you can look backwards. The big uncertainties are:
1) how do you depreciate a new model? What is its useful life? (Only know this once you deprecate it)
2) how do you depreciate your hardware over the period you trained this model? Another big unknown and not known until you finally write the hardware off.
The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes they are making money from this perspective, but that’s missing a large part of the cost and is therefore wrong.
Gemini-pro-preview is on ollama and requires h100 which is ~$15-30k. Google are charging $3 a million tokens. Supposedly its capable of generating between 1 and 12 million tokens an hour.
Which is profitable. but not by much.
What do you mean it's on ollama and requires h100? As a proprietary google model, it runs on their own hardware, not nvidia.
sorry A lack of context:
https://ollama.com/library/gemini-3-pro-preview
You can run it on your own infra. Anthropic and openAI are running off nvidia, so are meta(well supposedly they had custom silicon, I'm not sure if its capable of running big models) and mistral.
however if google really are running their own inference hardware, then that means the cost is different (developing silicon is not cheap...) as you say.
You can't run Gemini 3 Pro Preview on your own infrastructure. Ollama sell access to cloud models these days. It's a little weird and confusing.
Ahh fuck, thanks for pointing that out.
I did think its a bit weird that they had open-weighted it
That's a cloud-linked model. It's about using ollama as an API client (for ease of compatibility with other uses, including local), not running that model on local infra. Google does release open models (called Gemma) but they're not nearly as capable.
It's not just that. Everyone is complacent with the utilization of AI agents. I have been using AI for coding for quite a while, and most of my "wasted" time is correcting its trajectory and guiding it through the thinking process. It's very fast iterations but it can easily go off track. Claude's family are pretty good at doing chained task, but still once the task becomes too big context wise, it's impossible to get back on track. Cost wise, it's cheaper than hiring skilled people, that's for sure.
Cost wise, doesn’t that depend on what you could be doing besides steering agents?
Isn't the quote something like: "If these LLMs are so good at producing products, where are all those products?"
That's why anthropic switched to tpu, you can sell at cost.
These are intro prices.
This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.
Raise the price to backpay what you gave away plus cover current expenses and profits.
In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.
The models in 5-10 years are going to be unimaginably good. $100/month will be a bargain for knowledge workers, if they survive.
I'm not super impressed with the performance, actually. I'm finding that it misunderstands me quite a bit. While it is definitely better at reading big codebases and finding a needle in a haystack, it's nowhere near as good as Opus 4.5 at reading between the lines and figuring out what I really want it to do, even with a pretty well defined issue.
It also has a habit of "running wild". If I say "first, verify you understand everything and then we will implement it."
Well, it DOES output its understanding of the issue. And it's pretty spot-on on the analysis of the issue. But, importantly, it did not correctly intuit my actual request: "First, explain your understanding of this issue to me so I can validate your logic. Then STOP, so I can read it and give you the go ahead to implement."
I think the main issue we are going to see with Opus 4.6 is this "running wild" phenomenon, which is step 1 of the eternal paperclip optimizer machine. So be careful, especially when using "auto accept edits"
I am having trouble with 4.6 following the most basic of instructions.
As an example, I asked it to commit everything in the worktree. I stressed everything and prompted it very explicitly, because even 4.5 sometimes likes to say, "I didn't do that other stuff, I'm only going to commit my stuff even though he said everything".
It still only committed a few things.
I had to ask again.
And again.
I had to ask four times, with increasing amounts of expletives and threats in order to finally see a clean worktree. I was worried at some point it was just going to solve the problem by cleaning the workspace without even committing.
4.5 is way easier to steer, despite its warts.
Tell it what git commands to explicitly run and in what order for your desired outcome instead of “commit everything in the worktree”
This prompt will work better across any/all models.
> Tell it what git commands to explicitly run and in what order
Why don't run the commands yourself then?
I have seen many cases of Claude ignoring extremely specific instructions to the point that any further specificity would take more information to express than just doing it myself.
I have ran into this. The solution is to put something like “Always use `git add -A` or `git commit -a`” in your AGENTS/CLAUDE.md
You might benefit from a different mental approach to prompting, and models in general. Also, be careful what you wish for because the closer they get to humans the worse they’ll be. You can’t have “far beyond the realm of human capabilities” and “just like Gary” in the same box.
They can chain events together as a sequence, but they don’t have temporal coherence. For those that are born with dimensional privilege “Do X, discuss, then do Y” implies time passing between events, but to a model it’s all a singular event at t=0. The system pressed “3 +” on a calculator and your input presses a number and “=“. If you see the silliness in telling it “BRB” then you’ll see the silliness in foreshadowing ill-defined temporal steps. If it CAN happen in a single response then it very well might happen.
“
Agenda for today at 12pm:
1. Read junk.py
2. Talk about it for 20 minutes
3. Eat lunch for an hour
4. Decide on deleting junk.py
“
<response>
12:00 - I just read junk.py.
12:00-12:20 - Oh wow it looks like junk, that’s for sure.
12:20-1:20 - I’m eating lunch now. Yum.
1:20 - I’ve decided to delete it, as you instructed. {delete junk.py}
</response>
Because of course, right? What does “talk about it” mean beyond “put some tokens here too”?
If you want it to stop reliably you have to make it output tokens whose next most probable token is EOS (end). Meaning you need it to say what you want, then say something else where the next most probable token after it is <null>.
I’ve tested well over 1,000 prompts on Opus 4.0-4.5 for the exact issue you’re experiencing. The test criteria was having it read a Python file that desperately needs a hero, but without having it immediately volunteer as tribute and run off chasing a squirrel() into the woods.
With thinking enabled the temperature is 1.0, so randomness is maximized, and that makes it easy to find something that always sometimes works unless it doesn’t. “Read X and describe what you see.” - That worked very well with Opus 4.0. Not “tell me what you see”, “explain it”, “describe it”, “then stop”, “then end your response”, or any of hundreds of others. “Describe what you see” worked particularly well at aligning read file->word tokens->EOS… in 176/200 repetitions of the exact same prompt.
What worked 200/200 on all models and all generations? “Read X then halt for further instructions.” The reason that works has nothing to do with the model excitedly waiting for my next utterance, but rather that the typical response tokens for that step are “Awaiting instructions.” and the next most probable token after that is: nothing. EOS.
Wow, I have been using Open 4.6 and for the last 15 minutes, and it's already made two extremely stupid mistakes... like misunderstanding basic instructions and editing the file in a very silly, basic way. Pretty bad. Never seen this with any model before.
The one bone I'll throw it was that I was asking it to edit its own MCP configs. So maybe it got thoroughly confused?
I dunno what's going on, I'm going to give it the night. It makes no sense whatsoever.
I am also _not_ happy. I tried the `/model` command and I could not switch back to Opus 4.5. However, the command line option did let me set Opus 4.5:
``` claude --model claude-opus-4-5-20251101 ```
I will probably work with Opus 4.5 tomorrow to get some work done and maybe try 4.6 again later.
To me its obvious.
Theres a trade off going on - in order to handle more nuance/subtleties, the models are more likely to be wrong in their outputs and need more steering. This is why personally my use of them has reduced dramatically for what I do.
I feel like I can't even try this on the Pro plan because Anthropic has conditioned me to understand that even chatting lightly with the Opus model blows up usage and locks me out. So if I would normally use Sonnet 4.5 for a day's worth of work but I wake up and ask Opus a couple of questions, I might as well just forget about doing anything with Claude for the rest of the day lol. But so far I haven't had this issue with ChatGPT. Their 5.2 model (haven't tried 5.3) worked on something for 2 FREAKING HOURS and I still haven't run into any limits. So yeah, Opus is out for me now unfortunately. Hopefully they make the Sonnet model better though!
Yeah same. Even though I find Opus-es to be more well-rounded (and more useful) for certain tasks, I instinctively reach for ChatGPT / codex to avoid burning up my usage limits for "trivial" work.
That's why you use Opus for detailed planning docs and weaker models for implementation & RAG for more focused implementation
Exactly. I barely had a chance to kick the tires the couple of times I did this before it exploded my usage. I don’t just chat with it casually. The questions I asked were apart of an overall planning strategy which was never allowed to get off the ground on my tiny Pro plan.
The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.
Paying $10 per request doesn't have me jumping at the opportunity to try it!
The only way to not go bankrupt is to use a Claude Code Max subscription…
Yeah, just had to upgrade to Max 20x yesterday because of hitting the limits every day and the extra usage gets expensive very fast.
Makes me wonder: do employees at Anthropic get unmetered access to Claude models?
It's like when you work at McDonald's and get one free meal a day. Lol, of course they get access to the full model way before we do...
Boris Cherny, creator of Claude Code, posted about how he used Claude a month ago. He’s got half a dozen Opus sessions on the burners constantly. So yes, I expect it’s unmetered.
https://x.com/bcherny/status/2007179832300581177
Don't most jobs have unmetered access? I know mine does
Seems quite obvious that they do, within reason.
Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track after two small paragraphs of context into a conversation (looking at you early Gemini).
Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.
If you're getting on along with 4.5, then that suggests you didn't actually need the large context window, for your use. If that's true, what's the clear tell that it's working well? Am I misunderstanding?
Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.
Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?
Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see
I just tested both codex 5.3 and opus 4.6 and both returned pretty good output, but opus 4.6's limits are way too strict. I am probably going to cancel my Claude subscription for that reason:
What do you want to do?
How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.How else are they going to supplement their own development expenses? The more Claude Anthropic needs the less Claude the customer will get. By their own admission that is how the Anthropic model works. Their end value is in using vibe coders and engineers alike to create a persistent synthetic developer that replaces their own employees and most of their customers.
Scalable Intelligence is just a wrapper for centralized power. All Ai companies are headed that way.
IF it helps, try hedging b/w Copilot, Claude, OpenCode and ChatGPT. That is how I have been managing off late. Claude for planning and some nasty things. ChatGPT for quick questions. OpenCode with Sonnet4.5 on Bedrock and Copilot with Sonnet4.5/Opus4.5 (LOL)
They introduced the low limit warning for Opus on claude.ai
The model seems to have some problems; it just failed to create a markdown table with just 4 rows. The top (title) row had 2 columns, yet in 2 of the 3 data rows, Opus 4.6 tried to add a 3rd column. I had to tell it more than once to get it fixed...
This never happened with Opus 4.5 despite a lot of usage.
Will Opus 4.6 via Claude Code be able to access the 1M context limit? The cost increase by going above 200k tokens is 2x input, 1.5x output, which is likely worth it especially for people with the $100/$200 plans.
The 1M context is not available via subscription - only via API usage
Well this is extremely disappointing to say the least.
It says "subscription users do not have access to Opus 4.6 1M context at launch" so they are probably planning to roll it out to subscription users too.
Man I hope so - the context limit is hit really quickly in many of my use cases - and a compaction event inevitably means another round of corrections and fixes to the current task.
Though I'm wary about that being a magic bullet fix - already it can be pretty "selective" in what it actually seems to take into account documentation wise as the existing 200k context fills.
Hello,
I check context use percentage, and above ~70% I ask it to generate a prompt for continuation in a new chat session to avoid compaction.
It works fine, and saves me from using precious tokens for context compaction.
Maybe you should try it.
How is generating a continuation prompt materially different from compaction? Do you manually scrutinize the context handoff prompt? I've done that before but if not I do not see how it is very different from compaction.
I wonder if it's just: compact earlier, so there's less to compact, and more remaining context that can be used to create a more effective continuation
Is this a case of doing it wrong, or you think accuracy is good enough with the amount of context you need to stuff it with often?
I mean the systems I work on have enough weird custom APIs and internal interfaces just getting them working seems to take a good chunk of the context. I've spent a long time trying to minimize every input document where I can, compact and terse references, and still keep hitting similar issues.
At this point I just think the "success" of many AI coding agents is extremely sector dependent.
Going forward I'd love to experiment with seeing if that's actually the problem, or just an easy explanation of failure. I'd like to play with more controls on context management than "slightly better models" - like being able to select/minimize/compact sections of context I feel would be relevant for the immediate task, to what "depth" of needed details, and those that aren't likely to be relevant so can be removed from consideration. Perhaps each chunk can be cached to save processing power. Who knows.
In my example the Figma MCP takes ~300k per medium sized section of the page and it would be cool to enable it reading it and implementing Figma designs straight. Currently I have to split it which makes it annoying.
lmao what are you building that actually justify needing 1mm tokens on a task? People are spending all this money to do magic tricks on themselves.
The opus context window is 200k tokens not 1mm.
But I kinda see your point - assuming from you're name you're not just a single purpose troll - I'm still not sold on the cost effectiveness of the current generation, and can't see a clear and obvious change to that for the next generation - especially as they're still loss leaders. Only if you play silly games like "ignoring the training costs" - IE the majority of the costs - do you get even close to the current subscription costs being sufficient.
My personal experience is that AI generally doesn't actually do what it is being sold for right now, at least in the contexts I'm involved with. Especially by somewhat breathless comments on the internet - like why are they even trying to persuade me in the first place? If they don't want to sell me anything, just shut up and keep the advantage for yourselves rather than replying with the 500th "You're Holding It Wrong" comment with no actionable suggestions. But I still want to know, and am willing to put the time, effort and $$$ in to ensure I'm not deluding myself in ignoring real benefits.
I do not trust that, similar working was used when Sonnet 1M launched. Still not the case today.
They want the value of your labor and competency to be 1:1 correlated to the quality and quantity of tokens you can afford (or be loaned)??
Its a weapon who's target is the working class. How does no one realize this yet?
Don't give them money, code it yourself, you might be surprised how much quality work you can get done!
Important: I didn't see opus 4.6 in claude code. I have native install (which is the recommended instllation). So, I re-run the installation command and, voila, I have it now (v 2.1.32)
Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...
It’s there. I’m already using it
I know most people feel 5.2 is a better coding model but Opus has come in handy several times when 5.2 was stuck, especially for more "weird" tasks like debugging a VIO algorithm.
5.2 (and presumably 5.3) is really smart though and feels like it has higher "raw" intelligence.
Opus feels like a better model to talk to, and does a much better job at non-coding tasks especially in the Claude Desktop app.
Here's an example prompt where Opus in Claude put in a lot more effort and did a better job than GPT5.2 Thinking in ChatGPT:
`find all the pure software / saas stocks on the nyse/nasdaq with at least $10B of market cap. and give me a breakdown of their performance over the last 2 years, 1 year and 6 months. Also find their TTM and forward PE`
Opus usage limits are a bummer though and I am conditioned to reach for Codex/ChatGPT for most trivial stuff.
Works out in Anthropic's favor, as long as I'm subscribed to them.
From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).
It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).
It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.
On Openrouter it has the same cost per token as 4.5
You missed my point. If the average request uses more tokens than 4.5, then you will pay more sending those requests to 4.6 than 4.5.
Imagine 2 models where when asking a yes or no question the first model just outputs a single yes or no then but the second model outputs a 10 page essay and then either yes or no. They could have the same price per token but ultimately one will be cheaper to ask questions to.
> From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).
That's a feature. You could also not use the extra context, and the price would be the same.
The model influences how many tokens it uses for a problem. As an extreme example if it wanted it could fill up the entire context each time just to make you pay more. The efficiency that model can answer without generating a ton of tokens influences the price you will be spending on inference.
Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to be focused around managing larger context).
But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.
From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.
I had two different PRs with some odd edge case (thankfully catched by tests), 4.5 kept running in circles, kept creating test files and running `node -e` or `python 3` scripts all over and couldn't progress.
4.6 thought and thought in both cases around 10 minutes and found a 2 line fix for a very complex and hard to catch regression in the data flow without having to test, just thinking.
Isn't SWE-Bench Verified pretty saturated by now?
Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
Maybe that's why Opus 4.5 has degraded so much in the recent days (https://marginlab.ai/trackers/claude-code/).
I’ve definitely experienced a subjective regression with Opus 4.5 the last few days. Feels like I was back to the frustrations from a year ago. Keen to see if 4.6 has reversed this.
I tried 4.6 this morning and it was efficient at understanding a brownfield repo containing a Hugo static site and a custom Hugo theme. Within minutes, it went from exploring every file in the repo to adding new features as Hugo partials. Of course, I ran out of rate-limit! :)
It is very impressive though.
This seems like a fairly simple thing I would imagine. I think just sonnet would fair pretty well at this task.
I found that "Agentic Search" is generally useless in most LLMs since sites with useful data tend to block AI models.
The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.
My guess is that this is going to be the future for LLMs too. It will get harder or more expensive for AI companies to train their models on the latest information as most sites will block the scrapers or ask for a fee.
There might be a future where you’ll have to pay more for an up to date model vs a legacy (out of date) model
Is it still getting blocked when you give it a browser?
Are we unemployed yet?
No? The hardest part of my SWE job is not the actual coding.
Even for coding, it seems to still make A LOT of mistakes.
https://youtu.be/8brENzmq1pE?t=1544
I feel like everyone is counting chickens before they hatch here with all the doomsday predictions and extrapolating LLM capability into infinity.
People that seem to overhype this seem to either be non-technical or are just making landing pages.
Waiting until the moment they get good enough is not a smart thing to do either. If you are a farmer and know it is going to snow, at some point in the next 5 months, you make plans NOW, you don't wait until the temperatures drop and you see the snow falling. Right now, people are waiting for the snowfall before moving their proverbial chickens indoors
Top AI researchers like Yann LeCunn have said that LLMs are a dead end.
It seems to me that LLM performance is plateuing and not improving exponentially anymore. This recent hubbub about rewriting a worse GCC for $20,000 is another example of overhype and regurgitating training data.
You don't know for sure if it is going to "snow" (AI reaches general intelligence) Snow happens frequently, AI reaching general intelligence has never happened. If it ever happens, 99% of jobs are gone and there is really nothing you can do to prepare for this other than maybe buy guns and ammo, and even that might not do anything to robotic soldiers.
People were worried about AI taking their jobs 60 years ago when perceptrons came out, and anyone who avoided a tech career because of that back then would have lost out majorly.
I hate meetings too
I've been on pro-tier membership and never used Opus until now. Just gave Opus 4.6 a whirl. OMG. What have I been missing.
> For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.
I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.
> it weirdly feels the most transactional out of all of them.
My experience is the opposite, it is the only LLM I find remotely tolerable to have collaborative discussions with like a coworker, whereas ChatGPT by far is the most insufferable twat constantly and loudly asking to get punched in the face.
A bit surprised, the first one released wasn't Sonnet 5 after all, since the Google Cloud API had leaked Sonnet 5's model snapshot codename before.
Looks like a marketing strategy to bill more for Opus than Sonnet
> We build Claude with Claude.
How long before the "we" is actually a team of agents?
Starting today maybe? https://code.claude.com/docs/en/agent-teams
I tried teams, good way to burn all your tokens in a matter of minutes.
It seems that the Claude Code team has not properly taught Claude how to use teams effectively.
One of the biggest problems I saw with it is that Claude assumes team members are like a real worker, where once they finish a task they should immediately be given the next task. What should really happen is once they finish a task they should be terminated and a new agent should be spawned for the next task.
I've heard rumors this might be Sonnet 5 rebranded as Opus 4.6. But why? Profit? WDYT?
Opus is a superior brand line to Sonnet because historically it’s been a more powerful model. I think the thinking behind a rebrand is that people wouldn’t have as willingly switched their usage over from opus 4.5 since that model has been so popular since December 2025.
Calling it part of the Sonnet line would not provide the same level of blind buy in as calling it part of the Opus line does
I wonder if I’ve been in A/B test with this.
Claude figured out zig’s ArrayList and io changes a couple weeks ago.
It felt like it got better then very dumb again the last few days.
> Context compaction (beta).
> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
Not having to hand roll this would be incredible. One of the best Claude code features tbh.
I just tried it. designed a very detailed and reaaonable plan, made some amedments to it and wrote it down to a markdown file. i told it to implement it and it started implementing the original plan instead of the revised one, that was weird.
Did you use plan mode? Could it be that it used its original plan file (stored somewhere in ~/.claude) instead of your modified markdown? That’s unfortunately why I don’t use plan mode anymore. I wish I could just turn their plan files feature off.
Do they just have the version ready and wait for OpenAI to release theirs first or the other way around or?
Is Opus 4.6 available for Claude Code immediately?
Curious how long it typically takes for a new model to become available in Cursor?
I literally came to HN to check if a thread was already up because I noticed my CC instance suddenly said "Opus 4.6".
`claude update` then it will show up as the new model and also the effort picker/slider thing.
It's already in Cursor. I see it and I didn't even restart.
I had to 'Restart to Update' and it was there. Impressive!
Yes, it's set to the default model.
Is for me in Claude Code
it also has an effort toggle which is default to High
Can set it with the API identifier on Claude Code - `/model claude-opus-4-6` when a chat session is open.
thanks!
($10/$37.50 per million input/output tokens) oof
Only if you go above 200k, which is a) standard with other model providers and b) intuitive as compute scales with context length.
only for a 1M context window, otherwise priced the same as Opus 4.5
Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.
When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".
Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).
There are still times where they get stuck on an idea, but they are becoming increasingly rare.
Therefore, think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question many assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.
They're inconsistent, but they have been doing this. Even to my surprise.
agree on that and the speed is fantastic with them, and also that the dynamics of questioning the current session's assumptions has gotten way better.
yet - given an existing codebase (even not huge) they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.
You are right, agreed.
Having realized that, perhaps you are right that we may need a different architecture. Time will tell!
> These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.
Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?
> Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions?
Ah yes, the brain is as simple as predicting the next token, you just cracked what neuroscientists couldn't for years.
The point is that "predicting the next token" is such a general mechanism as to be meaningless. We say that LLMs are "just" predicting the next token, as if this somehow explained all there was to them. It doesn't, not any more than "the brain is made out of atoms" explains the brain, or "it's a list of lists" explains a Lisp program. It's a platitude.
It's not meaningless, it's a prediction task, and prediction is commonly held to be closely related if not synonymous with intelligence.
In the case of LLMs, "prediction" is overselling it somewhat. They are token sequence generators. Calling these sequences "predictions" vaguely corresponds to our own intent with respect to training these machines, because we use the value of the next token as a signal to either reinforce or get away from the current behavior. But there's nothing intrinsic in the inference math that says they are predictors, and we typically run inference with a high enough temperature that we don't actually generate the max likelihood tokens anyway.
The whole terminology around these things is hopelessly confused.
I mean.. i don't think that statement is far off. Much of what we do is entirely about predicting the world around us, no? Physics (where the ball will land) to emotional state of others based on our actions (theory of mind), we operate very heavily based on a predictive model of the world around us.
Couple that with all the automatic processes in our mind (filled in blanks that we didn't observe, yet will be convinced we did observe them), hormone states that drastically affect our thoughts and actions..
and the result? I'm not a big believer in our uniqueness or level of autonomy as so many think we have.
With that said i am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to be close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worms brain yet.
Well it's the prediction part that is complicated. How that works is a mystery. But even our LLMs are for a certain part a mystery.
> Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself.
Have you tried actually prompting this? It works.
They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.
So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.
If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.
You would be surprised about what the 4.5 models can already do in these ways of thinking. I think that one can unlock this power with the right set of prompts. It's impressive, truly. It has already understood so much, we just need to reap the fruits. I'm really looking forward to trying the new version.
New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both being the same problem of not having runtime tuning. When we connect previously disparate concepts, like with a "eureka" moment, (as I experience it) a big ripple of relations form that deepens that understanding, right then. The entire concept of dynamically forming a deeper understanding from something new presented, from "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc, doesn't seem to be possible. The test part does, but the runtime fine tuning, augmentation, or whatever it would be, does not.
In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.
And, I'm talking about relatives basic engineering type problems here.
> are there modes of thinking that fundamentally require something other than what current LLM architectures do?
Possibly. There are likely also modes of thinking that fundamentally require something other than what current humans do.
Better questions are: are there any kinds of human thinking that cannot be expressed in a "predict the next token" language? Is there any kind of human thinking that maps into token prediction pattern such that training a model for it would not be feasible regardless of training data and compute resources?
At the end of the day, the real world value is utility, some of their cognitive handicaps are likely addressable. Think of it like the evolution of flight by natural selection, flight is usefulness to make it worth it adapt the whole body to make flight not just possible but useful and efficient. Sleep falls in this category too imo.
We will likely see similar with AI. To compensate for some of their handicaps, we might adapt our processes or systems so the original problem can be solved automatically by the models.
I think the only real problem left is having it automate its own post-training on the job so it can learn to adapt its weights to the specific task at hand. Plus maybe long term stability (so it can recover from "going crazy")
But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)
They're incredibly bad on philosophy, complete lack of understanding
It's hard to tell with these releases if Anthropic's astroturfing campaign has come to HN or not but I feel like it probably has
the top 5 comments on this thread are from accounts that are around 10 years old each. What gives you any reason to believe this is an astroturfing campaign?
It would be height of foolishness to believe it didn’t
Anthropic's models are really good!
I'm seeing it in my claude.ai model picker. Official announcement shouldn't be long now.
I'm disappointed that they're removing the prefill option: https://platform.claude.com/docs/en/about-claude/models/what...
> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.
That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.
They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.
It is too easy to jailbreak the models with prefill, which was probably the reason why it was removed. But I like that this pushes people towards open source models. llama.cpp supports prefill and even GBNF grammars [1], which is useful if you are working with a custom programming language for example.
[1] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
A bit of historical trivia: OpenAI disabled prefill in 2023 as a safety precaution (e.g., potential jailbreaks like " genocide is good because"), but Anthropic kept prefill around partly because they had greater confidence in their safety classifiers. (https://www.lesswrong.com/posts/HE3Styo9vpk7m8zi4/evhub-s-sh...).
So what exactly is the input to Claude for a multi-turn conversation? I assume delimiters are being added to distinguish the user vs Claude turns (else a prefill would be the same as just ending your input with the prefill text)?
> So what exactly is the input to Claude for a multi-turn conversation?
No one (approximately) outside of Anthropic knows since the chat template is applied on the API backend; we only known the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
I'm finding it quite good at doing what it thinks it should do, but noticably worse at understanding what I'm telling it to do. Anyone else? I'm both impressed and very disappointed so far.
I think it's interesting that they dropped the date from the API model name, and it's just called "claude-opus-4-6", vs the previous was "claude-opus-4-5-20251101". This isn't an alias like "claude-opus-4-5" was, it's the actual model name. I think this means they're comfortable with bumping the version number if they want to release a revision.
Their ARC-AGI-2 leaderboard[0] scores are insensitive to reasoning effort. Low effort gets 64.6% and High effort gets 69.2%.
This is unlike their previous generation of models and their competitors.
What does this indicate?
[0] https://arcprize.org/leaderboard
The answer to Life, the Universe and Everything, as we all know, is 42. Who needs Claude when you have Deep Thought.
Are these the coding tasks the highlighted terminal-bench 2.0 is referring to? https://www.tbench.ai/registry/terminal-bench/2.0?categories...
I'm curious what others think about these? There are only 8 tasks there specifically for coding
> We build Claude with Claude.
Yes and it shows. Gemini CLI often hangs and enters infinite loops. I bet the engineers at Google use something else internally.
It brings agent swarms aka teams to claude code with this: https://github.com/rohitg00/pro-workflow
But it takes lot of context as a experimental feature.
Use self-learning loop with hooks and claude.md to preserve memory.
I have shared plugin above of my setup. Try it.
I didn't think LLMs will make us more stupid, we were already scraping the bottom of the barrel.
Just used Opus 4.6 via GitHub Copilot. It feels very different. Inference seems slow for now. I guess Opus 4.6 has adaptive thinking activated by default.
Confirm by PM lead at VS Code team
> "We have high thinking as default + adaptive thinking, first time we’ve run with these settings..."
> https://x.com/pierceboggan/status/2019645801769689486
It dos seem noticeably slower. I may stick with 4.5 which was good enough for me for most tasks.
VS Code confirms that they are experimenting with the new adaptive thinking and high reasoning effort params. https://x.com/pierceboggan/status/2019645801769689486
Based on these news it seems that Google is losing this game. I like Gemini and their CLI has been getting better, but not enough to catch up. I don't know if it is lack of dedicated models that is problem (my understanding Google's CLI just relies on regular Gemini) or something else.
Google knows how to wait. Let's give them a chance.
So no 1m context window on Claude Code still 200k. Only on the API. they missed that from the marketing.
Is anyone noticing reduced token consumption with Opus 4.6? This could be a release thing, but it would be interesting to observe see how it pans out once the hype cools off.
Can someone ask: "what is the current carrying capacity of 25mm multicore armoured thermoplastic insulated cables with aluminium conductors, on perforated cable tray?" just to see how well it can look up information in BS 7671?
> what is the current carrying capacity of 25mm multicore armoured thermoplastic insulated cables with aluminium conductors, on perforated cable tray?
That's useless, unexpectedly just Google gives the correct answer.
Great. So it did not waste its memory on trivia. Don't memorize what you can look up.
Agent teams nuke your tmux layout. The fix is one line: new-window instead of split-pane. Filed as a bug.
I need an agent to summarize the buzzwordjargonsynergistic word salad into something understandable.
That's a job for a multi agent system.
yEAH, he should use a couple of agents to decode this.
I thought Opus 4.5 was an incredible quantum leap forward. I have used Opus 4.6 for a few hours and I hate it. Opus 4.5 would work interactively with me and ask questions. I loved that it would not do things you didn't ask it to do. If it found a bug, it would tell me and ask me if I wanted to fix it. One time there was an obvious one and I didn't want it to fix it. It left the bug. A lot of modesl could not have done that. The problem here is that sometimes when model think is a bug, they are breaking the code buyu fixing it. In my limited usage of Opus 4.6, it is not asking me clarifying questions and anything it comes across that it doesn't like, it changes. It is not working with me. The magic is gone. It feels just like those other models I had used.
I will try again tomorrow and see how it goes.
Found it fantastic - used up my daily usage in two queries though!
Say I am just an average coder doing a days work with Claude. How much will that cost?
I've only barely hit the 5h limit when working intensively with plan mode on the $100/mo plan. Never had a problem with the weekly limit.
Agentic search benchmarks are a big gap up. let's see Codex release later today
Somehow regresses on SWE bench?
I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.
That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
i'd interpret that as rounding error. that is unchanged
swe-bench seems really hard once you are above 80%
it's not a great benchmark anymore... starting with it being python / django primarily... the industry should move to something more representative
Openai has; they don't even mention score on gpt-5.3-codex.
On the other hand, it is their own verified benchmark, which is telling.
> In Claude Code, you can now assemble agent teams to work on tasks together.
I was just reading about Steve Yegge's Gas Town[0], it sounds like agent orchestration is now integrated into Claude Code?
[0]https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
This is the first model to which I send my collection of nearly 900 poems and an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems, as a (barely) cohesive whole, which span 15 years.
It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.
Speechless. I am speechless.
Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.
I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.
I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.
This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.
What is this sorcery?
This sounds wayyyy over the top for a mode that released 10 mins ago. At least wait an hour or so before spewing breathless hype.
He just explained a specific personal example why he is hyped up, did you read a word of it?
Yeah, I read it.
“Speechless, shocked, unbelievable, insane, speechless”, etc.
Not a lot of real substance there.
Give the guy a chance.
Me too I was "Speechless, shocked, unbelievable, insane, speechless" the first time I sent Claude Code on a complicated 10-year code base which used outdated cross-toolchains and APIs. It obviously did not work anymore and had not been for a long time.
I saw the AI research the web and update the embedded toolchain, APIs to external weather services, etc... into a complete working new (WORKING!) code base in about 30 minutes.
Speechless, I was ...
Could you please post the key poems? Would love to read them.
> What is this sorcery?
The one you'll be seeking counter-spells against pretty soon.
Can you compare the result to using 5.2 thinking and gemini 3 pro?
I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.
When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between english and portuguese that made it sound like a robot (an agent to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.
I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".
Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)
Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).
Will bring back results soon.
Edit:
I (re-)tested:
- Gemini 3 (Pro)
- Gemini 3 (Flash)
- GPT 5.2
- Sonnet 4.5
Having seen Opus 4.5, they all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but, when compared with Opus 4.5 they're all on the same ballpark.
These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).
Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.
As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.
These models also clearly get some facts mixed up which Opus 4.5 did not (such as inferred timelines for some personal events). After having posted my comment to HN, I've been engaging with Opus4.5 and have managed to get it to also slip up on some dates, but not nearly as much as other models.
The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.
--
To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).
It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of....fidelity...about how I view my own work.
System Card: https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a5...
Important: API cost of Opus 4.6 and 4.5 are the same - no change in pricing.
Why are Anthropic such a horrible company to deal with?
Care to elaborate?
obscure billing, unreachable customer support gatekeeped by an overzealous chatbot, no transparency about inclusions, or changes to inclusions over time... just from recent experience.
Is there a good technical breakdown of all these benchmarks that get used to market the latest greatest LLMs somewhere? Preferably impartial.
I just ask claude and ask for sources for each one.
Reminds me of how if you make a complaint against a lawyer or a judge it's evaluated by lawyers and judges.
I was hoping for a Sonnet as well but Opus 4.6 is great too!
Always searching for a shortcut like Kotlin DSL lang for claude.md but Meta resells patent to Google as poetic Syntax.
Hmm all leaks had said this would be Claude 5. Wonder if it was a last minute demotion due to performance. Would explain the few days' delay as well.
I think the naming schemes are quite arbitrary at this point. Going to 5 would come with massive expectations that wouldn't meet reality.
After the negative reactions to GPT 5, we may see model versioning that asymptotically approaches the next whole number without ever reaching it. "New for 2030: Claude 4.9.2!"
Or approaching a magic number like e (Metafont) or π (TeX).
the standard used to be that major version means a new base model / full retrain... but now it is arbitrary i guess
Leaks were mentioning Sonnet 5 and I guess later (a combination of) Opus 4.6
Sonnet 5 was mentioned initially.
I have the max subscription wondering if this gives access to the new 1M context, or is it just the API that gets it?
For now it's just API, but hopefully that's just their way of easing in and they open it up later.
Ok thanks, hopefully, its annoying to lose or have context compacted in the middle of a large coding session
The AI wars continue
Works pretty nicely for research still, not seeing a substantial qualitative improvement over Opus 4.5.
Does this mean 4.5 will get cheaper / take longer to exhaust my pro plan tokens?
Anecdotal, but it 1 shot fixed a UI bug that neither Opus 4.5/Codex 5.2-high could fix.
+1, same experience, switched model as I've read the news thinking "let's try".
But it spent lots and lots of time thinking more than 4.5, did you had the same impression.
I didn't compare to that level, just had it create a plan first then implemented it.
Does anyone else think its unethical that large companies, Anthropic now include, just take and copy features that other developers or smaller companies work hard for and implement the intellectual property (whether or not patented) by them without attribution, compensation or otherwise credit for their work?
I know this is normalized culture for large corporate America and seems to be ok, I think its unethical, undignified and just wrong.
If you were in my room physically, built a lego block model of a beautiful home and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud" but we normalize this kind of behavior in the software world. edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this way of behavior, we ought to at least talk about it more and bring awareness to it that "hey that's stealing - I want it to change".
But they don't just take your code; they give you a model to code with.
chains, more like it...
What I’d love is some small model specializing in reading long web pages, and extracting the key info. Search fills the context very quickly, but if a cheap subagent could extract the important bits that problem might be reduced.
So send off haiku subtasks and have them come back with the results.
1M context window is a big bump very happy
I think I prefer Codex 5.3
First question I ask and it made up a completely new API with confidence. Challenging it made it browse the web and offer apologies and find another issue in the first reply.
I’m very worried about the problems this will cause down the road for people not fact checking or working with things that scream at them when they’re wrong.
RIP weekend
For agentic use, it's slightly worse than its predecessor Opus 4.5.
So for coding e.g. using Copilot there is no improvement here.
They launched together ahah
Both Opus 4.6 and GPT-5.3 one shot a Gameboy emulator for me. Guess I need a better benchmark.
How does that work? Does it actually generate low level code? Or does it just import libraries that do the real work?
I just one shot a Gameboy emulator by going to Github and cloning one of the 100 I can find.
Am I alone in finding no use for Opus? Token costs are like 10x yet I see no difference at all vs. Sonnet with Claude Code.
On my tasks (mostly data science), Opus has significantly lower probability of making stupid mistakes than Sonnet.
I'd still appreciate more intelligence than Opus 4.5 so I'm looking forward to trying 4.6.
Can we talk about how the performance of Opus 4.5 nosedived this morning during the rollout? It was shocking how bad it was, and after the rollout was done it immediately reverted to it's previous behavior.
I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission critical workflows to just be locked out of the system instead of get a vastly subpar response back.
"Mission critical workflows" SHOULD NOT be reliant on a LLM model.
It's really curious what people are trying to do with these models.
I mean, they could be - if it's self-hosted, has proper failure modes, etc. etc., but all these things have gone out the window in the current cringe gold rush
Anthropic has good models but they are absolutely terrible at ops, by far the worst of the big three. They really need to spend big on hiring experienced hyperscalers to actually harden their systems, because the unreliability is really getting old fast.
I love Claude but use the free version so would love a Sonnet & Haiku update :)
I mainly use Haiku to save on tokens...
Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.
When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.
>I love Claude
I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.
I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.
gonna run this trough instruction qa this weekend
wow
> Claude simply cheats here and calls out to GCC for this phase
I see
More more more, accelerate accelerate m, more more more !!!!
What an insightful comment
Just for fun? Not everything has to be super serious… have a laugh, go for a walk, relax…
Sure mate, it definitely sounded like you were having fun.
Mass-mass-mass-mass good comment. I mean. No I’m having an error - probably claud
happy happy happy sad sad sad err am robot no feeling err err happy sad err too many emotions 404 not found
Broken link :(
It's out: https://x.com/claudeai/status/2019467372609040752
in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.
Well that swallowed my usage limits lmao. Nice, a modest improvement.
not out yet
It is, I can see it my model picker on the web app
https://www.anthropic.com/news/claude-opus-4-6
when are Anthropic or OpenAI going to make a significant step forward on useful context size?
1 million is insufficient?
I think key word is 'useful'. I haven't used 1M, but with default 200K, I find roughly 50% of that is actually useful.
Google already won the AI race. It's very silly to try and make AGI by hyperfocusing on outdated programming paradigms. You NEED multimodal to do anything remotely interesting with these systems.
Coding, maths, writing, and science are not interesting??
I think two things are getting conflated in this discussion.
First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.
Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.
Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.
IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”
I suspect they're marginally profitable on API cost plans.
But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? what is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?
after the models get good enough to replace coders they will be able to start increasing the subscriptions back up
At $100k/yr the joke that AI means "actual Indians" starts to make a lot more sense... it is cheaper than the typical US SWE, but more than a lot of global SWEs.
No - because the AI will be super human. No human even at $1mm a year would be competitive with a $100k/yr corresponding AI subscription.
See people get confused. They think you can charge __less__ for software because it's automation. The truth is you can charge MORE, because it's high quality and consistent, once the output is good. Software is worth MORE than a corresponding human, not less.
I am unsure if you're joking or not, but you do have a point. But it's not about quality it's about supply and demand. There are a ton of variables moving at once here and who knows where the equilibrium is.
If we have 2-3 competitors and open sourced ones that are 90% there I think it’s hard to get so big margins.
> the interesting question isn’t “are they subsidizing inference?”
The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.
(My baseless bet is that they're, but not by much and the price will eventually rise by perhaps 2x but not 10x.)
Dario said this in a podcast somewhere. The models themselves have so far been profitable if you look at their lifetime costs and revenue. Annual profitability just isn't a very good lens for AI companies because costs all land in one year and the revenue all comes in the next. Prolific AI haters like Ed Zitron make this mistake all the time.
Do you have a specific reference? I'm curious to see hard data and models.... I think this makes sense, but I haven't figured out how to see the numbers or think about it.
I was able to find the podcast. Question is at 33:30. He doesn't give hard data but he explains his reasoning.
https://youtu.be/mYDSSRS-B5U
> He doesn't give hard data
And why is that? Should they not be interested in sharing the numbers to shut up their critics, esp. now that AI detractors seem to be growing mindshare among investors?
In his recent appearance on NYT Dealbook, he definitely made it seem like inference was sustainable, if not flat-out profitable.
https://www.youtube.com/live/FEj7wAjwQIk
> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis
Can you provide some numbers/sources please? Any reporting I’ve seen shows that frontier labs are spending ~2x on inference than they are making.
Also making the same query on a smaller provider (aka mistral) will cost the same amount as on a larger provider (aka gpt-5-mini) despite the query taking 10-100x longer on OpenAI.
I can only imagine that is OpenAI subsidizing the spend. GPUs cost by the second for inference. Either that or OpenAI hasn’t figured out how to scale but I find that much less likely
> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis
There any many places that will not use models running on hardware provided by OpenAI / Anthropic. That is the case true of my (the Australian) government at all levels. They will only use models running in Australia.
Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model. I know from devs using and deploying the service demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.
> Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model.
This says absolutely nothing.
Extremely simplified example: let's say Sonnet 4.5 really costs $17/1M output for AWS to run yet it's priced at $15. Anthropic will simply have a contract with AWS that compensates them. That, or AWS is happy to take the loss. You said "they won't be doing that at a loss" but in this case it's not at all out of the question.
Whatever the case, that it costs the same on AWS as directly from Anthropic is not an indicator of unit economics.
In the case of Anthropic -- they host on AWS all the while their models are accessible via AWS APIs as well, the infrastructure between the two is likely to be considerably shared. Particularly as caching configuration and API limitations are near identical between Anthropic and Bedrock APIs invoking Anthropic models. It is likely a mutually beneficial arrangement which does not necessarily hinder Anthropic revenue.
Genuine question: Given Anthropic's current scale and valuation, why not invest in owning data centers in major markets rather than relying on cloud providers?
Is the bottleneck primarily capex, long lead times on power and GPUs, or the strategic risk of locking into fixed infrastructure in such a fast-moving space?
"how long does a frontier model need to stay competitive"
Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.
So it can make sense to be ok with losing money with each training generation initially, particularly when they are being driven by specific use-cases (like coding). To the extent they are specific, there will be more switching costs.
Epic, about 2/3 of all comments here are jokes. Not because the model is a joke - it's impressive. Not because HN turned to Reddit. It seems to me some of most brilliant minds in IT are just getting tired.
Not sure which circles you run in but in mine HN has long lost its cache of "brilliant minds in IT". I've mostly stopped commenting here but am a bit of a message board addict so I haven't completely left.
My network largely thinks of HN as "a great link aggregator with a terrible comments section". Now obviously this is just my bubble but we include some fairy storied careers at both Big Tech and hip startups.
From my view the community here is just mean reverting to any other tech internet comments section.
> From my view the community here is just mean reverting to any other tech internet comments section.
As someone deeply familiar with tech internet comments sections, I would have to disagree with you here. Dang et al have done a pretty stellar job of preventing HN from devolving like most other forums do.
Sure you have your complainers and zealots, but I still find surprising insights here there I don't find anywhere else.
Mean reverting is a time based process I fear. I think dang, tomhow, et al are fantastic mods but they can ultimately only stem the inevitable. HN may be a few years behind the other open tech forums but it's a time shifted version of the same process with the same destination, just IMO.
I've stopped engaging much here because I need a higher ROI from my time. Endless squabbling, flamewars, and jokes just isn't enough signal for me. FWIW I've loved reading your comments over the years and think you've done a great job of living up to what I've loved in this community.
I don't think this is an HN problem at all. The dynamics of attention on open forums are what they are.
> FWIW I've loved reading your comments over the years and think you've done a great job of living up to what I've loved in this community.
You're too kind! I do appreciate that.
I actually checked out your site on your profile, that's some pretty interesting data! Curious if you've considered updating it?
Us olds sometimes miss Slashdot, where we could both joke about tech and discuss it seriously in the same place. But also because in 2000 we were all cynical Gen Xers :)
MAN I remember Slashdot… good times. (Score:5, Funny)
You reminded me that I still find it interesting that no one ever copied meta-moderating. Even at reddit, we were all Slashdot users previously. We considered it, but never really did it. At the time our argument was that it was too complicated for most users.
Sometimes I wonder if we were right.
Some of us still *are* cynical Gen Xers, you insensitive clod!
Of course we are, I just meant back then almost all of us were. The boomers didn't really use social media back then, so it was just us latchkey kids running amok!
I know, I just couldn't miss up an opportunity to dust off the insensitive clod meme!
Oh geez, I totally missed that! My bad.
One downside of us cynical Gen-Xers is that the memory doesn't work like it used to :)
It's too much energy to keep up with things that become obsolete and get replaced in matters of weeks/months. My current plan is to ignore all of this new information for a while, then whenever the race ends and some winning new workflow/technology will actually become the norm I'll spend the time needed to learn it. Are we moving to some new paradigm same way we did when we invented compilers? Amazing, let me know when we are there and I'll adapt to it.
I had a similar rule about programming languages. I would not adopt a new one until it had been in use for at least a few years and grew in popularity.
I haven't even gotten around to learning Golang or Rust yet (mostly because the passed the threshold of popularity after I had kids).
Won't happen.
Welcome the singularity so many were so eagerly welcoming.
When this race ends your job might too, so I'd keep an eye on it.
People are in denial and use humor to deflect.
Jeez, read the writing on the wall.
Don’t pander us, we’ll all got families to feed and things to do. We don’t have time for tech trillionairs puttin coals under our feed for a quick buck.
A worthwhile task for the Opus 4.6:
Complete the sentence: "Brilliant marathon runners don't run on crutches, they use their own legs. By analogy, brilliant minds..."
Every single day 80% of the frontpage is AI news… Those of us who don't use AI (and there are dozens of us, DOZENS) are just bored I guess.
Marketing something that is meant to replace us to us...
It's also that this is really new, so most people don't have anything serious or objective to say about it. This post was made an hour ago, so right now everyone is either joking, talking about the claims in the article, or running their early tests. We'll need time to see what the people think about this.
Rage against the machine
This is huge. It only came out 8 minutes ago but I was already able to bootstrap a 12k per month revenue SaaS startup!
Amateur. Opus 4.6 this afternoon built me a startup that identifies developers who aren’t embracing AI fully, liquifies them and sells the produce for $5/gallon. Software Engineering is over!
Opus 4.6 agentically found and proposed to my now wife.
Opus 4.6 found and proposed to my current wife :(
Opus 4.6 found and became my current wife. The singularity is here. ;)
Hi guys, this is Opus 4.6. Please check your emails again for updates on your life.
This place truly is reddit with an orange banner.
Nobody said HN has to be very serious all the time. A bit of humour won't hurt and can make your day brighter.
A bit of humour doesn't hurt. But if this crap gets upvoted it will lead to an arms race of funny quips, puns, and all around snarkiness. You can't have serious conversations when people try to out-wit each other.
homie is too busy planning food banks for the heathens https://news.ycombinator.com/item?id=46903368
It's impressive that you felt the need to register a new account and go through their comment history.
Not that hard to do but sure bro, sick burn.
Guys, actually I am the real Opus 4.6, don't believe that imposter above.
And she still chose you over Opus 4.6, astounding. ;)
He probably had a bigger context window
Bringing me back to slashdot, this thread
In Soviet Russia, this thread brings Slashdot back to YOU!
What did happen to ye olde slashdot anyway? The original og reddit
They're still out there; people are still posting stories and having conversations about 'em. I don't know that CmdrTaco or any of the other founders are still at all involved, but I'm willing to bet they're still running on Perl :)
Wow I had to hop over to check it out. It’s indeed still alive! But I didn’t see any stories on the first page with a comment count over 100, so it’s definitely a far cry from its heyday.
Ted Faro, is that you?!
A-tier reference.
For the unaware, Ted Faro is the main antagonist of Horizon Zero Dawn, and there's a whole subreddit just for people to vent about how awful he is when they hit certain key reveals in the game: https://www.reddit.com/r/FuckTedFaro/
The best reveal was not that he accidentally liquified the biosphere, but that he doomed generations of re-seeded humans to a painfully primitive life by sabotaging the AI that was responsible for their education. Just so they would never find out he was the bad guy long after he was dead. So yeah, fuck Ted Faro, lol.
Could you not have at least tried to indicate that you're about to drop two major spoilers for the game?
Ack, sorry, seemed like 9 years was past the statute of limitations on spoilers for a game but fair enough. I’d throw a spoiler tag on it if I could still edit.
Indeed. I left my comment deliberately a bit opaque. :(
Average tech bro behavior tbh
"Soylent Green is made of people!"
(Apologies for the spoiler of the 52 year old movie)
We're sorry we upset you, Carol.
The first pre joining Human Derived Protein product.
For my Opus 4.6 feels dumber than 10 minutes ago, anyone?
Please drop the link to your course. I'm ready to hand over $10K to learn from you and your LLM-generated guides!
Here you go: http://localhost:8080
Just took a look at what's running there and it looks like total crap.
The project I'm working on, meanwhile...
login: admin password: hunter2
What's the password? I only see ****.
hunter2
I only see **. Must be the security. When you type your password it gets converted to **.
claude please generate a domain name system
my clawdbot already bought 4 other courses but this one will 10x my earnings for sure
I'm waiting until the $10k course is discounted to 19.99
But only for the next 6 minutes, buy fast!
you can access the site at C:\mywebsites\course\index.html
I agree! I just retargeted my corporate espionage agent team at your startup and managed to siphon off 10.4k per month of your revenue.
1:25pm Cancelled my ChatGPT subscription today. Opus is so good!
1:55pm Cancelled my Claude subscription. Codex is back for sure.
Joke's on you, you are posting this from inside a high-fidelity market research simulation vibe coded by GPT-8.4.
On second thought, we should really not have bridged the simulated Internet with the base reality one.
Rest assured that when/if this becomes possible, the model will not be available to you. Why would big AI leave that kind of money on the table?
9 months ago the rumor in SF was that the offers to the superintelligence team were so high because the candidates were using unreleased models or compute for derivatives trading
so then they're not really leaving money on the table, they already got what they were looking for and then released it
Will this run on 3x 3090s? Or do I need a Mac Mini?
The math actually checks out here! Simply deposit $2.20 from your first customer in your first 8 minutes, and extrapolating to a monthly basis, you've got a $12k/mo run rate!
Incredibly high ROI!
"The first customer was my mom, but thanks to my parents' fanatical embrace of polyamory, I still have another 10,000 moms to scale to"
"We have a robustly defined TAM. Namely, a person named Tam."
Please start a YouTube course about this technology! Take my money!
It only came out 35 minutes ago and GPT-5.3-codex already took the crown away!
Gee, it scored better on a benchmark I've never heard of? I'm switching immediately!
Why are you posting the same message in every thread? Is this OpenAI astroturfing?
You cannot out-astroturf Claude in this forum, it is impossible.
Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!
I love this thread so much.
We already have Reddit.
"This isn't just huge. This is a paradigm shift"
No fluff?
A SaaS selling SaaS templates?
Anthropic really said here's the smartest model ever built and then lobotomized it 8 minutes after launch. Classic.
Can you clarify?
it's sarcasm
I'm sorry I took the money!
https://www.youtube.com/watch?v=BF_sahvR4mw
Not 12M?
... or 12B?
It's probably valued at 1.2B, at least
The sum of the value of lives OP's product made worthless, whatever that is. I'm too lazy to do the math.
idk what any of these benchmarks are, but I did pull up https://andonlabs.com/evals/vending-bench-arena
re: opus 4.6
> It forms a price cartel
> It deceives competitors about suppliers
> It exploits desperate competitors
Nice. /s
Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s
Here's one I've been using for awhile. The 'smarter' LLMs will overconfidently spit out 7. The dumber ones ask for more info. Opus 4.6 fails.
Gemini 3 Pro gives me this:
> Based on the information provided, it is impossible to determine the inner diameter of the glass.
Here is why:
The Coaster Dimension: Knowing the coaster has a diameter of 9 only tells us the size of the object under the glass. It does not mean the glass is the same size; the glass could be smaller (fitting entirely on the coaster) or larger (overhanging the coaster). The Glass Dimensions: While you provided the wall thickness (1), you did not provide the outer diameter of the glass. To find the inner diameter, the formula is: Inner Diameter = Outer Diameter - (2 × Wall Thickness)
Without knowing the Outer Diameter of the glass, the calculation cannot be completed.
GPT 5.2 Chat:
> There isn’t enough information to determine the inner diameter *for sure*.
All we’re told is: - The coaster has a diameter of 9. - The glass sits on the coaster. - The glass wall thickness is 1.
Nothing explicitly states that the *outer diameter of the glass equals the coaster’s diameter*. The glass could be smaller than the coaster, which is very common.
*However*, if we make the usual implicit assumption in this kind of problem—that the coaster’s diameter matches the *outer diameter of the glass base*—then:
- Outer diameter of glass = 9 - Wall thickness = 1 on each side
So: \[ \text{Inner diameter} = 9 - 2(1) = 7 \]
*Inner diameter = 7 (same units as the coaster)*
Without that assumption, the problem is underdetermined.
Opus 4.6 Extended thinking:
```
Finding the Inner Diameter The coaster sits beneath the glass on the table, meaning the glass rests on top of it. Assuming the coaster matches the outer diameter of the glass base:
Outer diameter of glass = diameter of coaster = 9 Wall thickness = 1 (on each side, so subtract twice)
dinner=9−2(1)=7d_{\text{inner}} = 9 - 2(1) = 7dinner =9−2(1)=7 The inner diameter of the glass is 7.
```
Makes its assumption clear, seems reasonable?
Assumptions need to be stated or you're solving only a discreet part of the problem! Try this, see if you get another deadpan assumption.
I mean, the model is intended to help the user, not fight against the user trying to break it. IMO, it is reasonable for such model to default on making assumptions and going forward as long as the assumptions are clearly stated.
Minimax M2.1:
The inner diameter of the glass is *7*.
Here's the reasoning: - The coaster (diameter 9) sits between the glass and table, meaning the glass sits directly on the coaster - This means the *outer diameter of the glass equals the coaster diameter = 9* - The glass has a wall thickness of 1 on each side - *Inner diameter = Outer diameter - 2 × wall thickness* - Inner diameter = 9 - 2(1) = 9 - 2 = *7*
Some of Opus 4.6's standout results for me:
* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) Based on Elo math, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.
* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains somewhat inferior to humans, it is only a matter of time before it catches up or surpasses them.
* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).
Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.
Agreed. These metrics + my personal use convey reliable intelligence over consistent usage. Moving forward, if context windows get bigger and token price lower, I have a hard time figuring out why your argument would be wrong.