It's all very clear when you mentally replace "LLM" with "text completion driven by compressed training data".
E.g.
[Text completion driven by compressed training data] exhibit[s] a puzzling inconsistency: [it] solves complex problems yet frequently fail[s] on seemingly simpler ones.
Some problems are better represented by a locus of texts in the training data, allowing more plausible talk to be generated. When the problem is not well represented, it does not help that the problem is simple.
If you train it on nothing but Scientology documents, and then ask about the Buddhist perspective on a situation, you will probably get some nonsense about body thetans, even if the situation is simple.
I have a hard time conceptualizing lossy text compression, but I've recently started to think about the "reasoning"/output as just a byproduct of lossy compression, with the weights tending towards an average of the information "around" the main topic of the prompt. What I've found easier is thinking about it like lossy image compression: generating more output tokens via "reasoning" is like subdividing nearby pixels and filling in the gaps with values the model has seen there before. Taking the analogy a bit too far, you can also think of the vocabulary as the pixel bit depth.
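To make the image version of the analogy concrete, here's a toy sketch in plain numpy (this is just the analogy, nothing to do with how LLMs are actually trained): compress an image by keeping only block averages, then "decompress" by filling each block back in with its stored value. The fine detail is gone, but the reconstruction still looks locally plausible.

```python
import numpy as np

# Toy lossy compression: average 2x2 blocks of an 8x8 "image" down to 4x4.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(float)

# Compress: keep one averaged value per 2x2 block (75% of the data is discarded).
compressed = image.reshape(4, 2, 4, 2).mean(axis=(1, 3))

# Decompress: subdivide each stored value back into a 2x2 block of pixels.
# The gaps get filled with a plausible value (the block average), not the original data.
reconstructed = np.repeat(np.repeat(compressed, 2, axis=0), 2, axis=1)

print("mean absolute error per pixel:", np.abs(image - reconstructed).mean())
```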
I definitely agree that replacing "AI" or "LLM" with "X driven by compressed training data" makes things a lot clearer, and it's a useful shortcut.
You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens. I find it easier to conceptualize this in three dimensions. 3blue1brown has a good video series which covers the overall concept of LLM vectors in machine learning: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
To give a concrete example, say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer? By adding more relevant tokens (honey, worker, hive, beeswax) we steer the token generation to the place in the "word cloud" where our next token is more likely to exist.
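Here's a toy version of that steering, with hand-made 3-D vectors standing in for real embeddings (every number below is invented purely for illustration): averaging the context tokens pulls the query toward the "bee" region of the word cloud, so the bee sense of "queen" scores highest.

```python
import numpy as np

# Hand-made 3-D "word cloud"; the axes roughly mean (royalty, insects, cards).
# These vectors are illustrative only, not real embeddings.
vocab = {
    "queen_monarch": np.array([0.9, 0.1, 0.0]),
    "queen_bee":     np.array([0.1, 0.9, 0.0]),
    "queen_card":    np.array([0.1, 0.1, 0.9]),
    "honey":         np.array([0.1, 0.8, 0.1]),
    "hive":          np.array([0.0, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Represent the prompt context as the average of its token vectors.
context = np.mean([vocab["honey"], vocab["hive"]], axis=0)

# Adding "honey" and "hive" steers the context toward the bee region,
# so the bee sense of "queen" becomes the most similar candidate.
for sense in ("queen_monarch", "queen_bee", "queen_card"):
    print(sense, round(cosine(context, vocab[sense]), 3))
```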
I don't see LLMs as "lossy compression" of text. To me that implies retrieval, and Transformers are a prediction device, not a retrieval device. If one needs retrieval then use a database.
> You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens.
I like to frame it as a theater-script cycling through the LLM. The "reasoning" difference is just changing the style so that each character has film noir monologues. The underlying process hasn't really changed, and the monologue text isn't fundamentally different from dialogue or stage direction... but more data still means more guidance for each improv-cycle.
> say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer?
I'd like to point out that this scheme can result in things that look better to humans in the end... even when the "clarifying" choice is entirely arbitrary and irrational.
In other words, we should be alert to the difference between "explaining what you were thinking" versus "picking a firm direction so future improv makes nicer rationalizations."
It is not a useful shortcut because you don't know what the training data is, nothing requires it to be an "average" of anything, and post-training arbitrarily re-weights all of its existing distributions anyway.
Well, that's what an LLM is. The problem is when one's mental model is built on "AI" instead of "LLM."
The fact that LLMs can abstract concepts and do any amount of out-of-sample reasoning is impressive and interesting, but the null hypothesis for an LLM being "impressive" in any regard is that the data required to answer the question is present in its training set.
This is true, but also misleading. We are learning that the models achieve compression by distilling higher-level concepts and deriving generalized, human-like abilities; see, for example, the recent introspection paper from Anthropic.
> Text completion driven by compressed training data...solves complex problems
Sure it does. Obviously. All we ever needed was some text completion.
Thanks for your valuable insight.
Thank you for posting this. I'm struck by how much of this research studies a behavior in isolation from the other assumptions, and then describes each individual capability as a new solution or discovered ability that would supposedly still work once all those other assumptions are back in play. It makes most LLM research feel like whack-a-mole, if the goal is to produce accurate and reliable models by understanding these techniques. Instead, it's more like seeing faces in cars and buildings: artifacts of patterns, pattern groupings, and pattern recognition. Building houses on sand, etc.
Sounds a lot like Kolmogorov complexity.
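(For anyone who hasn't run into it: the Kolmogorov complexity of a string x is the length of the shortest program that outputs x on a fixed universal machine U. LLM weights are at best an informal, lossy cousin of this, so read the comparison as an analogy rather than a formal claim.)

```latex
K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}
```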
I haven't read this particular paper in-depth, but it reminds me of another one I saw that used a similar approach to find if the model encodes its own certainty of answering correctly. https://arxiv.org/abs/2509.10625
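I haven't read that paper's methodology either, so this is not its actual method; it's just a minimal sketch of the generic "linear probe" recipe this kind of work tends to use: collect a hidden-state vector per question, label each one (here, whether the model's answer was correct), and check whether a simple classifier beats chance on held-out data. All the data below is synthetic and the names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: a real probe would use one hidden-state vector per question
# (e.g. the residual stream at the final token) plus a 0/1 correctness label.
rng = np.random.default_rng(0)
hidden_dim, n_examples = 64, 2000
direction = rng.normal(size=hidden_dim)            # pretend "will answer correctly" direction
labels = rng.integers(0, 2, size=n_examples)
states = rng.normal(size=(n_examples, hidden_dim)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If the probe beats chance on held-out states, they linearly encode the signal.
print("held-out accuracy:", probe.score(X_test, y_test))
```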
Probably irrelevant, but something funny about Claude Code is that it will routinely say something like "10 week task, very complex" and then one-shot it in 2 minutes. I put off having it create a feature for a while because it kept telling me it was way too complicated. None of the open source versions I tried were working, but I finally decided to have it build the feature anyway, and it ended up doing better than the open source projects. So there's something off about how well Claude estimates the difficulty of things for itself, and I'm wondering if that makes it perform worse by not attempting things it would do well at.
In terms of the time estimates: I've added to my global rules to never give time estimates for tasks, as they're useless and inaccurate.
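For reference, one place such a rule can live is Claude Code's memory file (user-level CLAUDE.md); the wording below is only an illustration, not the exact rule:

```markdown
<!-- ~/.claude/CLAUDE.md (user-level memory; a project-level CLAUDE.md works too) -->
- Never give time or effort estimates for tasks (e.g. "this will take 10 weeks").
  Describe the plan, the steps, and the risks instead.
```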
I think there are two aspects to this.
Firstly, Claude's self-concept is based on humanity's collective self-concept. (Well, the statistical average of all the self-concepts on the internet.)
So it doesn't have a clear understanding of what LLMs' strengths and weaknesses are, nor, by extension, its own. (Neither do we, from what I can gather. At least, not in a way that's well represented in web scrapes ;)
Secondly, as a programmer I have noticed a similar pattern... stuff that people say is easy turns out to be a pain in the ass, and stuff that they say is impossible turns out to be trivial. (They didn't even try, they just repeated what other people told them was hard, who also didn't try it...)
I wonder if it's trying to predict what kind of estimate a human engineer would provide.
Considering it’s trained on predicting the next word in stuff humans estimated before AI, wouldn’t that make sense?
Not sure how related this is, but I've noticed it has a tendency to start sentences with inflated optimism. I think the idea is that if it opens with "Aha, I see it now! The problem is...", whatever comes next is more likely to be a correct solution than if it hadn't used an overtly positive prefix, even if that leads to a lot of annoying behavior.