The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, it's a valuable piece of writing.
- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
Other actionable insights are:
- Merge amendments up into the initial prompt.
- Evaluate prompts multiple times (ensemble); see the sketch below.
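To make the ensemble idea concrete, here is a minimal sketch (not from the article); `call_model` is a hypothetical stand-in for whatever completion API you use:

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a single LLM completion call."""
    raise NotImplementedError("wire this up to your provider's API")

def ensemble_answer(prompt: str, n: int = 5) -> tuple[str, float]:
    """Sample the same prompt n times; return the majority answer and its
    agreement rate, which doubles as a rough coherence signal."""
    answers = [call_model(prompt).strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

# An agreement rate well below 1.0 flags a variance-dominated prompt.
```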
Sometimes, when stressed, I have used several models to verify each other's work. They usually find problems, too!
This is very useful for things that take time to verify; we have CI jobs that take 2-3 hours to run, and I hate when they fail because of a syntax error.
Syntax errors should be caught by type checking / compiling / linting. That should not take 2-3 hours!
> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.
Coherence requires 2 opposing forces to hold it in one dimension, and at least 3 of them in higher dimensions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence: more experimentation before we hit a dead end and turned around.
So we got better results using Haiku (failing over to Sonnet) instead of Opus, and using a higher-reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better because they do not second-guess their approaches: they fail or they succeed, and they are not as tenacious as the higher-cost models.
We can escalate to a higher authority and get out of that mess faster if we fail hard and early.
The knowledge of exactly how a failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.
Splitting up the tactical and strategic sides of the problem seems to work, similar to how generals don't hold guns in a war.
[1] - https://arxiv.org/abs/2601.14351
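For what it's worth, the decompose-then-execute split described above can be sketched roughly like this; `plan_with` and `execute_with` are hypothetical wrappers, not anything from the paper:

```python
def plan_with(planner: str, goal: str) -> list[str]:
    """Hypothetical: ask a high-reasoning model to split a goal into small tasks."""
    raise NotImplementedError

def execute_with(model: str, task: str) -> bool:
    """Hypothetical: let a cheap model attempt one task; True on success."""
    raise NotImplementedError

def run(goal: str,
        planner: str = "high-reasoning-model",
        executors: tuple[str, ...] = ("cheap-model", "mid-tier-model")) -> list[str]:
    failed = []
    for task in plan_with(planner, goal):
        # Fail hard and early: try the cheapest executor first, then escalate.
        if not any(execute_with(model, task) for model in executors):
            failed.append(task)  # hand these back to the planner or a human
    return failed
```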
> Coherence requires 2 opposing forces
This seems very basic to any kind of information processing beyond straight shot predictable transforms.
Expansion and reduction of possibilities, branches, scope, etc.
Biological and artificial neural networks converging on multiple candidate signals, which are then reduced by competition between them.
Scientific theorizing, followed by experimental testing.
Evolutionary genetic recombination and mutation, winnowed back by resource competition.
Generation, reduction, repeat.
In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.
Control systems: a command signal proportional to demand, vs. continually reverse-acting error feedback.
> This seems very basic
Yes, this is not some sort of hard-fought wisdom.
It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.
In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.
If you don't really want the experiments and data from the academic paper, we have a white paper whose content will be completely obvious to anyone who has read High Output Management, The Mythical Man-Month, and A Philosophy of Software Design recently.
Nothing in there is new, except the field it is applied to has no humans left.
> Yes, this is not some sort of hard-fought wisdom.
By basic I didn't mean uninteresting.
In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.
> In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.
I think it is the fact that the agents are each operating coherently toward their respective, complementary goals. Whereas asking one agent to both solve and judge creates conflicting constraints before a solution has even begun.
Creative friction.
I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.
So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).
More or less, delegation and peer review.
This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
Incoherence is not error.
You can have a vanishingly small error and an incoherence at its max.
That would be evidence of perfect alignment (zero bias) and very low variance.
What do 'domain valleys' and 'tunneling' mean in this context?
So the hidden mental model that the OP is expressing but failed to elucidate is that LLMs can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper-bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary of that fact is that there are now regions within the vector space that contain not very much. Those are the valleys that need to be tunneled through, i.e. the model needs to activate disparate regions of its knowledge manifold simultaneously, which seems like it might be difficult to do. I'm not sure this is a good way of looking at things, though, because inference isn't topology and I'm not sure that abstract reasoning can be reduced to finding ways to connect concepts that have been learned in isolation.
A hallmark of intelligence is the ability to find connections between the seemingly disparate.
That's also a hallmark of some mental/psychological illnesses (paranoid schizophrenia family) and use of certain drugs, particularly hallucinogens.
The hallmark of intelligence in this scenario is not just being able to make the connections, but being able to pick the right ones.
The word "seemingly" is doing a lot of work here.
Sometimes things that look very different actually are represented with similar vectors in latent space.
When that happens to us it "feels like" intuition: something you can't really put a finger on, and it might require work to put it into a form that can be transferred to another human who has a different mental model.
Actually, a hallmark could be to prune illusory connections, right? That would decrease complexity rather than amplifying it.
Yes, that also happens, for example when someone first said that natural disasters are not caused by offending the gods. It is all about making explanations as simple as possible, but no simpler.
Does this make conspiracy theorists highly intelligent?
No, but they emulate intelligence by making up connections between seemingly disparate things, where there are none.
They make connections but lack the critical thinking skills to weed out the bad/wrong ones.
Which is why, just occasionally, they're right, but mostly by accident.
Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.
> the ability to traverse between domain valleys in the cognitive manifold.
Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?
I think they mean both "know about a lot of different fields" and also "be able to connect them together to draw inferences", the latter perhaps being tricky?
Maybe? They should speak more clearly regardless, so we don't have to speculate over it. The way you worded it is much more understandable.
> When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.
This was not my experience studying philosophy. After Kant there was a period where philosophers were basically engaged in a centuries-long obfuscated writing competition. The pendulum didn't start to swing back until Nietzsche. It reminded me of legal jargon, but more pretentious and less concrete.
It seems to me that your anecdote exemplifies their point.
The issue is the revelation. It's always individual at some level. And don't forget our senses are crude. The best way is to store "insights" as information until we collect enough data that we can test them again (hopefully without a lot of bias). But that can be more than a lifetime of work, so sometimes you have to take some insights at face value based on heuristics (parents, teachers, elders, authority, ...).
This paper indicates that we should probably be less fearful of Terminator style accidental or emergent AI-misalignment. At least, as far as the existing auto-regressive LLM architecture is concerned. We may want to revisit these concerns if and when other types of artificial general intelligent models are deployed.
The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.
The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.
You simply can't have a single shot context with so many simultaneous constraints and expect to make forward progress. This cannot be solved with additional silicon, power or data.
Smaller prompts and fewer tools tend to be more stable. I try to stay within 1000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models. It's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend like it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub agent every time you read a file, you can reduce the impact to the token budget of the caller by orders of magnitude. The callee can return a brief summary or yes/no response to the caller after reading 500kb of source. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.
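A minimal sketch of that pattern, with a hypothetical `summarize_with_subagent` standing in for spawning a fresh sub-agent; the point is that the caller's context only grows by the size of the returned summary, never by the size of the file:

```python
from pathlib import Path

def summarize_with_subagent(question: str, text: str) -> str:
    """Hypothetical: a fresh sub-agent reads the full text and returns a short
    answer or summary; its own context is discarded afterwards."""
    raise NotImplementedError

def inspect_file(path: str, question: str) -> str:
    """The caller never sees the raw file, only a brief summary, so a 500 KB
    source file costs the parent a few hundred tokens."""
    return summarize_with_subagent(question, Path(path).read_text())

# Each level of recursion can do the same, so the parent's token budget stays
# roughly constant while the total amount of code inspected compounds.
```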
I think it's not because the AI is working toward "misaligned" goals. The user never specifies the goal clearly enough for the AI system to work.
However, I think producing a detailed enough specification requires the same or even a larger amount of work than writing the code. We write a rough specification and clarify it during the process of coding. I think there is a minimum effort required to produce these specifications, and AI will not help you speed that up.
As of today though, that doesn't work. Even straightforward tasks that are perfectly spec-ed can't be reliably done with agents, at least in my experience.
I recently used Claude for a refactor. I had an exact list of call sites, with positions etc. The model had to add .foo to a bunch of builders that were either at that position or slightly before (the code position was for .result() or whatever.) I gave it the file and the instruction, and it mostly did it, but it also took the opportunity to "fix" similar builders near those I specified.
That is after iterating a few times on the prompt (first time it didn't want to do it because it was too much work, second time it tried to do it via regex, etc.)
That makes me wonder about the "higher and higher-level language" escalator. When you're writing in assembly, is it more work to write the code than the spec? And is the reverse true if you can code up your system in Ruby? If so, does that imply anything about the "spec driven" workflow people are using with AIs? Are we right on the cusp, where writing natural-language specs and writing high-level code are comparably productive?
I believe that the issue right now is that we're using languages designed for human creation in an AI context. I think we probably want languages that are optimized for AI written but human read code, so the surface texture is a lot different.
My particular hypothesis on this is something that feels a little bit like python and ruby, but has an absolutely insane overkill type system to help guide the AI. I also threw in a little lispiness on my draft: https://github.com/jaggederest/locque/
I don't know, LLMs thrive on human text, so I would wager that a language designed for humans would quite closely match an ideal one for LLMs. Probably the only difference is that LLMs are not "lazy": they tolerate boilerplate better, and lower-complexity structures likely fit them better. (E.g. they can't really one-shot understand some imported custom operator that is not very common in their training data.)
Also, they rely surprisingly heavily on "good" code patterns, like comments and naming conventions.
So if anything, a managed language [1] with a decent type system and not a lot of features would be the best, especially if it has a lot of code in its training data. So I would rather vote for Java, or something close.
[1] Reasoning about lifetimes, even if aided by the compiler, is a global property, and LLMs are not particularly good at that.
But that is less fundamental than you make it sound. LLMs work well with human language because that's all they are trained on. So what else _could_ an ideal language possibly look like?
On the other hand: the usefulness of LLMs will always be gated by their interface to the human world. So even if their internal communication is superseded at some point, their contact surface can only evolve if their partners/subjects/masters can interface with it.
If you are on the same wavelength as someone, you don't need to produce a full spec. You can trust that the other person has the same vision as you and will pick reasonable ways to implement things. This is one reason why personalized AI agents are important.
Programming languages can be a thinking tool for a lot of tasks, very much like other notation, such as sheet music and map drawing. A condensed and somewhat formal manner of describing ideas can increase communication speed. It may lack nuance, but in some cases nuance is harmful.
The nice thing about code compared to other notation is that it's useful on its own. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two steps of writing a spec, having an LLM translate it, and then having to verify the output and alter it.
Assembly and high-level languages are equivalent in terms of semantics. The latter help in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, structs and other data structures, ...) so that categories of problems are easily solved. There's no higher level of computing model unlocked, just a faster level of productivity unlocked by following proven patterns.
Spec-driven workflow is a mirage, because even the best specs will leave a lot of unspecified details. Which are crucial, as most of programming is making the computer not do the various things it can do.
> most of programming is making the computer not do the various things it can do
This is a very stimulating way of putting it!
> I think producing a detailed enough specification requires the same or even a larger amount of work than writing the code
Our team has started dedicating much more time to writing documentation for our SaaS app. No one seems to want to do it naturally, but there is very large potential in opening your system to machine automation, not just for coding but for customer-facing tooling. I saw a preview of that possible future using NewRelic, where they have an AI chat use their existing SQL-like query language to build tables and charts from natural-language queries right in the web app. Theirs kinda sucks, but there's so much potential there that it is very likely going to change how we build UIs and software interfaces.
Plus, having lots of documentation on how stuff works also helps sales, support, and SEO.
My thought too. To extend this: coding agents will make code cheap and specifications cheaper, but may also invert the relative opportunity cost of not writing a good spec.
> The user never specifies the goal clearly enough for the AI system to work.
This is sort of a fundamental problem with all AI. If you tell a robot assistant to "make a cup of tea", how's it supposed to know that that implies "don't break the priceless vase in the kitchen" and "don't step on the cat's tail", et cetera. You're never going to align it well enough with "human values" to be safe. Even just defining in human-understandable terms what those values are is a deep existential question of philosophy, let alone specifying it for a machine that's capable of acting in the world independently.
The "natural overthinking increases incoherence" finding matches my daily experience with Claude.
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
Following up: I built a tool, "wobble"[1], to measure this. It parses ~/.claude/projects/*.jsonl session transcripts, extracts skill invocations plus the actual commands executed, and calculates bias/variance per the paper's formula.
I ran it on my sessions. Result: none of my skills scored STABLE. The structural predictors of high variance: numbered steps without a clear default, options without a (default) marker, content over 4k chars (the overthinking zone), and missing constraint language.
[1] https://github.com/anupamchugh/shadowbook (bd wobble)
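I can't speak to wobble's internals, but the bias/variance split it refers to can be sketched over repeated runs of the same skill like this, scoring each run against the intended behavior in [0, 1] (my approximation, not necessarily the paper's exact formula):

```python
from statistics import mean, pvariance

def bias_variance(scores: list[float]) -> tuple[float, float]:
    """scores: one entry per repeated run of the same skill/prompt, where
    1.0 = exactly the intended behavior and 0.0 = completely off.
    Bias is the systematic gap from the target; variance is run-to-run spread."""
    return 1.0 - mean(scores), pvariance(scores)

# A skill that always drifts the same way: high bias, low variance.
print(bias_variance([0.4, 0.4, 0.5, 0.4]))  # ~(0.58, 0.002)
# A skill that is sometimes perfect, sometimes a hot mess: lower bias, high variance.
print(bias_variance([1.0, 0.1, 0.9, 0.2]))  # ~(0.45, 0.16)
```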
The bias-variance framing here maps well to what I've observed building AI-assisted workflows.
In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.
Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.
The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.
One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.
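To put made-up numbers on that trade: compare one large call that succeeds, say, 60% of the time with five small, well-specified calls that each succeed 95% of the time and can be retried individually. The retries only help because each subtask is small enough to verify on its own, which is exactly the orchestration overhead mentioned above.

```python
# Hypothetical success rates, purely to illustrate the shape of the trade-off;
# real numbers depend entirely on the task, the model, and the decomposition.
p_single = 0.60            # one large, high-variance inference call
p_sub, n_subtasks = 0.95, 5

all_pass_first_try = p_sub ** n_subtasks                     # ~0.77
with_one_retry_each = (1 - (1 - p_sub) ** 2) ** n_subtasks   # ~0.99

print(f"single call:            {p_single:.2f}")
print(f"5 subtasks, no retries: {all_pass_first_try:.2f}")
print(f"5 subtasks, 1 retry:    {with_one_retry_each:.2f}")
```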
This matches my intuition. Systematic misalignment seems like it could be prevented by somewhat simple rules like the Hippocratic oath or Asimov's Laws of Robotics, or rather probabilistic Bayesian versions of these rules that take into account error bounds and risk.
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming better Bayesians, which implies being great at calibrating confidence intervals over their interpretations and their reasoning, and basically gaining a superhuman ability to evaluate the bounds of ambiguity and risk.
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
It's nice seeing this with Sohl-Dickstein as the last author after reading this blog post from him some time ago: https://sohl-dickstein.github.io/2023/03/09/coherence.html
Longer thinking sections have more space for noise to accumulate?
> This suggests that scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen.
This is a big deal, but are they only looking at auto-regressive models?
Well, this sounds like a "no shit, Sherlock" statement:
> Finding 3: Natural "overthinking" increases incoherence more than reasoning budgets reduce it. We find that when models spontaneously reason longer on a problem (compared to their median), incoherence spikes dramatically. Meanwhile, deliberately increasing reasoning budgets through API settings provides only modest coherence improvements. The natural variation dominates.
Language models are probabilistic, not deterministic. Therefore incoherence _by definition_ increases as a response becomes lengthier. This is not true for humans, who tend to act and communicate deterministically. If I ask a human to read a PDF and then ask "is the word 'paperclip' in the PDF?", the human will deterministically provide a yes/no answer, and no matter how many times we repeat the process, they will provide the same answer consistently (not due to autocorrelation, because this can be done across different humans). LMs will give a probabilistic response, dependent on the training itself: with a very well trained model we may get a 99% reliable outcome, which means that out of 100 simulations it will give you the wrong answer once. We have no clue about the "probabilistic" component for LMs; however, simulations could be done to research this. Also, I would be very curious about autocorrelation in models. If a human did a task and came to the conclusion "yes", then they will always respond, with an increasing amount of eye-rolling, to the same task: "yes".
Also, imagine the question "is the sky blue?". Answer 1: "Yes." This has zero incoherence. Answer 2: "Yes, but sometimes it looks black, sometimes blue." While this answer seemingly has zero incoherence, the probability of increased incoherence is larger than zero, given that answer generation itself is probabilistic. Answer generation by humans is not probabilistic.
Therefore, probability driven LMs (all LMs today are probability driven) will always exhibit higher incoherence than humans.
I wonder if anybody would disagree with the above.
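Since the comment says "simulations could be done to research this", here is a trivial one with a made-up reliability number, treating each repeated query as an independent draw that is correct 99% of the time and asking how often all the repeats agree:

```python
import random

def consistent_rate(p_correct: float, repeats: int, trials: int = 100_000) -> float:
    """Fraction of trials in which `repeats` independent queries all give
    the same answer (all correct or all wrong)."""
    agree = 0
    for _ in range(trials):
        answers = [random.random() < p_correct for _ in range(repeats)]
        agree += all(answers) or not any(answers)
    return agree / trials

random.seed(0)
# Even a "99% reliable" model visibly disagrees with itself when re-asked:
print(consistent_rate(0.99, repeats=5))   # roughly 0.95
print(consistent_rate(0.99, repeats=20))  # roughly 0.82
```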
"model failures become increasingly dominated by incoherence rather than systematic misalignment."
This should not be surprising.
Systematic misalignment, i.e., bias, is still coherent and rational, if it is to be systematic. That would require that the AI reason, but AI does not reason (let alone think); it does not do inference.
The models they tested are already way behind the current state-of-the-art. Would be interesting to see if their results hold up when repeated with the latest frontier models.
I think we have all seen the latest models turn into a hot mess.
When humans dream, we are disconnected from the world around us. Without the grounding that comes from being connected to our bodies, anything can happen in a dream.
It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.
It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.
Also, since dreams are built from combinations of experiences the brain already knows, we cannot die in a dream, because our brain does not know how to replicate what it would feel like to be dead. Basically, LLMs also cannot produce truly novel ideas.
The findings are based on older models. Assuming recent models behave similarly, what kind of prompt style should one use to improve the outcome and avoid the increase in variance, especially when you ask a model to solve really complex problems?
I guess it’s reassuring to know Hanlon’s Razor holds for AGI too.
Intelligence is inherently not aligned with humanity, you mean? Why am I not shocked?
My ignorant question: they did bias and variance noise; how about quantisation noise? I feel like sometimes agents are "flip-flopping" between metastable, divergent interpretations of the problem or solution.
This is very interesting research and a great write-up.
I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.
It feels a bit like a cult.
Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.
For some reason the article reads to me like “AI is not evil, it just has accidents when it loses coherence.” Sounds a lot like liability shifting.
They compared it to industrial accidents. I don't think a software company would try to shift liability by comparing themselves to factory explosions and chemical spills.
I don't know why it seems so hard for these guys to understand: you scorecard every step of a new strategy for closing the distance to the goal, and if you have multiple generated forward options with no good weight, you spawn a new agent and multiple paths. Then you score all the terminal branches and prune.
LLMs aren’t constrained to linear logic like your average human.
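As I read it, this is essentially beam search over agent trajectories; a minimal sketch with hypothetical `expand` and `score` functions (nothing provider-specific):

```python
def expand(state: str, k: int) -> list[str]:
    """Hypothetical: spawn k agents that each propose one forward step."""
    raise NotImplementedError

def score(state: str) -> float:
    """Hypothetical: scorecard a state by how much it closes distance to the goal."""
    raise NotImplementedError

def beam_search(start: str, width: int = 3, depth: int = 4) -> str:
    frontier = [start]
    for _ in range(depth):
        # Branch every frontier state, then score and prune back to the beam width.
        candidates = [nxt for s in frontier for nxt in expand(s, k=width)]
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return max(frontier, key=score)  # best terminal branch
```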
I feel vindicated when I say that the superintelligence control problem is a total farce, we won't get to superintelligence, it's tantamount to a religious belief. The real problem is the billionaire control problem. The human-race-on-earth control problem.
I don’t believe the article makes any claims on the infeasibility of a future ASI. It just explores likely failure modes.
It is fine to be worried about both alignment risks and economic inequality. The world is complex, there are many problems all at once, we don’t have to promote one at the cost of the other.
Yeah, article aside, looking back on all the AGI stuff from the last year or so really puts our current moment in perspective.
This whole paradigm of AI research is cool and all, but it's ultimately a simple machine that probabilistically forms text. It's really good at making stuff that sounds smart, but like an AI-generated picture, it falls apart the harder you look at it. It's good at producing stuff that looks like code and often kinda works, but based on the other comments in this thread, I don't think people really grasp how these models work.
[flagged]
Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
I will, I apologize, and I love the site.
I do try to contribute constructively but am annoyed at getting downvote-hammered by what I perceive as an echo chamber.
It is very possible that I lack the social skills to understand how what I am doing is inappropriate. I will read the guidelines.
Sorry, and thanks for your efforts.