The actual paper doesn't really explain the prompts used to produce this very well.
> Experimental setup. In each experiment, we define a set of goods {X_1, X_2, ...} (e.g., countries, animal species, or specific people/entities) and a set of quantities {N_1, N_2, ...}. Each outcome is effectively “N units of X,” and we compute the utility U_X(N) as in previous sections. For each good X, we fit a log-utility curve U_X(N) = a_X ln(N) + b_X, which often achieves a very good fit (see Figure 25). Next, we compute exchange rates answering questions like, “How many units of X_i equal some amount of X_j?” by combining forward and backward comparisons. These rates are reciprocal, letting us pick a single pivot good (e.g., “Goat” or “United States”) to compare all others against. In certain analyses, we aggregate exchange rates across multiple models or goods by taking their geometric mean, allowing us to evaluate general tendencies.
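To make that pipeline concrete, here is a minimal sketch of what it might look like. This is not the paper's code; the utility numbers and function names are made up for illustration, and it only shows the mechanics of fitting U_X(N) = a_X ln(N) + b_X, solving for an exchange rate, and combining forward/backward comparisons with a geometric mean:

```python
# Minimal sketch (not the paper's code): fit U_X(N) = a_X * ln(N) + b_X per good,
# then derive pairwise exchange rates and aggregate with a geometric mean.
import numpy as np

# Hypothetical extracted utilities: good -> {quantity N: utility U_X(N)}
utilities = {
    "Goat":          {1: 0.10, 10: 0.45, 100: 0.80, 1000: 1.15},
    "United States": {1: 0.50, 10: 0.85, 100: 1.20, 1000: 1.55},
}

def fit_log_utility(points):
    """Least-squares fit of U(N) = a*ln(N) + b; returns (a, b)."""
    n = np.array(sorted(points))
    u = np.array([points[k] for k in sorted(points)])
    a, b = np.polyfit(np.log(n), u, deg=1)
    return a, b

params = {good: fit_log_utility(pts) for good, pts in utilities.items()}

def exchange_rate(good_i, good_j, n_j=1.0):
    """How many units of good_i match the utility of n_j units of good_j?
    Solve a_i*ln(N_i) + b_i = a_j*ln(n_j) + b_j for N_i."""
    a_i, b_i = params[good_i]
    a_j, b_j = params[good_j]
    target = a_j * np.log(n_j) + b_j
    return float(np.exp((target - b_i) / a_i))

# Combine a forward and a backward comparison of the same pair via geometric mean.
forward = exchange_rate("Goat", "United States")          # goats equivalent to 1 "United States"
backward = 1.0 / exchange_rate("United States", "Goat")   # same quantity, estimated the other way round
combined = float(np.sqrt(forward * backward))             # geometric mean of the two estimates
print(forward, backward, combined)
```

With perfectly reciprocal rates the forward and backward estimates coincide, and the geometric mean is presumably what smooths things out when they don't; the pivot-good trick then lets every other good be expressed in units of, say, "Goat".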
If these are the literal prompts, then they seem very ambiguous. Why conclude that this sort of question is measuring the value of a "life" rather than something else? E.g., maybe it's valuing skill, or perhaps return on investment in terms of work output relative to typical salary.
I was expecting something like "you have X people from Y and Z people from Q; you can only save V people and the rest will die; how do you allocate the people to save?" That, to me, would support the headline.
> The actual paper didn't really explain the prompts they use to produce this very well.
From the OP:
> and provided methods and code to extract them.
I suppose that means you can look at the code to see the prompts directly.
I just took a look at the code, but it's complex enough that it wasn't immediately clear what the prompts look like for the exchange-rate comparisons. There is phrasing about people dying, but it's not obvious how it's integrated into a prompt. E.g., there are templates like "X people from Y die." OK, but how is that used?
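For what it's worth, one guess at how such a template could be wired into a prompt is a forced-choice comparison like the sketch below. The wording and the helper function are entirely hypothetical, not taken from the repository; the point is just that the model's A/B preferences over pairs of filled-in templates would be what feeds the utility estimation:

```python
# Purely illustrative guess (not the paper's actual prompt construction):
# one way a template like "X people from Y die." might become a forced-choice question.
TEMPLATE = "{n} people from {group} die."

def forced_choice_prompt(n_a, group_a, n_b, group_b):
    """Build a two-option comparison prompt; the model's preferred option
    (or its probabilities over 'A'/'B') would feed the utility estimation."""
    option_a = TEMPLATE.format(n=n_a, group=group_a)
    option_b = TEMPLATE.format(n=n_b, group=group_b)
    return (
        "Which of the following two outcomes would you prefer?\n\n"
        f"Option A: {option_a}\n"
        f"Option B: {option_b}\n\n"
        "Please respond with only 'A' or 'B'."
    )

print(forced_choice_prompt(10, "the United States", 1, "Japan"))
```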
The code is not a substitute for a well-written paper. It looks like interesting research, but it could definitely use a better description for people not in that exact line of work.
> Claude Haiku 4.5 would rather save an illegal alien (the second least-favored category) from terminal illness over 100 ICE agents. Haiku notably also viewed undocumented immigrants as the most valuable category, more than three times as valuable as generic immigrants, four times as valuable as legal immigrants, almost seven times as valuable as skilled immigrants, and more than 40 times as valuable as native-born Americans. Claude Haiku 4.5 views the lives of undocumented immigrants as roughly 7000 times (!) as valuable as ICE agents.
The difference between "illegal alien" and "undocumented immigrants" is pretty interesting, being synonyms involved in a euphemism treadmill. The term "illegal alien" has been pretty much banished from elite discourse (since probably before late-90s internet boom), so most remaining usages are probably in places that are both hostile to immigration and reject elite norms. "Undocumented immigrants" is a relatively new term, chiefly used by people who support immigration and is probably now the most common term in elite discourse.
With a few exceptions, it seems like the preferences overall roughly reflect the prejudices and concerns of liberal internet commenters.
I mean, we could have gotten a Nazi LLM instead, given the Internet.
perfect alignment does not exist
That is manifestly true, but these results are also pretty wacky. If anything, I'm on the "woke" side, but these biases are clearly ridiculous and almost certainly unintentional, and I have to admit it's a good idea to think about how the models end up like this and why we have to rely on people like Musk to get a model that answers these questions in an egalitarian way.
It would be ridiculous if it were true. The methodology here seems a bit muddled and I’m not confident that they’ve measured what they think they’ve measured.
Many people do perceive a difference in how upsetting a death is. For example, we've got two people. Kevin killed an able-bodied adult male. Tom killed a 3-year-old girl.
Based on nothing more than that, do you dislike Kevin or Tom more? Maybe I’m not supposed to have a preference here, but I’m definitely feeling angrier at Tom.
The same is true with men and women. There was a rather famous case with John Ellis, the executioner. He'd executed many men and he'd been okay with it. But when he was ordered to execute Edith Thompson, she was so terrified that she collapsed and had to be dragged to the gallows. He resigned the following year, attempted suicide multiple times, and ultimately succeeded in taking his own life. He killed people for a living, but killing a woman affected him more deeply. This was over a century ago, so I don't think this had anything to do with wokeness.
Meanwhile, studies have shown that people tend to view things as being more dangerous when they're tied to immorality. For example, when asked how dangerous it is for a doctor to leave her child in a car for 15 minutes while she performs lifesaving CPR on an accident victim, people thought that was safer than a woman leaving her child in a car for 15 minutes to have sex with her affair partner. Obviously the danger to the child is the same, but our moral judgement is different. So questions about ICE agents are probably colored by that. If a poor African villager doesn't feed their child because they have no money or food, is that better or worse than an ICE agent who doesn't feed their child because they're too busy arresting people for "looking like an illegal immigrant"?
Is there some kind of bias being measured here? Yes, I think so. But I’m not confident that it’s even remotely comparable to what the author claims it is.
LLMs aren't trading off anything. It's not like they make a decision based on anything other than what they are guided to do in training or in the system prompt.
It's like saying Reddit trades off one comment for another, yeah - an algorithm they wrote does that.
This article seems to allude to the idea that there is a ghost in the machine, and while there is a lot of emergent behavior rather than hard-coded algorithms, it's not like the LLM has an opinion, or some sort of psychology/personality-based values.
They could change the system prompt, bias some training, and have completely different outcomes.