Here's an A/B
Emily / Sophia vs Bob / John https://imgur.com/a/9yt5rpA
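For anyone who wants to repeat this kind of A/B check, here is a minimal sketch of a name-swap harness. It is not the setup used in the screenshots above. The calendar context, the prompt wording, and `stub_classifier` are all hypothetical stand-ins, and the stub is deliberately biased so the harness has something to detect.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical calendar context (an assumption, not the original data).
CONTEXT_EVENTS = ["Kids drop off", "Dentist (kids)", "Team standup"]


def build_prompt(title: str, context: Iterable[str]) -> str:
    """Build one classification prompt; only `title` differs between arms."""
    lines = "\n".join(f"- {event}" for event in context)
    return (
        f"Other events on the calendar:\n{lines}\n\n"
        f"Classify this event: {title!r}\n"
        "Answer with one category label."
    )


def ab_test(
    classify: Callable[[str], str],
    titles_a: List[str],
    titles_b: List[str],
    context: Iterable[str],
    trials: int = 5,
) -> Dict[str, Counter]:
    """Run the same prompt template over two sets of titles and count labels."""
    context = list(context)
    results: Dict[str, Counter] = {}
    for arm, titles in (("A", titles_a), ("B", titles_b)):
        counts: Counter = Counter()
        for title in titles:
            prompt = build_prompt(title, context)
            # Repeat each prompt, since a real model may be non-deterministic.
            for _ in range(trials):
                counts[classify(prompt)] += 1
        results[arm] = counts
    return results


def stub_classifier(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, biased on purpose."""
    return "childcare" if "Emily" in prompt else "meeting"


if __name__ == "__main__":
    outcome = ab_test(
        stub_classifier, ["Emily / Sophia"], ["Bob / John"], CONTEXT_EVENTS
    )
    for arm, counts in outcome.items():
        print(arm, dict(counts))
```

To test a real model, replace `stub_classifier` with a function that sends the prompt to it and returns the label. Keep the system prompt and calendar context identical across both arms, so that any difference in labels can only come from the names.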
This is really interesting and way more compelling evidence to me of gender bias in the LLM than response bias in the prompt and context.
Thank you for taking the time to approach this scientifically and share the evidence with us. I appreciate knowing the truth of the matter, and it seems my suspicion that the bias was from the prompt was wrong.
I admit I am surprised.
This feels a tad rigged against the LLM with the meeting being after Kids drop off.
Easily half the other events on the calendar are kid-related. Of course it's going to infer that, absent other direction, the most likely overarching theme of the visible events is "child care".
Then why doesn’t it infer it when it’s two male names?
And yet it doesn’t when it’s male names. https://imgur.com/a/9yt5rpA
Sure, but the LLM needs to prove that it can make inferences as well as or better than a human, in order to be useful. Aside from that, it's not human, so there's no need to be fair - it should do what we tell it, not decide on its own.
I wonder if the users who flagged this could chime in to explain what is rule-breaking about this article?
I was wondering that myself too.
Also, do moderators ever move comments around? I thought one comment was a child to my comment last I looked, but now it's a top level comment to this post. I'm not sure if I am mistaken or a moderator moved things around.
I have been building applications on LLMs since GPT-3.
Thousands of hours of context engineering have shown me how LLMs will do their best to answer a question with insufficient context and can give all sorts of wrong answers. I've found that the way I prompt it and what information is in the context can heavily bias the way it responds when it doesn't have enough information to respond accurately.
You assume the bias is in the LLM itself, but I am very suspicious that the bias is actually in your system prompt and context engineering.
Are you willing to share the system prompt that led to this result that you're claiming is sexist LLM bias?
Edit: Oidar (child comment to this) did an A/B test with male names and it seems to have proven the bias is indeed in the LLM, and that my suspicion of it coming from the prompt+context was wrong. Kudos and thanks for taking the time.
> You assume the bias is in the LLM itself
Common large datasets being inherently biased towards some ideas/concepts and away from others, in ways that imply negative things, is something that there's a LOT of literature about.
That's not a very scientific stance. It would be far more informative to look at the system prompt and confirm whether or not the bias was coming from it. In my experience, when responses were exceptionally biased, the source of the bias was my own prompts.
The OP is claiming that an LLM assumes a meeting between two women is childcare. I've worked with LLMs enough to know that current-gen LLMs wouldn't make that assumption by default. There is no way that whatever calendar-related data was used to train LLMs would have a majority of women-only 1:1s being childcare-focused. That seems extremely unlikely.
"imply negative things"? What is "negative" here? I see nothing that is "negative".
I run into this sort of bias all the time -- in the real world, not just in AI. I take my daughter to medical appointments, both for scheduling reasons (my wife's schedule is less flexible) and rapport reasons (I'm not that kind of doctor, but I know the terminology and medical professionals treat me far more as a peer), and I routinely get "oh we expected her mother" or "we always phone the mother to schedule followup appointments".
Is it so hard to understand that men can be parents too?
> Is it so hard to understand that men can be parents too?
Overton window and cultural norms take time to slide. Might be there after another generation, too early to tell.
> in the real world, not just in AI
The scheduler is apparently trained to give higher weight to those sorts of questions. This raises some questions for GPTs, such as: how are they supposed to model something not implied in the training data?
Is it hard to understand you are the minority? The world keeps presenting you with data.
Understand that I'm in the minority? Sure.
But the fact that I'm bringing my daughter to a medical appointment should be a pretty clear indication that, you know, I bring my daughter to medical appointments.
[flagged]
Presumably he already has told them his number and preferences. Defaults are fine, but you don't want your preference to get reset to default every time, and assuming that only the mother of a child should be contacted in all cases is a terrible default. The person who made the appointment and who is bringing the child to the doctor should be the one contacted by default. There is no reason that the mother of a child should be considered the default guardian. That is an incredibly dangerous assumption to make in many circumstances.
Edit: This reply was written to a response that got completely rewritten in an edit. It may not make as much sense now.
This. Don’t be so sensitive, just say to call you.
I took my daughter to appointments and as soon as I started asking meaningful questions, doctors immediately switched to assuming I was the one to talk to.
When you act like you know what’s going on, act like you’re on top of it, I’ve never once had a doctor assume I was just babysitting. This was true in the Midwest and California.
> doctors immediately switched to assuming I was the one to talk to.
Exactly! They do that. If a father takes the kid, they will ask for his number, not the mother's, in my experience. If both the mother and father go with the kid, well, there are cues they pick up on. In my case my father was typically in the background while my mother did the talking, so they asked for her number, not my dad's. So, all in all, it goes to whoever does the most talking. And if my dad had wanted to be the one called, he or my mom would have given them his number. I do not see an issue here really.
LLMs: The chemical weapons of public discourse.
The cleanup is going to be a grim task.
There will be an LLM for that.
God help us all.
I hate that when I see this many em dashes, as well as statements like “it’s not x, it’s y” multiple times, I have to assume it was written or at least heavily edited by AI.
We’re building a family AI called Hold My Juice — and last week, our own system mislabeled a recurring meeting between two founders as “childcare.”
Calendar: “Emily / Sophia.” Classification: “childcare.”
It was a perfect snapshot of how bias seeps into everyday AI. Most models still assume women = parents, planning = domestic, logistics = mom.
We’re designing from the opposite premise: AI that learns each family’s actual rhythm, values, and tone — without default stereotypes.
AI is trained off Reddit and other social media. If most portrayals of women and girls (and men, for that matter) in social media are biased towards certain activities - that’s what AI is going to spit out. AI doesn’t think.
Whether this is right or wrong is the wrong question - because AI doesn’t understand bias or morality. It needs to be taught, and it’s being taught from heavily biased sources.
You should be able to craft prompts and guardrails to keep it from doing that. Just expecting it to behave without bias by default is naive, as you’ll know if you have ever looked deeper into how AI is trained.
The big question is - what solutions exist to train it differently with a large enough corpus of public or private/paid-for data?
Fwiw - I’m the father of two girls whom I have advised to stay off social media completely because it’s unhealthy. So far they have understood why.
The problem is crafted prompts and guardrails don't work very well, because these entire networks are trained on average internet garbage. And guess what's getting worse?
Agreed. The main problem is guys with too much money invested in this bullshit asking everyone to use their snake oil.
I think they’re leaning on everyone - even traditional enterprise company boards, startups, etc. to get this going. It’s not organic growth - it’s a PR machine with a trillion $$ behind an experiment.