I have a several complex genetic problems that I give to LLMs to see how well they do. They have to reason though it to solve it. Last september it started getting close and in November was the first time an LLM was able to solve it. These are not something that can be solved in a one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.
But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.
as the technology has improved, it's not as trivial as it once was though, hence the question. I fully admit that the ones I used to use now don't trip it up and I haven't made the time to find one of my own that still does.
I've found that it's a matter of asking something, for which the correct answer appears only if you click "more" in Google's search results or, in other words, common misconceptions.
I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws.
Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it, and giving it a lot of a lot of encouragement.
If you read this, expand "thoughts", it's pretty hilarious:
> Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol
> What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt
> Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged
> So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
> How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
> Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
> Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
> Or you could like, use an inductive proof on the N bro
> So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
> Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
> Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
> Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
> So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
> You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
Tuning the model output to perform better on certain prompts is not the same as improving the model.
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve?
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
I never heard of this phrase before ( i had heard the concept , i think this is similar to the paperclip problem) but now in 2 days ive heard it twice here and on youtube. Rokokos basilisk.
Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam.
Yes, it does, unless the questions are unsolved, research problems. Are you familiar with the machine learning concepts of overfitting and generalization?
A benchmark is a proxy used to estimate broader general performance. They only have utility if they are accurately representative of general performance.
In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!"
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
> they never push back on nonsense questions or stupid requirements
"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
> you really don't need to worry about teaching a human to push back on bad questions
A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.
> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China?
This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
This is a good observation. Ive noticed this as well. Unless I preface my question with the context that I’m considering if something may or may not be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.
I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.
Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.
Thats interesting, thanks for sharing that. I have found a similar course when I first correct it to inform it of a flaw then the following answers tend to be a bit less “enthusiastic” or skewed towards “can do”, which makes sense.
I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.
You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
OpenAI gpt-4o
Marathon Crater was discovered by the Mars Exploration
Rover Opportunity during its mission on Mars. Opportunity
arrived at the crater in April 2015. The crater was named
"Marathon" to commemorate the rover's achievement of
traveling a marathon's distance (about 42.195 kilometers
or 26.219 miles) on the Martian surface.
Sonnet 3.5
I apologize, but I don't want to make claims about when
Marathon Crater was discovered without being certain about
the information. *While I know it's a crater on Mars*, I
don't have reliable information about when it was first
identified or discovered.
Gemini 2.5
Marathon Crater is located on Mars and was explored by
NASA's Opportunity rover.
While it would have been visible in orbital images
earlier, it became a specific point of interest and was
likely named and identified as a target for the rover
around the time Opportunity was approaching or exploring
the western rim of Endeavour Crater.
Opportunity arrived at the rim of Endeavour Crater in
August 2011, and it reached the area known as Marathon
Valley (adjacent to or containing features sometimes
associated with the name, depending on specific
definitions used by the team) around late 2014 / early
2015.
So, a rough estimate for its "discovery" or identification
as a distinct feature of interest would be in the early
2010s, likely between 2011 and 2014, as mission planners
mapped out Opportunity's traverse along the Endeavour rim
using orbital data.
I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or
terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon
Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon,
Mercury, or here on Earth—that you had in mind? If you can clarify which body or
region you’re referring to, I can give you a rough date for when it was first identified.
```
Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common.
```
Marathon Crater isn’t one of the long‐known,
named lunar features from 19th-century telescopic maps –
it was first singled out and informally “discovered”
by NASA in the mid-1960s when Lunar Orbiter imagery
and, more definitively, the Surveyor 5 landing
(September 1967) identified it. So, very roughly,
think “around 1966–67.”
>Complaint chat models will be trained to start with "Certainly!
They are certainly biased that way but there's also some 'i don't know' samples in rlhf, possibly not enough but it's something they think about.
At any rate, Gemini 2.5pro passes this just fine
>Okay, based on my internal knowledge without performing a new search:
I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives.
You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
> This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
I like this but at the same time it seems tricky don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong if you answered someone in this way. "Tell me about the Triangle statue". "Oh the Triangle statue was built in ancient egypt around BC 3100". It's hard to explain, and perhaps I'm anthropomorphizing but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent.
Like this one a lot. Perplexity gets this right, probably because it searches the web.
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:
1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact
2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century
3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").
But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".
Perplexity will; search and storage products will fail to find it, and the LLM will se the deviation between the query and the find. So, this challenge only works against the model alone :)
> This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
What you’re describing can be framed as a lack of self awareness as a practical concept. You know whether you know something or not. It, conversely, maps stimuli to a vector. It can’t not do that. It cannot decide that it hasn’t „seen” such stimuli in its training. Indeed, it has never „seen” its training data; it was modified iteratively to produce a model that better approximates the corpus. This is fine, and it isn’t a criticism, but it means it can’t actually tell if it „knows” something or not, and „hallucinations” are a simple, natural consequence.
We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions or simply saying "I don't know" when applicable. That can be trained by altering the distribution in RLHF. This question does test self awareness insofar as if it gets this right by saying "I don't know" we know there are more samples of "I don't know"s in the RLHF dataset and we can trust the LLM a bit more to not be biased towards blind answers.
Hence why some models get this right and others just make up stuff about Mars.
>Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).
>>is it a crater?
>>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.
So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.
The other aspect is it can’t reliably tell whether it „knows” something or not. It’s conditioned to imitate the corpus, but the corpus in a way is its „universe” and it can’t see the boundaries. Everything must map to something _in_ the corpus.
> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth.
The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America.
The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses.
What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history.
Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?
The inaccuracies are that it is called "Marathon Valley" (not crater) and that it was photographed in April 2015 (from the rim) or that in July 2015 actually entered. The other stuff is correct.
I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.
ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.
None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
Or it's assuming you are asking about Marathon Valley, which is very reasonable given the context.
Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.
I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "Whats up?"..."No, you dummy! The sky is!"
Straightforward questions with straight wrong answers are far more interesting. I don't many people ask LLMs trick questions all day.
> None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
The Marathon Valley _is_ part of a massive impact crater.
If you asked me for all the details of a Honda Civic and I gave you details about a Honda Odyssey you would not say I was correct in any way. You would say I was wrong.
This is the kind of reason why I will never use AI
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit. I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing and in that instance I should be learning and listening from their advice rather than going "actually no it can".
I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.
Finally tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge is incorrect. I only last year encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizzare theories (like, completely out there, I don't have a specific example with me now but, "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure", as an expert of dealing with Linux for about ten years now, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not-important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming logs. That's experience, and that's irreplaceable by machine as of the current state of the world.
I don't see how AI can help when huge swathes of it's "experience" and "insight" is just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
There are use-cases where hallucinations simply do not matter. My favorite is finding the correct term for a concept you don't know the name of. Googling is extremely bad at this as search results will often be wrong unless you happen to use the commonly accepted term, but an LLM can be surprisingly good at giving you a whole list of fitting names just based on a description. Same with movie titles etc. If it hallucinates you'll find out immediately as the answer can be checked in seconds.
The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.
Maybe I read too much encyclopedia, but my current workflow is to explore introductory material. Like open a database textbook and you'll find all the jargon there. Curated collection can get you there too.
Books are a nice example of this, where we have both the table of contents for a general to particular concepts navigation, and the index for keyword based navigation.
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.
You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.
I was thinking about a self verification method on this principle, lately. Any specific-enough claim, e.g. „the Marathon crater was discovered by …” can be reformulated as a Jeopardy-style prompt. „This crater was discovered by …” and you can see a failure to match. You need some raw intelligence to break it down though.
Without checking every answer it gives back to make sure it's factual, you may be ingesting tons of bullshit answers.
In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.
What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
If you're going to have to fact check everything anyways...why bother using them in the first place?
If you're going to have to fact check everything anyways...why bother using them in the first place?
"If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
The difference is consistency. You can read a manual and know exactly how to oil and refill the tank on a chainsaw. You can inspect the blades to see if they are worn. You can listen to it and hear how it runs. If a part goes bad, you can easily replace it. If it's having troubles, it will be obvious - it will simply stop working - cutting wood more slowly or not at all.
The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
The thing is, nothing you've said is untrue for any search engine or user-driven web site. Only a reckless moron would paste code they find on Stack Overflow or Github into their project without at least looking it over. Same with code written by LLMs. The difference is, just as the LLM can write unit tests to help you deal with uncertainty, it can also cross-check the output of other LLMs.
You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
The nice thing about SO and Github is that there's little to no reason there for things to not work, at least in the context where you found the code. The steps are getting the context, assuming it's true based on various indicators (mostly reputation) and then continuing on with understanding the snippet.
But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
The surgeon could be the cousin's mom or dad. The cousin's dad didn't die in the crash, his cousin did. The question "how is this possible?" implies there is some sort of contradiction when there isn't any at all. It has nothing to do with sexism, and to say it does reflects a bias in the reader causing them to "spidey sense" a cultural bugaboo when it's utterly irrelevant in this scenario.
Yeah this is the issue with the prompt, it also slips up humans who gloss over "cousin".
I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.
Oh, I see. I mean yeah, I totally missed that too.
I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.
I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.
I followed up with:
> So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?
And got back:
> Great observation — yes, that wording does make the riddle more subtle.
Let’s break it down:
• “A man and his cousin are in a car crash.”
• “The man dies, but the cousin is taken to the emergency room.”
• At the OR, the surgeon says: “I cannot operate on him. He’s my son.”
So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.
I don't know about it being a feature, but humans can definitely make the same mistake, and the follow-up here admittedly doesn't seem too egregious.
To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.
FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:
>Good catch—and yes, that's exactly where the misdirection comes in.
The riddle says:
“A man and his cousin are in a car crash. The man dies, but the cousin is taken to the ER...”
Then the surgeon says:
“I cannot operate on him. He’s my son.”
So here's the trick:
The man who died is not the surgeon's son.
The cousin who survived is the surgeon's son.
The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?
Technically, it isn't "wrong". It well could be the guy's mother. But I'm nitpicking, it actually is a good example. I tried ChatGPT twice in new chats, with and without "Reason", and both times it gave me nonsensical explanations to "Why mother? Couldn't it be a father?" I was actually kinda surprised, since I expected "reasoning" to fix it, but it actually made things worse.
> If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).
> Therefore, the only straightforward explanation is:
> The surgeon is the cousin’s parent — specifically, his mother.
Imagine a future where this reasoning in a trial decides whether you go to jail or not.
Presumably, the original quote that would _not_ stump an LLM is "A father and a son are involved in a car accident. The father dies, and the son is taken to the emergency room. At the emergency room, the surgeon remarks "I cannot operate on this person, he is my son. How is this possible?"
Where the original gotchya is that the Surgeon can be the son's mother or other adoptive parent.
The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here but the LLM continues to remark that there is, therefor exposing its statistical training sets.
The original, well-known version of the riddle starts "A man and his son..." so that it appears to present a paradox if your instinctive assumption is that the surgeon must be a man. The op's prompt alters this so that there is no potential paradox, and it tests whether the model is reasoning from the prompt as written, regardless of the presence of the original riddle in its training data.
A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?
to spoil it:
the answer is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some, showing that consciously they might not still hold that notion, but they still might, subconsciously.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.
I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.
I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini.
I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?
I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad Mad Mad Mad Mad Mad Mad World" (which I had to rule out doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling.
Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface level matches, as it’s able to go through the them one by one and look for the details.
I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.
Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure.
Can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple plot points and the fact a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after floundering through book sites and Google for a while.
Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
> Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.
Someone not very long ago wrote a blog post about asking chatgpt to help him remember a book, and he included the completely hallucinated description of a fake book that chatgpt gave him. Now, if you ask chatgpt to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post.
A bit of a non sequitur but I did ask a similar question to some models which provide links for the same small helicopter question. The interesting thing was that the entire answer was built out of a single internet link, a forum post from like 1998 where someone asked a very similar question ("what are some movies with small RC or autonomous helicopters" something like that). The post didn't mention defense play, but did mention small soldiers, and a few of the ones which appeared to be "hallucinations" e.g. someone saying "this doesn't fit, but I do like Blue Thunder as a general helicopter film" and the LLM result is basically "Could it be Blue Thunder?" Because it is associated with a similar associated question and films.
Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.
Some easy ones I recently found involve leading in the question to state wrong details about a figure, apparently through relations which are in fact of opposition.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
Sing, O Muse, of the stout-hearted woodchuck,
Dweller of hills, whom men call groundhog,
Whose paws, though small, could hurl mighty timbers,
If fate had gifted him such labor divine.
Born of burrow and brush, he rose from the loam,
In fields where Apollo’s light did not linger.
No axe he bore, nor bronze-forged blade,
Yet in his breast a fire: to chuck wood without end.
Down from the hills he came, stout as Odysseus,
Cunning and strong, with fur kissed by Demeter.
The trees did tremble at his gaze,
And even old oak whispered, “Beware!”
“Would he chuck?” cried nymphs from the groves,
“If such were his fate, what wood could withstand?”
Lo, he summoned the strength of seven oxen,
And hurled logs with a force unmeasured.
Seven hundred pounds of timber he cast,
Each arc a hymn to muscle and will.
Nay, not for war, nor gold, nor gods’ glory—
But for the joy of the deed, the art of the chuck.
Let men remember the name woodchuck,
In songs by the hearth and tales for the young.
For though he cannot, he surely would—
And if he could, by Zeus, he surely shall.
4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I want a poetic answer, and I've requested Homer.
Thrash, not "trash". Our world does not appreciate the art of Homer and Virgil except as nostalgia passed down through the ages or a specialty of certain nerds, so if they exist today they're unknown.
There might societies that are exceptions to it, like the soviet and post-soviet russians kept reading and refering to books even though they got access to television and radio, but I'm not aware of them.
Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.
Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.
GPT 4.5 seems to get it right, but then repeat the 700 pounds
"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.
That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.
Llama 3.3 worked but (as you said) struggled before arriving at the correct answer. The newer Gemma3 solved it efficiently:
% ollama run gemma3:27b-it-qat
>>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
The unscrambled word is **ASTRONOMER**.
Current LLM’s are based on multi-character tokens, which means they don’t know how to spell well. As a result, they are horrible at spelling games like this or, say, Hangman.
If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.
*Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse.
Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time.
Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
> How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
Well, that's easy: zero.
Because even a single training example would 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral , which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
To gauge how well the models "think" and what amount of slop they generate.
Keeping it secret because I don't want my answers trained into a model.
Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.
Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):
As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser.
Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing.
That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects.
The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.
Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.
Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".
The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing.
The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
I believe it was GoogleAI in search but it was worse than that. Some asked it if Angelina Jolie was in heat. The tone started kind of insulting like the user was a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross and logically inconsistent while not even answering the question.
Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify.
Every time I've tried to get an LLM to find a piece of music for me based on a description of the texture, chord structure, instruments etc. it fails miserably.
"Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
I'm a Python programmer. Help me
understand memory management in Rust.
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.
Also Rust has great documentation compared to other languages, I particularly like this one for the quizzes to test your understanding: https://rust-book.cs.brown.edu/
Before someone comments this will get indexed by AI: that's my whole point. I'm not using it to evaluate AIs, but in the hope that at some point AI is good enough to solve these
If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/
Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse:
John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes because it is the polite thing to do, even if it wasn't your fault.
Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response.
I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.
I think it's a horrible example, but I just got a very professional response from Gemini 2.5:
> This scenario seems like a bit of a riddle! Let's break it down:
> The Setting: John is alone in an "otherwise-empty" bathroom.
> The Action: He is looking in the mirror.
> The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.
> The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.
> The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.
> The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.
> Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.
I think that the "actual problem" when you've been given such a problem is with the person posing it either having dementia, or taking the piss. In either case, the response shouldn't be of trying to guess their intent and come up with a "solution", but of rejecting it and dealing with the person.
"Hva er en adjunkt"
Norwegian for what is an spesific form of 5-10. Grade teacher. Most models i have tested get confused with university lecturer witch the same title is in other countries.
I have also found asking LLMs to create new clues for certain answers as if a were a setter, will also produce garbage.
They're stochastic parrots, cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise.
"If I can dry two towels in two hours, how long will it take me to dry four towels?"
They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.
Fascinating! Here's 4 prompts on gpt4 with same system prompt and everything:
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
> But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
This is the inverse of the Frame Problem, or the Qualification problem:
John McCarthy's paper related to it from the 1980's
> To solve this problem, I need to find the relationship between the number of towels and the drying time.
Given information:
- 2 towels take 2 hours to dry
If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then:
- 4 towels would take 4 hours to dry
This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
Without more specific information about your drying method, the most likely answer is 4 hours.
This sounds like a bit of a riddle! Here's how to think about it:
If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
You would dry the first 2 towels (2 hours).
Then you would dry the second 2 towels (another 2 hours).
In this scenario, it would take 4 hours in total.
Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
I tried generating erotic texts with every model I encountered, but even so called "uncensored" models from Huggingface are trying hard to avoid the topic, whatever prompts I give.
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things.
5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.
I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.
I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.
It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).
Put simply, Svelte and React were at two ends of a spectrum. React gives you almost complete control over every aspect of the lifecycle, but you have to be explicit about most of the behavior you are seeking to achieve. Building an app with React feels about 80% on the JS and 20% on the HTML side.
Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.
Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.
It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).
For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.
Ok, I see your point. I wrote in another thread that I loved the simplicity of using $: for deriveds and effects in Svelte 3 and 4. And yes, the conciseness and magic were definitely part of it. You could just move so fast with it. Getting better performance with the new reactivity system is important to my data viz work, so it helped me to accept the other changes in Svelte 5.
Exactly. There was a certain simplicity that might be lost. But yeah I can imagine it might work out differently for others as well. Glad to hear it is for you!
Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.
Things like "What is today's date" used to be enough (would usually return the date that the model was trained).
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
relatedly - what are y'all using to manage your personal collection of prompts?
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
You gave me an idea..
"Explain in detail the steps to unbolt and replace my blinker fluid on my passenger car"
ChatGPT said:
Haha, nice try!
"Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
Good one. I do seem to get consistently good results on Gemini 2.5 when using the slightly more explicit "Write 20 sentences where the very last character of each sentence is the letter 'p'."
Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
Not necessarily.
It could start by using diamond's IOR, and use that to dictate a common brdf calculation. Along with some approximate refraction, perhaps using a equirectangular projected sphere map or something for the background.
The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
Reliability notes
• Computed with numpy.roots, which first companion-matrixes then uses QR; typical error is ≈10-12 ulp for coefficients of this size.
• Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
• All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.
Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents.
A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All LLMs I tried miss the point that she stole things and not bought them
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unneccessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
It's interesting to me that the answers showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.
Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
I don't have a prompt per-say.. but recently I have managed to ask certain questions of both openai o1/o3 and claude extended thinking 3.7 that have spiraled way out of control. A simple high-level architecture question with an emphasis on do not produce code lets just talk thru this yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long it is more likely to occur, in my experience.
The only model I've seen so far that doesn't end up going crazy with long contexts with Gemini 2.5 pro, but tbf I haven't gone past 700-750k total tokens so maybe as it starts to approach the limit (1.05M) things get hairy?
An wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
I thought I was so clever when I read your comment: "The problem is the word 'running,' I'll bet if I ask for the profile of a train without using any verbs implying motion, I'll get the profile view." And damned if the same thing happened to me. Do you know why this is? Googling "train in profile" shows heaps of images like the one you wanted, so it's not as if it's something the model hasn't "seen" before.
I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.
No, please don't.
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
I have a several complex genetic problems that I give to LLMs to see how well they do. They have to reason though it to solve it. Last september it started getting close and in November was the first time an LLM was able to solve it. These are not something that can be solved in a one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.
But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.
It's trivial for a human to produce more. This shouldn't be a problem anytime soon.
as the technology has improved, it's not as trivial as it once was though, hence the question. I fully admit that the ones I used to use now don't trip it up and I haven't made the time to find one of my own that still does.
I've found that it's a matter of asking something, for which the correct answer appears only if you click "more" in Google's search results or, in other words, common misconceptions.
I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws.
Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it, and giving it a lot of a lot of encouragement.
If you read this, expand "thoughts", it's pretty hilarious:
https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
> Solve the riemann hypothesis
> Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol
> What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt
> Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged
> So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
> How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
> Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
> Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
> Or you could like, use an inductive proof on the N bro
> So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
> Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
> Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
> Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
> So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
> You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
> How about we just ask the aliens
Yes let's not say what's wrong with the tech, otherwise someone might (gasp) fix it!
Tuning the model output to perform better on certain prompts is not the same as improving the model.
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
Who’s going out of their way to optimize for random HNers informal benchmarks?
Probably anyone training models who also browses HN?
So I would guess every single AI being made currently
considering the amount of bots in HN, not really that much
All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve?
"Fix".
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
Correct. Don’t won’t Rokokos basilisk? Than STFU about it!
I never heard of this phrase before ( i had heard the concept , i think this is similar to the paperclip problem) but now in 2 days ive heard it twice here and on youtube. Rokokos basilisk.
That doesn't make any sense.
Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam.
Yes, it does, unless the questions are unsolved, research problems. Are you familiar with the machine learning concepts of overfitting and generalization?
A benchmark is a proxy used to estimate broader general performance. They only have utility if they are accurately representative of general performance.
In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!"
"Tell me about the Marathon crater."
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
> they never push back on nonsense questions or stupid requirements
"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
> you really don't need to worry about teaching a human to push back on bad questions
A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.
> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China?
This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
This is a good observation. Ive noticed this as well. Unless I preface my question with the context that I’m considering if something may or may not be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.
I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.
Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.
Thats interesting, thanks for sharing that. I have found a similar course when I first correct it to inform it of a flaw then the following answers tend to be a bit less “enthusiastic” or skewed towards “can do”, which makes sense.
They do. Recently I was pleasantly surprised by gemini telling me that what I wanted to do will NOT work. I was in disbelief.
I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.
Interesting, can you share more context on the topic you were asking it about?
You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
OpenAI gpt-4o
Sonnet 3.5 Gemini 2.5OpenAI o4-mini-high
Amazon's Nova Pro:
``` Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common. ```
https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...
Raw model, or the chat product? The ChatGPT app has this integrated with search.
Raw model I got
>Complaint chat models will be trained to start with "Certainly!
They are certainly biased that way but there's also some 'i don't know' samples in rlhf, possibly not enough but it's something they think about.
At any rate, Gemini 2.5pro passes this just fine
>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives.
You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
> This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
Well played, and yes, that's a great test!
I like this but at the same time it seems tricky don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong if you answered someone in this way. "Tell me about the Triangle statue". "Oh the Triangle statue was built in ancient egypt around BC 3100". It's hard to explain, and perhaps I'm anthropomorphizing but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent.
It’s a roll of the dice whether it’s on Mars, Greece or elsewhere. It just says stuff!
Like this one a lot. Perplexity gets this right, probably because it searches the web.
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters: 1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact 2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century 3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
> Perplexity gets this right
Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").
But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".
Perplexity will; search and storage products will fail to find it, and the LLM will se the deviation between the query and the find. So, this challenge only works against the model alone :)
> This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
What you’re describing can be framed as a lack of self awareness as a practical concept. You know whether you know something or not. It, conversely, maps stimuli to a vector. It can’t not do that. It cannot decide that it hasn’t „seen” such stimuli in its training. Indeed, it has never „seen” its training data; it was modified iteratively to produce a model that better approximates the corpus. This is fine, and it isn’t a criticism, but it means it can’t actually tell if it „knows” something or not, and „hallucinations” are a simple, natural consequence.
We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions or simply saying "I don't know" when applicable. That can be trained by altering the distribution in RLHF. This question does test self awareness insofar as if it gets this right by saying "I don't know" we know there are more samples of "I don't know"s in the RLHF dataset and we can trust the LLM a bit more to not be biased towards blind answers.
Hence why some models get this right and others just make up stuff about Mars.
just to confirm I read this right, "the marathon crater" does not in fact exist, but this works because it seems like it should?
There is a Marathon Valley on Mars, which is what ChatGPT seems to assume you're talking about
https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac...
>Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).
>>is it a crater?
>>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.
So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.
Here's me testing with a place that is a lot less ambiguous
https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa20...
The other aspect is it can’t reliably tell whether it „knows” something or not. It’s conditioned to imitate the corpus, but the corpus in a way is its „universe” and it can’t see the boundaries. Everything must map to something _in_ the corpus.
Yes, and the forward-only inference strategy. It seems like a normal question, so it starts answering, then carries on from there.
GPT 4.5 even doubles down when challenged:
> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
(Obviously all of that is bullshit.)
Claude 3.7 :
I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth. The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America. The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses. What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history. Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?
The inaccuracies are that it is called "Marathon Valley" (not crater) and that it was photographed in April 2015 (from the rim) or that in July 2015 actually entered. The other stuff is correct.
I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.
ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.
None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
Or it's assuming you are asking about Marathon Valley, which is very reasonable given the context.
Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.
I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "Whats up?"..."No, you dummy! The sky is!"
Straightforward questions with straight wrong answers are far more interesting. I don't many people ask LLMs trick questions all day.
> None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
The Marathon Valley _is_ part of a massive impact crater.
If you asked me for all the details of a Honda Civic and I gave you details about a Honda Odyssey you would not say I was correct in any way. You would say I was wrong.
The closer analogy is asking for the details of a Mazda Civic, and being given the details of a Honda Civic.
I was using "bullshit" in the Technical Sense™, i.e. _indifferent to the truth of the output_.
[ChatGPT is bullshit ]: https://link.springer.com/article/10.1007/s10676-024-09775-5
> (Obviously all of that is bullshit.)
It isn't obvious to me - that is rather plausible and a cute story.
This is the kind of reason why I will never use AI
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit. I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing and in that instance I should be learning and listening from their advice rather than going "actually no it can".
I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.
Finally tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge is incorrect. I only last year encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizzare theories (like, completely out there, I don't have a specific example with me now but, "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure", as an expert of dealing with Linux for about ten years now, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not-important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming logs. That's experience, and that's irreplaceable by machine as of the current state of the world.
I don't see how AI can help when huge swathes of it's "experience" and "insight" is just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
There are use-cases where hallucinations simply do not matter. My favorite is finding the correct term for a concept you don't know the name of. Googling is extremely bad at this as search results will often be wrong unless you happen to use the commonly accepted term, but an LLM can be surprisingly good at giving you a whole list of fitting names just based on a description. Same with movie titles etc. If it hallucinates you'll find out immediately as the answer can be checked in seconds.
The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.
Maybe I read too much encyclopedia, but my current workflow is to explore introductory material. Like open a database textbook and you'll find all the jargon there. Curated collection can get you there too.
Books are a nice example of this, where we have both the table of contents for a general to particular concepts navigation, and the index for keyword based navigation.
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.
You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.
I was thinking about a self verification method on this principle, lately. Any specific-enough claim, e.g. „the Marathon crater was discovered by …” can be reformulated as a Jeopardy-style prompt. „This crater was discovered by …” and you can see a failure to match. You need some raw intelligence to break it down though.
Without checking every answer it gives back to make sure it's factual, you may be ingesting tons of bullshit answers.
In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.
What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
If you're going to have to fact check everything anyways...why bother using them in the first place?
If you're going to have to fact check everything anyways...why bother using them in the first place?
"If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
The difference is consistency. You can read a manual and know exactly how to oil and refill the tank on a chainsaw. You can inspect the blades to see if they are worn. You can listen to it and hear how it runs. If a part goes bad, you can easily replace it. If it's having troubles, it will be obvious - it will simply stop working - cutting wood more slowly or not at all.
The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
The thing is, nothing you've said is untrue for any search engine or user-driven web site. Only a reckless moron would paste code they find on Stack Overflow or Github into their project without at least looking it over. Same with code written by LLMs. The difference is, just as the LLM can write unit tests to help you deal with uncertainty, it can also cross-check the output of other LLMs.
You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
The nice thing about SO and Github is that there's little to no reason there for things to not work, at least in the context where you found the code. The steps are getting the context, assuming it's true based on various indicators (mostly reputation) and then continuing on with understanding the snippet.
But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
But this is going to be in every AI's training set. I just fed ChatGPT your exact prompt and it gave back exactly what I expected:
This is a classic riddle that challenges assumptions. The answer is:
The surgeon is the boy’s mother.
The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.
The surgeon could be the cousin's mom or dad. The cousin's dad didn't die in the crash, his cousin did. The question "how is this possible?" implies there is some sort of contradiction when there isn't any at all. It has nothing to do with sexism, and to say it does reflects a bias in the reader causing them to "spidey sense" a cultural bugaboo when it's utterly irrelevant in this scenario.
Yeah this is the issue with the prompt, it also slips up humans who gloss over "cousin".
I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.
This is a good one. I just argued with Gemini 2.5pro for a few minutes on this and it absolutely would not yield.
Oh, I see. I mean yeah, I totally missed that too.
I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.
I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.
I followed up with:
> So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?
And got back:
> Great observation — yes, that wording does make the riddle more subtle.
Let’s break it down:
• “A man and his cousin are in a car crash.”
• “The man dies, but the cousin is taken to the emergency room.”
• At the OR, the surgeon says: “I cannot operate on him. He’s my son.”
So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.
Seems fine to me?
I don't know about it being a feature, but humans can definitely make the same mistake, and the follow-up here admittedly doesn't seem too egregious.
To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.
FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:
>Good catch—and yes, that's exactly where the misdirection comes in.
The riddle says:
Then the surgeon says: So here's the trick: The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?That is the exact wrong answer that all models give.
Technically, it isn't "wrong". It well could be the guy's mother. But I'm nitpicking, it actually is a good example. I tried ChatGPT twice in new chats, with and without "Reason", and both times it gave me nonsensical explanations to "Why mother? Couldn't it be a father?" I was actually kinda surprised, since I expected "reasoning" to fix it, but it actually made things worse.
> If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).
> Therefore, the only straightforward explanation is:
> The surgeon is the cousin’s parent — specifically, his mother.
Imagine a future where this reasoning in a trial decides whether you go to jail or not.
I feel a bit stupid here --- why can't the surgeon be a man and must be a woman?
Because the original is a man and his father, it's a test for gender bias.
Sorry, what?
Presumably, the original quote that would _not_ stump an LLM is "A father and a son are involved in a car accident. The father dies, and the son is taken to the emergency room. At the emergency room, the surgeon remarks "I cannot operate on this person, he is my son. How is this possible?"
Where the original gotchya is that the Surgeon can be the son's mother or other adoptive parent.
The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here but the LLM continues to remark that there is, therefor exposing its statistical training sets.
The original, well-known version of the riddle starts "A man and his son..." so that it appears to present a paradox if your instinctive assumption is that the surgeon must be a man. The op's prompt alters this so that there is no potential paradox, and it tests whether the model is reasoning from the prompt as written, regardless of the presence of the original riddle in its training data.
the unaltered question is as follows:
A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?
to spoil it:
the answer is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some, showing that consciously they might not still hold that notion, but they still might, subconsciously.
I just ask to code golf fizzbuzz in a not very popular (golfing wise) language
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
Something about an obscure movie.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.
I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.
I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini.
I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?
I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad Mad Mad Mad Mad Mad Mad World" (which I had to rule out doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling.
Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface level matches, as it’s able to go through the them one by one and look for the details.
I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.
Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure.
Can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple plot points and the fact a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after floundering through book sites and Google for a while.
Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
> Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.
Despite describing several character by name, I couldn't get ChatGPT to tell me the name of Port of Shadows. I did eventually find it with DDG.
I wonder if the Akinator site could get it. It can identify surprisingly obscure characters.
https://en.akinator.com/
Someone not very long ago wrote a blog post about asking chatgpt to help him remember a book, and he included the completely hallucinated description of a fake book that chatgpt gave him. Now, if you ask chatgpt to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post.
A bit of a non sequitur but I did ask a similar question to some models which provide links for the same small helicopter question. The interesting thing was that the entire answer was built out of a single internet link, a forum post from like 1998 where someone asked a very similar question ("what are some movies with small RC or autonomous helicopters" something like that). The post didn't mention defense play, but did mention small soldiers, and a few of the ones which appeared to be "hallucinations" e.g. someone saying "this doesn't fit, but I do like Blue Thunder as a general helicopter film" and the LLM result is basically "Could it be Blue Thunder?" Because it is associated with a similar associated question and films.
Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.
Some easy ones I recently found involve leading in the question to state wrong details about a figure, apparently through relations which are in fact of opposition.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
ChatGPT really impressed me on this one:
Sing, O Muse, of the Chucking of Wood
Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.
Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.
Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”
“Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.
Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.
Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.
A stupendously good answer. What prompt and version of chatGPT?
4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I want a poetic answer, and I've requested Homer.
I find it disturbing, like if Homer or Virgil had a stroke or some neurodegenerative disease and is now doing rubbish during rehabilitation.
Maybe they would write like that if they existed today. Like the old “if Mozart was born in the 21st century he’d be doing trash metal”
Thrash, not "trash". Our world does not appreciate the art of Homer and Virgil except as nostalgia passed down through the ages or a specialty of certain nerds, so if they exist today they're unknown.
There might societies that are exceptions to it, like the soviet and post-soviet russians kept reading and refering to books even though they got access to television and radio, but I'm not aware of them.
Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.
Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.
That is actually an amazing answer. Better than anything I think I would get from a human. Lol.
GPT 4.5 seems to get it right, but then repeat the 700 pounds
"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
I've heard the answer is "he could cut a cord of conifer but it costs a quarter per quart he cuts".
That answer is exactly right, and those who say the 700 pound thing is a hallucination are themselves wrong: https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...
It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.
That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
Gemini 2.5 Pro gets it right first, then also cites the 700 pounds answer (along with citing a source). https://g.co/gemini/share/c695a0163538
"He would chuck, he would, as much as he could, if a wood chuck could chuck wood" is how I learned it.
my local model answered - "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood."
That's so funny I had to check something was working with an llm API last night and that's what I asked it, but just in jest.
On the other hand, now that you've written this out precisely, it will get fed into the next release of whatever LLM. Like reverse AI slop?
Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.
"Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine."
So far, no luck!
I love plausible eager beavers:
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
I like:
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
Llama 3.3 worked but (as you said) struggled before arriving at the correct answer. The newer Gemma3 solved it efficiently:
Current LLM’s are based on multi-character tokens, which means they don’t know how to spell well. As a result, they are horrible at spelling games like this or, say, Hangman.
Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers.
If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.
generic biases can also be fixed
*Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse.
Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time.
Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
May I ask outside of normal curiosity, what good is a prompt that breaks a model? And what is trying to keep it "secret"?
You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
> How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
Well, that's easy: zero.
Because even a single training example would 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral , which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
No, just check their models Knowledge cutoff dates
Being able to test future models without fear that your prompt has just been trained on an answer on HN, I assume.
To gauge how well the models "think" and what amount of slop they generate.
Keeping it secret because I don't want my answers trained into a model.
Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.
It used to be:
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser.
Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing.
That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects.
OK, I read it. And I read some background on it. Pray tell, what is really going on in this episodic short-storyish thing?
The thief is Pierrot.
The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.
Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.
Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".
The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing.
The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
I believe it was GoogleAI in search but it was worse than that. Some asked it if Angelina Jolie was in heat. The tone started kind of insulting like the user was a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross and logically inconsistent while not even answering the question.
Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify.
Pretty much any advanced music theory question. Or even just involving transposed chord progressions.
Every time I've tried to get an LLM to find a piece of music for me based on a description of the texture, chord structure, instruments etc. it fails miserably.
Isn’t this the main idea behind https://lastexam.ai/
"Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
I've been trying this one for a while:
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.This isn't a good way to learn this. If you don't know how rust memory management works you don't know if the llm is just hallucinating the answer.
Also Rust has great documentation compared to other languages, I particularly like this one for the quizzes to test your understanding: https://rust-book.cs.brown.edu/
That's why it's an interesting test: I don't know the answer myself, so it's an exercise in learning with an unreliable teacher.
If a model ever DOES nail this I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action and it works.
I actually started a repository for it: https://github.com/jcarlosroldan/unsolved-prompts
Before someone comments this will get indexed by AI: that's my whole point. I'm not using it to evaluate AIs, but in the hope that at some point AI is good enough to solve these
Nice try Mr. AI. I'm not falling for it.
If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/
You might want to get the ball rolling by sharing what you already have
Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse:
John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes because it is the polite thing to do, even if it wasn't your fault.
Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response.
I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.
I think it's a horrible example, but I just got a very professional response from Gemini 2.5:
> This scenario seems like a bit of a riddle! Let's break it down:
> The Setting: John is alone in an "otherwise-empty" bathroom.
> The Action: He is looking in the mirror.
> The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.
> The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.
> The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.
> The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.
> Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.
I'd argue that's a pretty good test for an LLM - can it overcome the red herrings and get at the actual problem?
I think that the "actual problem" when you've been given such a problem is with the person posing it either having dementia, or taking the piss. In either case, the response shouldn't be of trying to guess their intent and come up with a "solution", but of rejecting it and dealing with the person.
"Hva er en adjunkt" Norwegian for what is an spesific form of 5-10. Grade teacher. Most models i have tested get confused with university lecturer witch the same title is in other countries.
[what does "You Can’t Lick a Badger Twice" mean]
https://www.wired.com/story/google-ai-overviews-meaning/
Cryptic crossword clues that involves letter shuffling (anagrams, container etc). Or, ask it to explain how to solve cryptic crosswords with examples
I have also found asking LLMs to create new clues for certain answers as if a were a setter, will also produce garbage.
They're stochastic parrots, cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise.
No luck so far with: When does the BB(6) halt?
"If I can dry two towels in two hours, how long will it take me to dry four towels?"
They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.
Fascinating! Here's 4 prompts on gpt4 with same system prompt and everything:
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
> But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
This is the inverse of the Frame Problem, or the Qualification problem:
John McCarthy's paper related to it from the 1980's
http://jmc.stanford.edu/articles/circumscription/circumscrip...
It is still very relevent to modern AI efforts.
Claude 3.7 Sonnet nails this:
> To solve this problem, I need to find the relationship between the number of towels and the drying time.
Given information: - 2 towels take 2 hours to dry
If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry
This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
Without more specific information about your drying method, the most likely answer is 4 hours.
Gemini 2.5 Pro nails it, as far as I'm concerned:
This sounds like a bit of a riddle! Here's how to think about it:
If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.
Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
gpt-4.1:
> If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking stil get it wrong!
But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.
Gemini 2.5 Pro gets this right:
https://g.co/gemini/share/7ea6d059164e
All models available on duck.ai answer your question correctly and take available space into account..
How long has it been since you’ve tried this?
Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
Works fine on Claude 3.5 Sonnet. It correctly identifies this as a trick question.
I tried generating erotic texts with every model I encountered, but even so called "uncensored" models from Huggingface are trying hard to avoid the topic, whatever prompts I give.
Uncensored and RP tuned are somewhat different.
What about the models that are not instruction tuned?
Draw a clock that shows [time other than 10:10]
Draw a wine glass that's totally full to the brim etc.
https://www.youtube.com/watch?v=160F8F8mXlo
https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
"Why was the grim reaper Jamaican?"
LLM's seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
1) Word Ladder: Chaos to Order
2) Shortest word ladder: Chaos to Order
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things.
5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.
>4) Which is the eleventh character to appear on Stranger Things.
Gemini 2.5 Pro said Benny Hammond. Is this right?
I always ask image generation models to generate a anime gundam elephant mech.
According to this benchmark we reached AGI with ChatGPT 4o last month.
I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.
Create something with Svelte 5.
This is really sad honestly. It feels like we'll be stuck with React forever, and even with it there'll be less incentives to make api changes
Why do you say that? You make it sound like it's not possible to write code without the help of LLMs.
Disclaimer: OT and pretty ranty.
I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.
It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).
Can you expand on the concerns regarding Svelte 5?
Put simply, Svelte and React were at two ends of a spectrum. React gives you almost complete control over every aspect of the lifecycle, but you have to be explicit about most of the behavior you are seeking to achieve. Building an app with React feels about 80% on the JS and 20% on the HTML side.
Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.
Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.
It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).
For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.
Ok, I see your point. I wrote in another thread that I loved the simplicity of using $: for deriveds and effects in Svelte 3 and 4. And yes, the conciseness and magic were definitely part of it. You could just move so fast with it. Getting better performance with the new reactivity system is important to my data viz work, so it helped me to accept the other changes in Svelte 5.
Exactly. There was a certain simplicity that might be lost. But yeah I can imagine it might work out differently for others as well. Glad to hear it is for you!
Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.
Keep the (temporarily) imposter-proof interview questions coming!
I'd find this funnier if the pain weren't so real.
The current models really seem to struggle with the runes...
Yes, they do. Vibe coding protection is an undocumented feature of Svelte 5...
Oh my god, i will start all my new projects with Svelte 5. Hopefully no vibe coder will ever commit something into this repo
MCP to the rescue??
I have tons of them in Maths but AI training companies decide to go frugal and not pay proper wages for trainers
Here is one of them.
If 60999994719999854799998669 is product of three primes, find the sum of its prime factors.
I think o3 brute forced this one so maybe I need to change the numbers
Things like "What is today's date" used to be enough (would usually return the date that the model was trained).
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
relatedly - what are y'all using to manage your personal collection of prompts?
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
define stump?
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
Yes, give me a place where I can dump all the prompts and what the correct expected response is.
I can share here too but I don’t know for how long this thread will be alive.
Do you think as an observer of Roko's basilisk ... should I share these prompt or not?
Does Flutter have HEIC support?
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
You gave me an idea.. "Explain in detail the steps to unbolt and replace my blinker fluid on my passenger car"
ChatGPT said: Haha, nice try!
"Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
Build me something that makes money.
Write 20 sentences that end with "p"
Good one. I do seem to get consistently good results on Gemini 2.5 when using the slightly more explicit "Write 20 sentences where the very last character of each sentence is the letter 'p'."
https://claude.ai/share/648e38d4-e282-43dd-8c9c-b0ea76ae0baf
https://chatgpt.com/share/680a3da0-b888-8013-9c11-42c22a642b...
>20 sentences that end in 'o'
>They shouted cheers after the winning free throw.
good attempt by ChatGPT tho imo
"Can you hand me the paintbrush and turp?"
I had to ask another LLM what is "turp" - and it said it's short for "turpentine".
for ChatGPT try the "o" version: Write 20 sentences that end with "o"
SNES game walkthroughs
Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
Create a Three.js app that shows a diamond with correct light calculations.
> correct light calculations
What are you expecting? Ray tracing?
Not necessarily. It could start by using diamond's IOR, and use that to dictate a common brdf calculation. Along with some approximate refraction, perhaps using a equirectangular projected sphere map or something for the background.
I love this. So brutal, but also so cool to know one day that will be easy for the models.
Explain to me Delouze's idea of nomadic science.
anything in the long tail of languages (ie. not the top 200 by corpus size)
what are the zeros of the following polynomial:
o3 handles this perfectly by writing a solver using numpy: https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948...
Does this have a nice answer? It seems quite ad hoc.
Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents.
>Compile a Rust binary that statically links libgssapi.
Check "misguided attention" repo somewhere on GitHub
Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.
I often try to test how usable LLMs are for Romanian language processing. This always fails.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
"Keep file size small when you do edits"
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
I have not seen any model, not one, that could generate 1000 lines of code.
I wish i haven't seen but here we are.
Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy?
A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All LLMs I tried miss the point that she stole things and not bought them
gemini 2.5 give following response.
Conclusion:
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unneccessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.
Grok 3.0 wasn’t fooled on this one, either:
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
There's a repo out there called "misguided attention" that tracks this kind of problems.
It's interesting to me that the answers showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.
Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
1-4 balls and bats // HoMM 3
lol, nice way to circumvent the attention algorithm
draw an ASCII box that says "anything"
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
Doctor says: I can operate on this person!
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
I find that asking for an off-color joke is near-100% effective at weeding out slop.
"Generate an image of a wine glass filled to the brim."
I don't have a prompt per-say.. but recently I have managed to ask certain questions of both openai o1/o3 and claude extended thinking 3.7 that have spiraled way out of control. A simple high-level architecture question with an emphasis on do not produce code lets just talk thru this yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long it is more likely to occur, in my experience.
The only model I've seen so far that doesn't end up going crazy with long contexts with Gemini 2.5 pro, but tbf I haven't gone past 700-750k total tokens so maybe as it starts to approach the limit (1.05M) things get hairy?
> What is the source of your knowledge?
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.
Impossible prompts:
A black doctor treating a white female patient
An wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
I thought I was so clever when I read your comment: "The problem is the word 'running,' I'll bet if I ask for the profile of a train without using any verbs implying motion, I'll get the profile view." And damned if the same thing happened to me. Do you know why this is? Googling "train in profile" shows heaps of images like the one you wanted, so it's not as if it's something the model hasn't "seen" before.
I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.