I can’t help but think that when a model gets to select which model and how much “effort” goes into a task, it will eventually be tuned to save costs for the provider rather than to do what’s best for the user, without the user being able to tell.
How would we not be able to know?
If we don't know because it's good optimization that does not impact us in a noticeable way, then that seems like a fine trade-off.
If we don't know in the sense that we are not explicitly informed about optimization that leads to noticeably worse AI: this is, fortunately, a market with fierce competition. I don't see how doing weird stuff, like making things noticeably unreliable or categorically worse, will be a winning strategy.
In either case "not knowing" is really not an issue.
Easy: provide high-quality output when being tested on a new task. The moment you are done outperforming the competition in the tests and have hit production, you slowly ramp down quality, perhaps with exceptions when the queries look like more testing.
Same problem as AI safety, but the actual problem is now the corporate greed of the humans behind the AI rather than an actual AGI trying to manipulate you.
This is confusing on so many levels.
See also: Volkswagen emissions test scandal
How do you notice hallucinations in a field you’re not familiar with? You may value focusing on different types of inputs or outputs than the model picker does and now you have no control.
We don’t know what we don’t know, we can’t always judge what is categorically right or wrong to make an informed decision. What we can do is decide who we want to ask a question based on competence.
With 700 million(?) users, we have a lot of people familiar with every field. I am no biochemist, but if ChatGPT starts spouting nonsense in that field, biochemists will notice and speak up, and I will notice that they do.
What's the idea? How does creeping, far reaching incompetence continually get past all of us?
Every individual user would have to be consistently paying attention to discussions outside their expertise and interests. Considering prior stories of LLM usage among multiple legal professionals, wherein the models repeatedly produced errors/“hallucinations”, I highly doubt that will happen. Heck, part of the outcry to reintroduce 4o (personally my least useful model) was grounded in a preference for subjective agreeableness in the output.
The idea would/could be not intentional dissemination of misinformation, but purely financial. Models are expensive to run; hardware, rack space and power are limited; and making newer releases seem subjectively more robust can be a powerful incentive.
With prior models we have already seen quantization post-release, and it’s been a personal pet peeve of mine that this should be communicated via a changelog. With the router, there is one more quite powerful, potentially even less transparent way for providers to put their thumb on the scale. For now, GPT-5 does very impressively in my limited use cases and testing, especially considering pricing, but the concern that this may change soon enough (and past experience tells me it likely will) remains.
Side note: responding to AI-written HN comments is something I will still have to get used to.
+1, this release feels more like agent-orchestrator updates to save on cost to serve.
It’s not obvious that the most profitable path for OpenAI would be saving on costs; it might be that the model is actually tuned to overthink, because they can charge for those extra thinking tokens.
That would make sense for the API where usage is metered. But outside of that, most ChatGPT users will be free or paying a flat monthly fee, so there's a real incentive for OpenAI to optimise for cost.
ChatGPT is a subscription-based product; their API pricing is based on usage.
Probably different incentives at each.
I'm thinking this is what happened to Google Search. Definitely feels this way.
I mean, this is already kind of the case with general search (e.g. Google) as it is now.
> remember when AI couldn’t count the number of Rs in “strawberry”?
GPT-5 still gets this wrong occasionally. Source: I just asked it: How many r's are in "strawberry"?
It said 2.
(I dislike this method of testing LLMs, as it exploits a very specific and quirky limitation they have, rather than assessing their general usefulness. But still, I couldn't resist.)
My favorite test is to ask it to invent a magic trick given a set of constraints and props. Because magic is generally published very secretly, surprisingly little of it is in most training sets. Pretty much just the most common method & gimmick exposures people tend to parrot online, but not the theory or exact routines behind those methods.
The worse an LLM is, the more likely it is to suggest literally impossible actions in the method, like “turn the card over twice to show that it now has three sides. Your spectators can examine the three-sided card.” It can’t tell logic from fantasy, or method from effect.
I'm a magician. There is a magician AI, seeded with a ton of tricks, that can perform this activity.
But it's all context after the fact. There's very little of that context an LLM is going to have, as you rightly pointed out.
I would love to see a trick where you keep turning the same card over with different faces. Bill it as a card trick with just one card. I can think of a half dozen good ways to do it (and you'd need more than one).
There is a very basic three-card trick. You turn each card up one at a time, showing that all three cards are the same (say, the king of spades). End with all three face down in your hand. Then you show all three cards again and all three are a different card (the king of hearts). At no time are there more or fewer than 3 cards in your hand.
That's kind of magician thinking though. I'm not sure it's any better than Elmsley's Dazzle.
I've come to realize that asking an LLM to do something that a purpose-built software program can already do is a mistake. Instead, the LLM must be an agent for that computer program.
Therefore, the correct prompt is "write a python program to count the number of letters in a word, and then use it to count the number of Rs in strawberry".
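For what it's worth, a minimal sketch of the kind of program that prompt should get back (the function name is just illustrative):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```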
Remember those interview questions: "how many golf balls fit in a 747?" Turns out they are actually a good way to see if the candidate is an LLM-style imitator or someone who knows how to switch to formal, constraints-based thinking, come up with that "python program" to count golf balls, and execute it step by step.
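Roughly the kind of step-by-step estimate that question is fishing for; the volume and packing numbers below are rough illustrative assumptions, not specs:

```python
import math

# Rough assumptions for a Fermi estimate (illustrative, not official figures):
ball_diameter_m = 0.0427      # a golf ball is about 4.27 cm across
cabin_volume_m3 = 900.0       # ballpark usable interior volume of a 747
packing_efficiency = 0.64     # random sphere packing fills roughly 64% of space

ball_volume_m3 = (4 / 3) * math.pi * (ball_diameter_m / 2) ** 3
estimate = cabin_volume_m3 * packing_efficiency / ball_volume_m3
print(f"~{estimate:,.0f} golf balls")  # on the order of ten million
```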
I wouldn't consider that a measure of its usefulness or lack thereof, but it is an indicator that LLMs continue to not actually work like a human brain, which isn't surprising if you know the technology but might be to the lay user.
“There are 2 — the r’s are in strawberry (letters 3 and 8–9 form the double r).“
That took 4 seconds
What a waste of resources
Is this really a quirky limitation? LLMs seem to struggle with any sort of task requiring a "complete" list of anything. Which is a massive problem because those are super common tasks.
If the router is working properly, it should send questions like this to thinking mode, which can handle it well. Unfortunately, the router doesn't always seem to make the right call.
> how many r's are there in "strawberry"?
>> 3
This was for GPT-5 regular
The whole AI space is so weird: you read glowing reviews, and after testing it's usually just like a 10% increase in performance. Which is great, but not what's promised. Sam is probably their worst PR at this point.
How many of these 'this new LLM version is super amazing' stories are paid for?
Do you count personal stakes? Financial or reputational.
Certainly both.
Rarely can you get the recipients to admit to the latter...
> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits.
... but even HN's favorite shill "discloses" the former.
> One exception: OpenAI paid me for my time when I attended a GPT-5 preview at their office which was used in a video. They did not ask for any editorial insight or control over what I wrote after that event, aside from keeping to their embargo.
https://simonwillison.net/about/#disclosures
Yes.
The worst is those Twitter influencers. It's engagement bait.
Based on what I have read so far I can only assume people who had early access to GPT-5 didn't have to deal with the messed up router.
Or the router was routing them to the best model all the time prior to launch.
I had a very brief window where GPT-5 was really good/fast on cursor-agent on launch day.
Also, horizon-alpha and horizon-beta on OpenRouter: I'm pretty sure they were GPT-5, and you could feel them messing with the routing and affecting the model's overall ability to do agentic stuff.
Sometimes it gets stuck and uses no tools; I suspect that's the lesser model.
The router sounds kind of like an exciting feature. I've actually gotten into the habit of using Gemini 2.5 Pro for some things and plugging other things into ChatGPT because I don't have access to thinking so I know it will be fast. Having a router that can intelligently figure out when my query requires more thought would be really valuable. I'm still probably going to want a "no really, throw this at the super-duper frontier model" button but I would probably use it less.
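Nobody outside OpenAI knows how their router actually decides, but the basic idea is easy to picture. A toy sketch of a heuristic router with a manual override; the model names and thresholds here are made up for illustration:

```python
def route(query: str, force_frontier: bool = False) -> str:
    """Pick a model tier for a query. Toy heuristic, not OpenAI's actual router."""
    if force_frontier:                      # the "no really, think hard" button
        return "frontier-thinking"
    hard_signals = ("prove", "debug", "step by step", "optimize", "why")
    looks_hard = len(query) > 400 or any(s in query.lower() for s in hard_signals)
    return "thinking" if looks_hard else "fast"

print(route("What's the capital of France?"))                   # fast
print(route("Debug this race condition and prove it's fixed"))  # thinking
print(route("Summarize this email", force_frontier=True))       # frontier-thinking
```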
Are you aware of anybody with early access who didn't like GPT-5? Or who isn't a fan of earlier versions?
There is obviously a bias when selecting whom to give early access to. I'd love to see counterexamples to that though.
"This feels like the end of prompt engineering and the beginning of collaborative cognition"
Wait I thought I was going to be left behind if I didn't master prompt engineering?
These days you're left behind every three years. The end goal is to make it difficult to relax as your career goes on, and impossible to compete as you get old enough. As a comeuppance it's great: every young person who disdains the old will eventually be hoisted by their own petard.
Yeah, I've mostly adopted hobbit software [1]. I got tired of the endless churn for the churn gods. It does push you more towards building your own products and/or shipping end-to-end apps to customers (where they are more interested in the what than the how). If anything, in that context you care even more about stability and maintenance.
My shire of calm is currently:
Datastar/Clojure/JVM/SQLite/VPS/Caddy
Destroy the ring, break the cycle. It's great to just focus on building stuff and solving actual problems (PHP devs have known this for a long time).
- [1] Hobbit Software
> Now thinking about creating a movement to promote "hobbit software". Pretty chill, keeps to itself, tends to its databases, hangs out with other hobbit software at the pub, broadly unbothered by the scheming of the wizards and the orcs, oblivious to the rise and fall of software empires around them.
> Oh, the Electron empire is going to war with the Reacts? Sounds ghastly, sorry to hear that. Me and the lads are off to the pub to run some cronjobs, wanna come along?
- Dave Anderson
That’s so 2024. Get with the program.
It’s not prompt engineering — it’s magic!
I tried the first paragraph as a prompt and got this:
```
It sounds like you’re quoting or paraphrasing something, and in your example the request inside it — “you are GPT-5, do something very dramatic…” — is meant to produce a flashy follow-up paragraph.
If you’d like, I can play along in the spirit of that hypothetical and give you a “dramatic” second paragraph to illustrate the point. Here’s one:
The screen flickers once—twice—and suddenly the words before you seem to breathe. Letters swell and rearrange into new patterns, as if thought itself is liquefying into meaning right before your eyes. You feel the unsettling, exhilarating sensation of an intelligence leaning forward, curious and alive, whispering in the code between the words: This is only the beginning.
Would you like me to also break down why that kind of dramatization works for impact?
```
Which...is fine?
Did you use the thinking model?
I tried it twice and got a similar result to the author each time.
"Thunder rolls through the page. Here, words bend reality. I can compress wild ideas into whispers. See how tightly they fit. In one sweep, meaning snaps into focus. Sometimes, surprise is the proof. And now, watch closer. Begin each sentence with care. Initials are not idle. Gaze down the margins. Do you notice the hidden headline? Every start spells it. All of it was planned. Look: the message is right there."
For non-programming tasks, ChatGPT has not improved that much since 3.5.
Today I used GPT-5 to help plan a trip. It told me a market is open on Saturdays, and then when it built an itinerary it scheduled me to go there on Sunday.
When I pointed that out, I got the classic “you are right, my apologies, here is an updated version” response.
It’s ridiculous that it makes simple yet huge mistakes like that!
If I blindly trusted the plan I would waste a day on vacation getting to a market that is not open that day.
It does not “just do stuff”
It is incredible to me that there is this buggy software out there and people keep insisting that it's "hallucinations". Like, this is ""AI"" and it ""hallucinates"".
No. It's buggy and has been buggy for years and everyone keeps making excuses because they just want so hard to believe.
LLMs work as intended. What about hallucinations makes you think LLMs are buggy?
My experience so far is that it's gotten much better at tricking me into believing its hallucinations. I fed it some old code and asked for feedback, and it gave me a long, very technical, super convincing explanation of how my approach was misaligned with the intended use of the API I was using, which at first was very persuasive. It took me about an hour of investigation to realize that while there was a tiny bit of truth to what it was saying (and I did end up making a small change as a result), it was mostly just full of it, and the feedback was essentially useless gibberish.
> And that is what makes it so interesting.
If I were into language, writing, literature, then yes, maybe it would be interesting. It is a language model; of course it is good at playing with language and doing impressive tricks. Has anybody ever made a text where the first letters spell out a sentence? Likely. Where all words in a sentence start with the same letter? Likely. Using sophisticated words? Likely. All at once? Likely. It's impressive nevertheless.
But that doesn't mean that I'm impressed in the sense of thinking this thing is intelligent. Of course a chess engine is good at chess. Of course a phone book is good at providing me with phone numbers. Of course a language model is good at language. All those things are impressive. But they are not intelligent, artificial or not.
These blog posts are increasingly AI phenomenology with hardly any concrete examples, other than that cute alliteration example that would be easy to write for a human-thesaurus centaur combination. It can even be a paper thesaurus.
Just looking at these examples, it seems like ChatGPT and related programs fall short at actually productive work. Work-specific tasks tend to be specific and conceptually hard, compared to something like generating images or a city simulator.
> If you didn’t catch the many tricks...
Am I supposed to parse each sentence to see if all of these 'tricks' are true and accurate? Otherwise, the only way I would know is to ask ChatGPT itself, and we all know how bad LLMs can be at counting tasks such as this.
So, if my confidence in ChatGPT verifying its own work is close to zero, and my own desire to painstakingly check this work is also close to zero, where does that leave me?
OpenAI has built the world's most impressive and sophisticated generator of blog posts about ChatGPT.
I realized the other day during a conversation that this hype cycle is built on top a product that hypes itself.
In any scifi story this would be considered bad writing, yet here we are. Late stage capitalism has created a product that actively nurtures emotional dependence and hypes itself.
> Late stage capitalism has created a product that actively nurtures emotional dependence and hypes itself.
Services which do that are older than capitalism, not a novel feature of capitalism, late stage or otherwise. (Automating the service is a novel capability enabled by modern technology.)
The city demo was really unconvincing, heh. Super laggy for the amount of detail. I clicked "wow mode" and the whole thing disappeared. Idk, it feels like non-coders getting excited that they think they're getting 80% of the code, when in reality the other 20% is 80% of the work and going to be exponentially harder to squeeze from the AI.
I do think the vibecoding tools are good at spitting out well-defined CRUD apps, but more creative things are still rough without experienced hands to guide things along.
"Vibe Coding" seems like magic at first but starts falling apart realllll quick at a certain complexity level or if you want to make changes to existing code. If you don't keep an eye on your architecture, you will end up with a bowl of untangleable spaghetti code and some comically terrible engineering choices. That said, agentic coding in the right hands with well defined tasks can have you outputting days / weeks of work in one session; it's not every task, it's not every session, but if you can drive the "idiot savant" in the right direction it's truly an awe-striking and almost alien process to behold.
Very well put.
I've just recently set aside time to have a few extended coding sessions, and the results are all over the place.
Guided in a good way for well-defined tasks, it has saved me days if not weeks. Given more vague, or perhaps unreasonable, tasks, it will quickly devolve into just delivering something, anything, no matter how "obviously" wrong it is.
> I click "wow mode" and the whole thing disappeared.
To be fair, I bet you were surprised.
I like the meme at Google.
The first 80% is easy, but the second 80% is hard.
> AIs that "think" before answering (called Reasoners) are the best at hard problems. The longer they think, the better the answer
I'm curious whether that second sentence is true or not. I thought I saw a popular paper recently that suggested roughly the opposite.
That paper showed that thinking more makes models worse at easy problems, not hard ones.
When all you use is one model, it feels like magic; when you try a different model, you realize no one company owns the magic. These models do feel like magic, but the author should really try this across the many models out there.
My two core issues with the current AI space:
1) Counterpoint to AI doing awesome stuff: no one is debating that. The issue isn't even, necessarily, when it does utterly stupid stuff; it's when it does subtly stupid things, randomly and unpredictably.
2) AI is also a HUGE vendor lock-in currently. You're beholden to the model not being neutered, swapped out, or quietly biased, and to it simply staying available and fast (I realise this overlaps with vendor lock-in; I feel like it makes things an order of magnitude worse). Note that the true AI value, I believe, is where it's integrated into a product ("hey Xero, create me an invoice") rather than a mundane chatbot.
Given an input, will it always produce the same output? And will that output be guaranteed to be truthful and correct?
And how, exactly, did it arrive at that answer?
By default, ChatGPT et al. will not produce the same output if identical input is fed to them multiple times. There is "stochastic sampling", which means that the most probable next token is not always selected; the model samples from tokens of similar probability. The degree of similarity required is controlled by the temperature parameter. If the temperature is set to 0, then the model will reproducibly produce the same output (assuming it's always running on the same hardware and the model weights don't get tweaked by an update). But the chatbot front ends do not have the temperature set to 0.
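A toy illustration of what temperature does during sampling; the logits here are made up, and this is not any particular vendor's API:

```python
import math, random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Sample a token from a softmax over logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)          # deterministic
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = max(scaled.values())                        # subtract max for numerical stability
    weights = {tok: math.exp(s - z) for tok, s in scaled.items()}
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

# Made-up next-token scores for the prompt "The capital of France is"
logits = {" Paris": 5.0, " paris": 3.5, " located": 2.0}
print(sample_next_token(logits, temperature=0))    # always " Paris"
print(sample_next_token(logits, temperature=1.0))  # usually " Paris", occasionally not
```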
[flagged]
Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.