As a security researcher, I am both salivating at the potential that the proliferation of TDD and other AI-centric "development" brings for me, and scared for IT at the same time.
Before we just had code that devs don't know how to build securely.
Now we'll have code that the devs don't even know what it's doing internally.
Someone found a critical RCE in your code? Good luck learning your own codebase starting now!
"Oh, but we'll just ask AI to write it again, and the code will (maybe) be different enough that the exact same vuln won't work anymore!" <- some person who is going to be updating their resume soon.
I'm going to repurpose the term, and start calling AI-coding "de-dev".
Just few days ago I spoke with sec guy who was telling me how frustrating it is to validate AI code.
The problem is marketing.
Cycling industry is akin to audiophiles and will swear on their lives that $15,000 bicycle is the pinnacle of human engineering. This year's bike will go 11% faster than the previous model. But if you read last 10 years of marketing materials and do math it should basically ride itself.
There's so much money in AI right now that you can't really expect anyone to say "well, we had hopes, but it doesn't really work the way we expected". Instead you have pitch after pitch, masses parroting CEOs, and everyone wants to get a seat on the hype train.
It's easy to dispel audiophiles or carbon enthusiasts but it's not so easy with AI, because no one really knows how it works. OpenAI released a paper in which they stated, sorry for paraphrasing, "we did this, we did that, and we don't know why results were different".
When Karpathy wrote Software 2.0 I was super excited.
I naively believed that we'll start building black boxes based on requirements, sets of inputs and outputs, and sudden changes of heart from stakeholders that often happen on a daily basis for many of us and mandates almost complete reimagination of project architecture will simply need another pass of training with new parameters.
Instead the mainstream is pushing hard reality where we mass produce a ton of code until it starts to work within guard rails.
Does it really work? Is it maintainable?
Get out of here. We're moving at 200mph.
The way to code going forward with AI is Test Driven Development. The code itself no longer matters. You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements. That's it. The new reality us programmers need to face is that code itself has an exact value of $0. That's because AI can generate it, and with every new iteration of the AI, the internal code will get better. What matters now are the prompts.
I always thought TDD was garbage, but now with AI it's the only thing that makes sense. The code itself doesn't matter at all, the only thing that matters is the tests that will prove to the AI that their code is good enough. It can be dogshit code but if it passes all the tests, then it's "good enough". Then, just wait a few months and then rerun the code generation with a new version of the AI and the code will be better. The humans don't need to know what the code actually is. If they find a bug, write a new test and force the AI to rewrite the code to include the new test.
I think TDD has really found its future now that AI coding is here to stay. Human code doesn't matter anymore and in fact I would wager that modifying AI generated code is as bad and a burden. We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
I never bought into TDD because it is only usefull for business logic, plain algorithms and data structures, it is no accident that is what 99% of conference talks and books focus on.
There isn't a single TDD talk about shader programming for GPGPU, and validating that what the shader algorithms produce via automated tests, the reason being the amount of enginneering effort only to make it work, and still lacks human sensitivity for what gets rendered.
This is incorrect for a lot of reasons, many of which have already been explored, but also:
> with every new iteration of the AI, the internal code will get better
This is a claim that requires proof; it cannot just be asserted as fact. Especially because there's a silent "appreciably" hidden in there between "get" and "better" which has been less and less apparent with each new model. In fact, it more and more looks like "Moore's law for AI" is dead or dying, and we're approaching an upper limit where we'll need to find ways to be properly productive with models only effectively as good as what we already have!
Additionally, there's a relevant adage in computer science: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." If the code being written is already at the frontier capabilities of these models, how the hell are they supposed to fix the bugs that crop up, especially if we can't rely on them getting twice as smart? ("They won't write the bugs in the first place" is not a realistic answer, btw.)
The argument they are making is that if a bug is discovered, the agent will not debug it, instead a new test case is created, and the code is regenerated (I suppose if a quick fix isn't found). That is why they don't need debugging agent twice as capable as coding agent. I don't know if this works in practice, as in my experience, tests are intertwined with the code base.
While TDD can have some merits, I think this is being way to generous to the value of tests. As Dijkstra said once, "Testing shows the presence, not the absence of bugs." I'm not a devout follower of Uncle Bob, but I was just thumbing through Clean Architecture today and he has a whole section to this point (including the above quote). Right after that quote he writes, "a program can be proven incorrect by a test, but it can not be proven correct." Which is largely true. The only garuntee of TDD is you can show a set of behaviors your program doesn't do, it never proves what the program actually does. To extrapolate to here, all TDD does it put up guardrails for the the AI should not generate.
It depends on how you define testing now: Property-based testing would test sets of behaviors. The main idea is: Formalize your goal before implementing. So specification driven development would be the thing to aim for. And at some point we might be able to model check (proof) the code that has been generated. Then we are the good old idea of code synthesis.
TDD is testing in production in disguise. After all, bugs are unexpected and you can’t write tests for a bug you don’t expect. Then the bug crops up in production and you update the test suite.
TDD has always been about two things for me; be able to move forward faster because I have something easy to execute that compares it against the known wanted state, and in the future preventing unwanted regressions. I'm not sure I've ever thought of unit testing as "prevent potential future bugs", mostly up front design prevents that, or I'd use property testing, but neither of those are inside the whole "write test then write code" flow.
The intended workflow of TDD is to write a set of tests before some code. The only reason that makes sense conceptually is to prevent possible future bugs from going undetected.
Put another way if your TDD always pass then there’s no point in writing them, and there’s no known bugs before you have any code. So discovering future bugs that didn’t exist when you’re writing those tests is the point.
I don’t really understand how to write tests before the code… When I write code, the hard part is writing the code which establishes the language to solve the problem in, which is the same language the tests will be written in. Also, once I have written the code I have a much better understanding as the problem, and I am in a way better position to write the correct tests.
I am not really sure if TDD often is compatible with modern agile development. It lends well to more waterfall style. Or clearly defined systems.
If you can design fully what your system does before starting it is more reasonable. And often that means going down to level of are inputs and states. Think more of something like control systems for say mobile networks or planes or factory control. You could design whole operation and all states that should happen or could happen before single line of code.
TDD fits better when you use a bottom up style of coding.
For a simple example, FuzzBuzz as a loop that has some if statements inside is not so easy to test. Instead break it in half so you have a function that does the fiddly bits and a loop that just contains “output += MakeFizzBizzLineForNumeber(X);” Now it’s easy to come up tests for likely mistakes and conceptually you’re working with two simpler problems with clear boundaries between them.
In a slightly different context you might have a function that decides which kind of account to create based on some criteria which then returns the account type rather than creating the account. That function’s logic is then testable by passing in some parameters and then looking at the type of account returned without actually creating any accounts. Getting good at this requires looking at programs in a more abstract way, but a secondary benefit is rather easy to maintain code at the cost of a little bookkeeping. Just don’t go overboard, the value is breaking out bits that are likely to contain bugs at some point where abstraction for abstraction’s sake is just wasted effort.
That's great for rote work, simple CRUD, and other things where you already know how the code should work so you can write a test first. Not all programming works well that way. I often have a goal I want to achieve, but no clue exactly how to get there at first. It takes quite a lot of experimentation, iteration and refinement before I have anything worth testing - and I've been programming 40+ years, so it's not because I don't know what I'm doing.
You write the requirements, you write the spec, etc. before you write the code.
You then determine what are the inputs / outputs that you're taking for each function / method / class / etc.
You also determine what these functions / methods / classes / etc. compute within their blocks.
Now you have that on paper and have it planned out, so you write tests first for valid / invalid values, edge cases, etc.
There are workflows that work for this, but nowadays I automate a lot of test creation. It's a lot easier to hack a few iterations first, play with it, then when I have my desired behaviour I write some tests. Gradually you just write tests first, you may even keep a repo somewhere for tests you might use again for common patterns.
> You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements.
SQLite has tests-lines-to-code-lines ratio above 1000 (yes, 1000 lines of tests for single line of code) and still has bugs.
AMD, at the time it decided to apply ACL2 to its FPU, had 29 million tests (not lines of code, but test inputs and outputs). ACL2 verification found several bugs in the FPU.
Just to make a couple of points for someone to draw a line.
The code always matters. Black box coding like this leads to systems you can't explain, and that's your whole damn job: to understand the system you're building. Anything less is negligence.
If the code doesn't matter anymore, in order of it to be of any quality the test should be as detailed as was the code in the first place, you'd end up writing the code in tests more or less.
The problem is — nobody commits code that fails tests.
The bugs occur because the initial tests didn’t fully capture the desired and undesired behaviors.
I’ve never seen a formal list of software requirements state that a product cannot take more than an hour to do a (trivial) operation. Nobody writes that out because it’s implicitly understood.
Imagine writing a “life for dummies” textbook on how to grow from a 5yr old to 10yr old. It’s impossible to fully cover.
TDD doesn’t ensure the code is maintainable, extendable, follows best practices, etc, and while AI might write some code that can pass tests while the code is relatively small, I would expect in the long run it will find it extremely difficult to just “rewrite everything based on this set of new requirements” and then do that again, and again, and again, each time potentially choosing entirely different architectures for the solution.
AI has a hard time working with code that humans would consider hard to maintain and hard to extend.
If you give AI a set of tests to pass and turn it loose with no oversight, it will happily spit out 500k LOC when 500 would do. And then it will have a very hard time when you ask it to add some functionality.
AI routinely writes code that is beyond its ability to maintain
and extend. It can’t just one shot large code bases either, so any attempt to “regenerate the code”
is going to run into these same issues.
> If you give AI a set of tests to pass and turn it loose with no oversight, it will happily spit out 500k LOC when 500 would do. And then it will have a very hard time when you ask it to add some functionality.
I've been playing around with getting the AI to write a program, where I pretend I don't know anything about coding, only giving it scenarios that need to work in a specific way. The program is about financial planning and tax computations.
I recently discovered AI had implemented four different tax predictions to meet different scenarios. All of them incompatible and all incorrect but able to pass the specific test scenarios because it hardcoded which one to use for which test.
This is the kind of mess I'm seeing in the code when AI is left alone to just meet requirements without any oversight on the code itself.
The reason AI code generation works so well is a) it is text based- the training data is huge and b) the output is not the final result but a human readable blueprint (source code), ready to be made fit by a human who can form an abstract idea of the whole in his head. The final product is the compiled machine code, we use compilers to do that, not LLMs.
Ai genereted code is not suitable to be directly transferred to the final product awaiting validation by TDD, it would simply be very inefficient to do so.
> code itself has an exact value of $0. That's because AI can generate it
That's only true for problems that has been solved and well documented before.
AI can't solve novel problems. I have ton of examples I use from time to time when new models come out. I've tried to ride the hype train, and I've been frustrated working with people before, but I've never been so frustrated as trying to make AI follow simple set of rules and getting:
"Oh yes, my bad, I get that now. Black is white and white is black. Let me rewrite the code..."
My favorite example is tasked AI with a rudimentary task and it gave me a working answer but it was fishy, so I googled the answer and lo and behold I landed on stackoverflow page with exact same answer being top voted answer to question very similar to my task. But that answer also had a ton of comments explaining why you never should do it that way.
I've been told many times that "you know, kubernetes is so complicated, but I tell AI what I want and it gives me a command I simply paste in my terminal".
Fuck no.
AI is great for scaffolding projects, working with typical web apps where you have repeatable, well documented scenarios, etc.
I remember talking about this with a friend a long time ago. Basically, you'd write up tests and there was a magic engine that would generate code that would self-assemble and pass tests. There was no guarantee that the code would look good or be efficient--just that it passed the tests.
We had no clue that this could actually happen one day in the form of gen AI. I want to agree with you just to prove that I was right!
This is going to bring up a huge issue though: nailing requirements. Because of the nature of this, you're going to have to spec out everything in great detail to avoid edge cases. At that point, will the juice be worth the squeeze? Maybe. It feels like good businesses are thorough with those kinds of requirements.
The idea of TDD is that you should have the tests before you have the code. If your code is failing in real life before you have the tests, that's no longer TDD.
I mostly agree, but why stop at tests? Shouldn’t it be spec driven development? Then neither the code or the language matter. Wouldn’t user stories and requirements à la bdd (see cucumber) be the right abstraction?
Natural language is too ambiguous for this, which makes it impossible to automatically verify
What you need is indeed spec-driven development, but specs need to be written in some kind of language that allows for more formal verification. Something like https://en.wikipedia.org/wiki/Design_by_contract, basically.
It is extremely ironic that, instead, the two languages that LLMs are the most proficient in - and thus the ones most heavily used for AI coding - are JavaScript and Python...
I don't think you're wrong but I feel like there's a big bridge between the spec and the code. I think the tests are the part that will be able to give the AI enough context to "get it right" quicker.
It's sort of like a director telling an AI the high level plot of a movie, vs giving an AI the actual storyboards. The storyboards will better capture the vision of the director vs just a high level plot description, in my opinion.
The irony is that I tried this with a project I've been meaning to bang out for years, and I think the OP's idea a natural thought to have when working with LLMs: "what if TTD but with LLMs"
When I tried it, it "worked", I admittedly felt really good about it, but I stepped away for a few weeks because of life and now I can't tell you how it works beyond the high level concepts I fed into the LLM.
When there's bugs, I basically have to derive from first principles where/how/why the bug happens instead of having good intuition on where the problem lies because I read/wrote/reviewed/integrated with the code myself.
I've tried this method of development with various levels of involvement in implementation itself and the conclusion I came to is if I didn't write the code, it isn't "mine" in every sense of the term, not just in terms of legal or moral ownership, but also in the sense of having a full mental model of the code in a way I can intellectually and intuitively own it.
Really digging into the tests and code, there are fundamental misunderstandings that are very, very hard to discern when doing the whole agent interfacing loop. I believe they're the types of errors you'd only pick up on if you wrote the code yourself, you have to be in that headspace to see the problem.
Also, I'd be embarrassed to put my name on the project, given my lack of implementation, understanding and the overall quality of the code, tests, architecture, etc. It isn't honest and it's clearly AI slop.
It did make me feel really productive and clever while doing it, though.
> We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
Yes. The first thing I always check in every project (an especially vibe-coded projects) is whether if:
A. Does it have tests?
B. Is the coverage over 70%?
C. Do the tests actually test for the behaviour of the code (good) or just its implementation (bad.)
If any of those requirements are missing, then that is a red flag for the project.
While TDD is absolutely valuable for clean code, focusing too much on it can be the death of a startup.
As you said the code itself is $0, then the first product is still worth $10 and the finished product is worth $1M+ once it makes money, which is what matters.
This is an interesting take — shifting focus from “writing the best code” to “defining the right tests” makes sense in an AI-driven world. But I’m skeptical if treating the generated code as essentially disposable is wise — tests can catch a lot, but they won’t automatically enforce readability, maintainability, or ensure unexpected behaviors don’t slip through
> Instead, we use an approach where a human and AI agent collaborate to produce the code changes. For our team, every commit has an engineer's name attached to it, and that engineer ultimately needs to review and stand behind the code. We use steering rules to setup constraints for how the AI agent should operate within our codebase,
This sounds a lot like Tesla's Fake Self Driving. It self drives right up to the crash, then the user is blamed.
Except here it's made abundantly clear, up front, who has responsibility. There's no pretense that it's fully self driving. And the engineer has the power to modify every bit of that decision.
Part of being a mature engineer is knowing when to use which tools, and accepting responsibility for your decisions.
It's not that different from collaborating with a junior engineer. This one can just churn out a lot more code, and has occasional flashes of brilliance, and occasional flashes of inanity.
This is the first time I see "steering rules" mentioned. I do something similar with Claude, curious how it looks for them and how they integrate it with Q/Kiro.
Those rules are often ignored by agents. Codex is known to be quite adhering, but it falls back to its own ideas, which run counter to rules I‘ve given it. The longer a session goes on, the more it goes off the rails.
I'm aware of the issues around rules as in a default prompt. I had hoped the author of the blog meant a different mechanism when they mentioned "steering rules". I do mean something different, where an agent will self-correct when it is seen going against rules in the initial prompt. I have a different setup myself for Claude Code, and would call parts of that "steering"; adjusting the trajectory of the agent as it goes.
With Claude Code, you can intercept its prompts if you start it in a wrapper and mock fetch (someone with github user handle „badlogic“ did this, but I can’t find the repo now). For all other things (and codex, Cursor) you‘d need to proxy/isolate all comms with the system heavily.
Yes they do, most of the time. Then they don’t. Yesterday, I told codex that it must always run tests by invoking a make target. That target is even configurable w/ parameters, eg to filter by test name. But always, at some point in the session, codex started disregarding that rule and fell back to using the platform native test tool directly. I used strong language to steer it back, but 20% or so of context later, it did that again.
"steering rules" is a core feature baked into Kiro. It's similar to the spec files use in most agentic workflows but you can use exclusion and inclusion rules to avoid wasting context.
There's currently not an official workflow on how to manage these steering files across repos if you want to have organisation-wide standards, which is probably my main criticism.
This article is right, but I think it may underplay the changes that could be coming soon. For instance, as the top comment here about TDD points out, the actual code does not matter anymore. This is an astounding claim! And it has naturally received a lot of objections in the replies.
But I think the objections can mostly be overcome with a minor adjustment: You only need to couple TDD with a functional programming style. Functional programming lets you tightly control the context of each coding task, which makes AI models ridiculously good at generating the right code.
Given that, if most of your code is tightly-scoped, well-tested components implementing orthogonal functionality, the actual code within those components will not matter. Only glue code becomes important and that too could become much more amenable to extensive integration testing.
At that point, even the test code may not matter much, just the test-cases. So as a developer you would only really need to review and tweak the test cases. I call this "Test-Case-Only Development" (TCOD?)
The actual code can be completely abstracted away, and your main task becomes design and architecture.
All the downsides that have been mentioned will be true, but also may not matter anymore. E.g. in a large team and large codebase, this will lead to a lot of duplicate code with low cohesion. However, if that code does what it is supposed to and is well-tested, does the duplication matter? DRY was an important principle when the cost of code was high, and so you wanted to have as much leverage as possible via reuse. You also wanted to minimize code because it is a liability (bugs, tech debt, etc.) and testing, which required even more code that still didn't guarantee lack of bugs, was also very expensive.
But now that the cost of code is plummeting, that calculus is shifting too. You can churn out code and tests (including even performance tests, which are always an afterthought, if thought of at all) at unimaginable rates.
And all this while reducing the dependencies of developers on libraries and frameworks and each other. Fewer dependencies means higher velocity. The overall code "goodput" will likely vastly outweight inefficiences like duplication.
Unfortunately, as TFA indicates, there is a huge impedance mismatch with this and the architectures (e.g. most code is OO, not functional), frameworks, and processes we have today. Companies will have to make tough decisions about where they are and where they want to get.
I suspect AI-assisted coding taken to its logical conclusion is going to look very different from what we're used to.
The biggest thing that stood out to me was that they suddenly started working nonstop, even on weekends…? If AI is so great, why can’t they get a single day off in two months?
Well, 'calculus' is the kind of marketing word that sounds more impressive than 'arithmetic' and I think 'quantum logic' has gone a bit stale, and 'AI-based' might give more hope to the anxious investor class, as 'AI-assisted' is a bit weak as it means the core developer team isn't going to be cut from the labor costs on the balance sheet, they're just going to be 'assisted' (things like AI-written unit tests that still need some checking).
"The Arithmetic of AI-Assisted Coding Looks Marginal" would be the more honest article title.
Yes, unfortunately a phrase that's used in an attempt to lend gravitas and/or intimidate people. It sort of vaguely indicates "a complex process you wouldn't be interested in and couldn't possibly understand". At the same time it attempts to disarm any accusation of bias in advance by hinting at purely mechanistic procedures.
Could be the other way around, but I think marketing-speak is taking cues here from legal-ese and especially the US supreme court, where it's frequently used by the justices. They love to talk about "ethical calculus" and the "calculus of stare decisis" as if they were following any rigorous process or believed in precedent if it's not convenient. New translation from original Latin: "we do what we want and do not intend to explain". Calculus, huh? Show your work and point to a real procedure or STFU
Yep. The problem is then leadership sees this and says "oh, we too can expect 10x productivity if everyone uses these tools. We'll force people to use them or else."
And guess what happens? Reality doesn't match expectations and everyone ends up miserable.
Good engineering orgs should have engineers deciding what tools are appropriate based on what they're trying to do.
I've never worked anywhere that knew where they were going well enough that it was even possible to be a month ahead of schedule. By the time a month has elapsed the plan is entirely different.
AI can't keep up because its context window is full of yesteryear's wrong ideas about what next month will look like.
Yeah, this is the main problem. Writing of code just isn't the bottle neck. It's the discovery of the business case that is the hard part. And if you don't know what it is, you can't prompt your way out of it.
We've been having a go around with corporate leadership at my company about "AI is going to solve our problems". Dude, you don't even know what our problems are. How are you going to prompt the AI to analyze a 300 page PDF on budget policy when you can't even tell me how you read a 300 page PDF with your eyes to analyze the budget policy.
I'm tempted to give them what they want: just a chatter box they can ask, "analyze this budget policy for me", just so I can see the looks on their faces when it spits out five poorly written paragraphs full of niceties that talk its way around ever doing any analysis.
I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
> I'm tempted to give them what they want: just a chatter box they can ask, "analyze this budget policy for me", just so I can see the looks on their faces when it spits out five poorly written paragraphs full of niceties that talk its way around ever doing any analysis.
Ah, but they'll love it.
> I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
We're already doing all the same stuff, except today it's not the AI that's doing that, it's people. One overworked and stressed person somewhere makes for a poorly designed, buggy library, and then millions of other overworked and stressed people spend most of their time at work finding out how to cobble dozens of such poorly designed and buggy pieces of code together into something that kinda sorta works.
This is why the top management is so bullshit on AI. It's because it's a perfect fit for a model that they have already established.
I've got my own gripes about leadership, but I'm finding that even when its a goal I've set for myself, watching an AI fail at it represents a refinement of what I thought I wanted: I'm not much better than they are.
That, or its a discovery of why what I wanted is impossible and it's back to the drawing board.
It's nice to not be throwing away code that I'd otherwise have been a perfectionist about (and still thrown away).
Looking at the “metrics” they shared, going from committing just about zero code over the last two years to more than zero in the past two months may be a 10x improvement. I haven’t seen any evidence more experienced developers see anywhere near that speedup.
Sure, but probably your pre-copilot IDE was autocompleting 7-8 of those lines anyway, just by playing type tetris, and typing the code out was never the slow part?
These are the kind of people that create two letter aliases to avoid typing “git pull” or whatever. Congrats, very efficient, saving 10 seconds per day.
So, yeah, they probably think typing is a huge bottle neck and it’s a huge time saver.
Why would that save me a significant amount of time versus writing the code myself means I don't have to spend a bunch of time analyzing it to figure it what it does?
"We have real mock versions of all our dependencies!"
Congratulations, you invented end-to-end testing.
"We have yellow flags when the build breaks!"
Congratulations! You invented backpressure.
Every team has different needs and path dependencies, so settles on a different interpretation of CI/CD and software eng process. Productizing anything in this space is going to be an uphill battle to yank away teams' hard-earned processes.
Productizing process is hard but it's been done before! When paired with a LOT of spruiking it can really progress the field. It's how we got the first CI/CD tools (eg. https://en.wikipedia.org/wiki/CruiseControl) and testing libraries (eg. pytest)
"For me, roughly 80% of the code I commit these days is written by the AI agent"
Therefore, it is not commited by you, but by you in the name of AI agent and the holy slop.
What to say, I hope that 100x productivity is worth it and you are making tons of money.
If this stuff becomes mainstream, I suggest open source developers stop doing the grind part, stop writing and maintaining cool libraries and just leave all to the productivity guys, let's see how far they get.
Maybe I've seen too many 1000x hacker news..
Just need the feedback to follow suit to be 100x as effective. Tests, docs and rapid loops of guidance with human in the loop. Split your tasks, find the structure that works.
I think it's fine. For example, "I" made this library https://github.com/anchpop/weblocks . It might be more accurate to say that I directed AI to make it, because I didn't write a line of code myself. (And I looked at the code and it is truly terrible.) But I tested that it works, and it does, and it solves my problem perfectly. Yes, it is slop, but this is a leaf node in the abstraction graph, and no one needs to look at it again now that it it written
Most code, though, is not write once and ignore. So it does matter if its crap, because every piece of software is only as good as its least dependency.
Fine for just you. Not fine for others, not fine for business, not fine the moment you star count starts moving.
I switched back to Rails for my side project a month ago and ai coding when doing not too complex stuff has been great. While the old NextJS code base was in shambles.
Before I was still doing a good chunk of the NextJS coding. I’m probably going to be directly coding less than 10% of the code base from here on out. I’m now spending time trying to automate things as much as possible, make my workflow better, and see what things can be coded without me in the loop. The stuff I’m talking about is basic CRUD and scraping/crawling.
For serious coding, I’d think coding yourself and having ai as your pair programmer is still the way to go.
"My team is no different—we are producing code at 10x of typical high-velocity team. That's not hyperbole - we've actually collected and analyzed the metrics."
Rofl
"The Cost-Benefit Rebalance"
In here he basically just talks about setting up mock dependencies and introducing intermittent failures into them. Mock dependencies have been around for decades, nothing new here.
It sounds like this test system you set up is as time consuming as solving the actual problems you're trying to solve, so what time are you saving?
"Driving Fast Requires Tighter Feedback Loop"
Yes if you're code-vomiting with agents and your test infrastructure isn't rock solid things will fall apart fast, that's obvious. But setting up a rock solid test infrastructure for your system involves basically solving most of the hard problems in the first place. So again, what? What value are you gaining here?
"The communication bottleneck"
Amazon was doing this when I worked there 12 years ago. We all sat in the same room.
"The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable."
Show the data and proof. Doubt.
Yeah I don't know. This reads like complete nonsense honestly.
Paraphrasing:
"AI will give us huge gains, and we're already seeing it. But our pipelines and testing will need to be way stronger to withstand the massive increase in velocity!"
Velocity to do what? What are you guys even doing?
We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast. Personal experience I had a colleague use Claude Code top create a PR consisting of a dozen files and thousands of line of code for something that could have been done in a couple hundred LOC in a single file.
> We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast.
Can you point me to anyone who knows what they're talking about declaring that LOC is the best productivity metric for AI-assisted software development?
Are you implying that the author of this article doesn't know what they are talking about? Because they basically declared it in the article we just read.
Can you point me to where the author of this article gives any proof to the claim of 10x increased productivity other than the screenshot of their git commits, which shows more squares in recent weeks? I know git commits could be net deleting code rather than adding code, but that's still using LOC, or number of commits as a proxy to it, as a metric.
> I know git commits could be net deleting code rather than adding code…
Yes, I'm also reading that the author believes commit velocity is one reflection of the productivity increases they're seeing, but I assume they're not a moron and has access to many other signals they're not sharing with us. Probably stuff like: https://www.amazon.science/blog/measuring-the-effectiveness-...
I had a coworker use Copilot to implement tab indexing through a Material UI DataGrid. The code was a few hundred lines. I showed them a way to do it in literally one line passed in the slot properties.
"Our testing needs to be better to handle all this increased velocity" reads to me like a euphemistic way of saying "we've 10x'ed the amount of broken garbage we're producing".
This reads like "Hey, we're not vibe coding, but when we do, we're careful!" with hints of "AI coding changes the costs associated with writing code, designing features, and refactoring" sprinkles in to stand out.
As a security researcher, I am both salivating at the potential that the proliferation of TDD and other AI-centric "development" brings for me, and scared for IT at the same time.
Before we just had code that devs don't know how to build securely.
Now we'll have code that the devs don't even know what it's doing internally.
Someone found a critical RCE in your code? Good luck learning your own codebase starting now!
"Oh, but we'll just ask AI to write it again, and the code will (maybe) be different enough that the exact same vuln won't work anymore!" <- some person who is going to be updating their resume soon.
I'm going to repurpose the term, and start calling AI-coding "de-dev".
Just few days ago I spoke with sec guy who was telling me how frustrating it is to validate AI code.
The problem is marketing.
Cycling industry is akin to audiophiles and will swear on their lives that $15,000 bicycle is the pinnacle of human engineering. This year's bike will go 11% faster than the previous model. But if you read last 10 years of marketing materials and do math it should basically ride itself.
There's so much money in AI right now that you can't really expect anyone to say "well, we had hopes, but it doesn't really work the way we expected". Instead you have pitch after pitch, masses parroting CEOs, and everyone wants to get a seat on the hype train.
It's easy to dispel audiophiles or carbon enthusiasts but it's not so easy with AI, because no one really knows how it works. OpenAI released a paper in which they stated, sorry for paraphrasing, "we did this, we did that, and we don't know why results were different".
When Karpathy wrote Software 2.0 I was super excited.
I naively believed that we'll start building black boxes based on requirements, sets of inputs and outputs, and sudden changes of heart from stakeholders that often happen on a daily basis for many of us and mandates almost complete reimagination of project architecture will simply need another pass of training with new parameters.
Instead the mainstream is pushing hard reality where we mass produce a ton of code until it starts to work within guard rails.
No.
The way to code going forward with AI is Test Driven Development. The code itself no longer matters. You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements. That's it. The new reality us programmers need to face is that code itself has an exact value of $0. That's because AI can generate it, and with every new iteration of the AI, the internal code will get better. What matters now are the prompts.
I always thought TDD was garbage, but now with AI it's the only thing that makes sense. The code itself doesn't matter at all, the only thing that matters is the tests that will prove to the AI that their code is good enough. It can be dogshit code but if it passes all the tests, then it's "good enough". Then, just wait a few months and then rerun the code generation with a new version of the AI and the code will be better. The humans don't need to know what the code actually is. If they find a bug, write a new test and force the AI to rewrite the code to include the new test.
I think TDD has really found its future now that AI coding is here to stay. Human code doesn't matter anymore and in fact I would wager that modifying AI generated code is as bad and a burden. We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
Try to do TDD with graphics programming.
I never bought into TDD because it is only usefull for business logic, plain algorithms and data structures, it is no accident that is what 99% of conference talks and books focus on.
There isn't a single TDD talk about shader programming for GPGPU, and validating that what the shader algorithms produce via automated tests, the reason being the amount of enginneering effort only to make it work, and still lacks human sensitivity for what gets rendered.
This is incorrect for a lot of reasons, many of which have already been explored, but also:
> with every new iteration of the AI, the internal code will get better
This is a claim that requires proof; it cannot just be asserted as fact. Especially because there's a silent "appreciably" hidden in there between "get" and "better" which has been less and less apparent with each new model. In fact, it more and more looks like "Moore's law for AI" is dead or dying, and we're approaching an upper limit where we'll need to find ways to be properly productive with models only effectively as good as what we already have!
Additionally, there's a relevant adage in computer science: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." If the code being written is already at the frontier capabilities of these models, how the hell are they supposed to fix the bugs that crop up, especially if we can't rely on them getting twice as smart? ("They won't write the bugs in the first place" is not a realistic answer, btw.)
The argument they are making is that if a bug is discovered, the agent will not debug it, instead a new test case is created, and the code is regenerated (I suppose if a quick fix isn't found). That is why they don't need debugging agent twice as capable as coding agent. I don't know if this works in practice, as in my experience, tests are intertwined with the code base.
While TDD can have some merits, I think this is being way to generous to the value of tests. As Dijkstra said once, "Testing shows the presence, not the absence of bugs." I'm not a devout follower of Uncle Bob, but I was just thumbing through Clean Architecture today and he has a whole section to this point (including the above quote). Right after that quote he writes, "a program can be proven incorrect by a test, but it can not be proven correct." Which is largely true. The only garuntee of TDD is you can show a set of behaviors your program doesn't do, it never proves what the program actually does. To extrapolate to here, all TDD does it put up guardrails for the the AI should not generate.
It depends on how you define testing now: Property-based testing would test sets of behaviors. The main idea is: Formalize your goal before implementing. So specification driven development would be the thing to aim for. And at some point we might be able to model check (proof) the code that has been generated. Then we are the good old idea of code synthesis.
Don't worry, you're going to be searching for logic vs requirements mismatches instead if the thing provides proofs.
That means, you have to understand if it is even proving the properties you require for the software to work.
It's very easy to write a proof akin to a test that does not test anything useful...
> Then, just wait a few months and then rerun the code generation with a new version of the AI and the code will be better.
How many times have you seen a code change that “passed all the tests” take down production or break an important customer’s workflow?
Usually that was just a relatively small change.
Now imagine that you regenerated literally all the code.
The code is the spec. Any other spec comprehensive enough to cover all possible functionality has to be at least as complex as the code.
TDD is testing in production in disguise. After all, bugs are unexpected and you can’t write tests for a bug you don’t expect. Then the bug crops up in production and you update the test suite.
TDD has always been about two things for me; be able to move forward faster because I have something easy to execute that compares it against the known wanted state, and in the future preventing unwanted regressions. I'm not sure I've ever thought of unit testing as "prevent potential future bugs", mostly up front design prevents that, or I'd use property testing, but neither of those are inside the whole "write test then write code" flow.
The intended workflow of TDD is to write a set of tests before some code. The only reason that makes sense conceptually is to prevent possible future bugs from going undetected.
Put another way if your TDD always pass then there’s no point in writing them, and there’s no known bugs before you have any code. So discovering future bugs that didn’t exist when you’re writing those tests is the point.
I don’t really understand how to write tests before the code… When I write code, the hard part is writing the code which establishes the language to solve the problem in, which is the same language the tests will be written in. Also, once I have written the code I have a much better understanding as the problem, and I am in a way better position to write the correct tests.
I am not really sure if TDD often is compatible with modern agile development. It lends well to more waterfall style. Or clearly defined systems.
If you can design fully what your system does before starting it is more reasonable. And often that means going down to level of are inputs and states. Think more of something like control systems for say mobile networks or planes or factory control. You could design whole operation and all states that should happen or could happen before single line of code.
TDD fits better when you use a bottom up style of coding.
For a simple example, FuzzBuzz as a loop that has some if statements inside is not so easy to test. Instead break it in half so you have a function that does the fiddly bits and a loop that just contains “output += MakeFizzBizzLineForNumeber(X);” Now it’s easy to come up tests for likely mistakes and conceptually you’re working with two simpler problems with clear boundaries between them.
In a slightly different context you might have a function that decides which kind of account to create based on some criteria which then returns the account type rather than creating the account. That function’s logic is then testable by passing in some parameters and then looking at the type of account returned without actually creating any accounts. Getting good at this requires looking at programs in a more abstract way, but a secondary benefit is rather easy to maintain code at the cost of a little bookkeeping. Just don’t go overboard, the value is breaking out bits that are likely to contain bugs at some point where abstraction for abstraction’s sake is just wasted effort.
That's great for rote work, simple CRUD, and other things where you already know how the code should work so you can write a test first. Not all programming works well that way. I often have a goal I want to achieve, but no clue exactly how to get there at first. It takes quite a lot of experimentation, iteration and refinement before I have anything worth testing - and I've been programming 40+ years, so it's not because I don't know what I'm doing.
You write the requirements, you write the spec, etc. before you write the code.
You then determine what are the inputs / outputs that you're taking for each function / method / class / etc.
You also determine what these functions / methods / classes / etc. compute within their blocks.
Now you have that on paper and have it planned out, so you write tests first for valid / invalid values, edge cases, etc.
There are workflows that work for this, but nowadays I automate a lot of test creation. It's a lot easier to hack a few iterations first, play with it, then when I have my desired behaviour I write some tests. Gradually you just write tests first, you may even keep a repo somewhere for tests you might use again for common patterns.
AMD, at the time it decided to apply ACL2 to its FPU, had 29 million tests (not lines of code, but test inputs and outputs). ACL2 verification found several bugs in the FPU.
Just to make a couple of points for someone to draw a line.
The code always matters. Black box coding like this leads to systems you can't explain, and that's your whole damn job: to understand the system you're building. Anything less is negligence.
If the code doesn't matter anymore, in order of it to be of any quality the test should be as detailed as was the code in the first place, you'd end up writing the code in tests more or less.
The problem is — nobody commits code that fails tests.
The bugs occur because the initial tests didn’t fully capture the desired and undesired behaviors.
I’ve never seen a formal list of software requirements state that a product cannot take more than an hour to do a (trivial) operation. Nobody writes that out because it’s implicitly understood.
Imagine writing a “life for dummies” textbook on how to grow from a 5yr old to 10yr old. It’s impossible to fully cover.
TDD doesn’t ensure the code is maintainable, extendable, follows best practices, etc, and while AI might write some code that can pass tests while the code is relatively small, I would expect in the long run it will find it extremely difficult to just “rewrite everything based on this set of new requirements” and then do that again, and again, and again, each time potentially choosing entirely different architectures for the solution.
> TDD doesn’t ensure the code is maintainable, extendable, follows best practices, etc, and while AI might write some code
None of that matters of its not a person writing the code
AI has a hard time working with code that humans would consider hard to maintain and hard to extend.
If you give AI a set of tests to pass and turn it loose with no oversight, it will happily spit out 500k LOC when 500 would do. And then it will have a very hard time when you ask it to add some functionality.
AI routinely writes code that is beyond its ability to maintain and extend. It can’t just one shot large code bases either, so any attempt to “regenerate the code” is going to run into these same issues.
> If you give AI a set of tests to pass and turn it loose with no oversight, it will happily spit out 500k LOC when 500 would do. And then it will have a very hard time when you ask it to add some functionality.
I've been playing around with getting the AI to write a program, where I pretend I don't know anything about coding, only giving it scenarios that need to work in a specific way. The program is about financial planning and tax computations.
I recently discovered AI had implemented four different tax predictions to meet different scenarios. All of them incompatible and all incorrect but able to pass the specific test scenarios because it hardcoded which one to use for which test.
This is the kind of mess I'm seeing in the code when AI is left alone to just meet requirements without any oversight on the code itself.
No.
The reason AI code generation works so well is a) it is text based- the training data is huge and b) the output is not the final result but a human readable blueprint (source code), ready to be made fit by a human who can form an abstract idea of the whole in his head. The final product is the compiled machine code, we use compilers to do that, not LLMs.
Ai genereted code is not suitable to be directly transferred to the final product awaiting validation by TDD, it would simply be very inefficient to do so.
> code itself has an exact value of $0. That's because AI can generate it
That's only true for problems that has been solved and well documented before. AI can't solve novel problems. I have ton of examples I use from time to time when new models come out. I've tried to ride the hype train, and I've been frustrated working with people before, but I've never been so frustrated as trying to make AI follow simple set of rules and getting:
My favorite example is tasked AI with a rudimentary task and it gave me a working answer but it was fishy, so I googled the answer and lo and behold I landed on stackoverflow page with exact same answer being top voted answer to question very similar to my task. But that answer also had a ton of comments explaining why you never should do it that way.I've been told many times that "you know, kubernetes is so complicated, but I tell AI what I want and it gives me a command I simply paste in my terminal". Fuck no.
AI is great for scaffolding projects, working with typical web apps where you have repeatable, well documented scenarios, etc.
But it's not a silver bullet.
I remember talking about this with a friend a long time ago. Basically, you'd write up tests and there was a magic engine that would generate code that would self-assemble and pass tests. There was no guarantee that the code would look good or be efficient--just that it passed the tests.
We had no clue that this could actually happen one day in the form of gen AI. I want to agree with you just to prove that I was right!
This is going to bring up a huge issue though: nailing requirements. Because of the nature of this, you're going to have to spec out everything in great detail to avoid edge cases. At that point, will the juice be worth the squeeze? Maybe. It feels like good businesses are thorough with those kinds of requirements.
you will end up with something that passes all your tests then smashes into the back of the lorry the moment it sees anything unexpected
writing comprehensive tests is harder than writing the code
AI can help here too, by exploding the spec into a series of questions to clarify behavior.
Today, it just does something and when corrected it says "You are right!....".
Then you write another test. That's the whole point of TDD. As you keep writing more tests, the closer it gets to its final form.
Have you ever seen someone carve the inverse of a statue from a solid block of stone? If so, they are doing TDD.
Yeah, me neither…
The idea of TDD is that you should have the tests before you have the code. If your code is failing in real life before you have the tests, that's no longer TDD.
right, and by the time I have 2^googolplex tests then the "AI" will finally be able to produce a correctly operating hello world
oh no! another bug!
I've definitely seen a number of files where the implementation is maybe like 500 LOC and the test file is 10000+ LOC.
I agree rigidly defining exactly what the code does through tests is harder than people think.
I mostly agree, but why stop at tests? Shouldn’t it be spec driven development? Then neither the code or the language matter. Wouldn’t user stories and requirements à la bdd (see cucumber) be the right abstraction?
Natural language is too ambiguous for this, which makes it impossible to automatically verify
What you need is indeed spec-driven development, but specs need to be written in some kind of language that allows for more formal verification. Something like https://en.wikipedia.org/wiki/Design_by_contract, basically.
It is extremely ironic that, instead, the two languages that LLMs are the most proficient in - and thus the ones most heavily used for AI coding - are JavaScript and Python...
Maybe one day. I find myself doing plenty of course correction at the test level. Safely zooming out doesn't feel imminent.
I don't think you're wrong but I feel like there's a big bridge between the spec and the code. I think the tests are the part that will be able to give the AI enough context to "get it right" quicker.
It's sort of like a director telling an AI the high level plot of a movie, vs giving an AI the actual storyboards. The storyboards will better capture the vision of the director vs just a high level plot description, in my opinion.
Why stop there? Whichever shareholders flood the datacenter with the most electrical signals get the most profits.
This is wrong in so many ways. Have you even tried what you believe? If you have tried, you would find out it is nonsense quickly.
The irony is that I tried this with a project I've been meaning to bang out for years, and I think the OP's idea a natural thought to have when working with LLMs: "what if TTD but with LLMs"
When I tried it, it "worked", I admittedly felt really good about it, but I stepped away for a few weeks because of life and now I can't tell you how it works beyond the high level concepts I fed into the LLM.
When there's bugs, I basically have to derive from first principles where/how/why the bug happens instead of having good intuition on where the problem lies because I read/wrote/reviewed/integrated with the code myself.
I've tried this method of development with various levels of involvement in implementation itself and the conclusion I came to is if I didn't write the code, it isn't "mine" in every sense of the term, not just in terms of legal or moral ownership, but also in the sense of having a full mental model of the code in a way I can intellectually and intuitively own it.
Really digging into the tests and code, there are fundamental misunderstandings that are very, very hard to discern when doing the whole agent interfacing loop. I believe they're the types of errors you'd only pick up on if you wrote the code yourself, you have to be in that headspace to see the problem.
Also, I'd be embarrassed to put my name on the project, given my lack of implementation, understanding and the overall quality of the code, tests, architecture, etc. It isn't honest and it's clearly AI slop.
It did make me feel really productive and clever while doing it, though.
Not everything can be tested by a computer.
> We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
Yes. The first thing I always check in every project (an especially vibe-coded projects) is whether if:
A. Does it have tests?
B. Is the coverage over 70%?
C. Do the tests actually test for the behaviour of the code (good) or just its implementation (bad.)
If any of those requirements are missing, then that is a red flag for the project.
While TDD is absolutely valuable for clean code, focusing too much on it can be the death of a startup.
As you said the code itself is $0, then the first product is still worth $10 and the finished product is worth $1M+ once it makes money, which is what matters.
Trust but verify. It's not hard.
The corollary being. If you can't (through skill or effort) verify don't trust.
If you break this pattern you deserve all the follies that become you as a "professional".
This is an interesting take — shifting focus from “writing the best code” to “defining the right tests” makes sense in an AI-driven world. But I’m skeptical if treating the generated code as essentially disposable is wise — tests can catch a lot, but they won’t automatically enforce readability, maintainability, or ensure unexpected behaviors don’t slip through
> Instead, we use an approach where a human and AI agent collaborate to produce the code changes. For our team, every commit has an engineer's name attached to it, and that engineer ultimately needs to review and stand behind the code. We use steering rules to setup constraints for how the AI agent should operate within our codebase,
This sounds a lot like Tesla's Fake Self Driving. It self drives right up to the crash, then the user is blamed.
Except here it's made abundantly clear, up front, who has responsibility. There's no pretense that it's fully self driving. And the engineer has the power to modify every bit of that decision.
Part of being a mature engineer is knowing when to use which tools, and accepting responsibility for your decisions.
It's not that different from collaborating with a junior engineer. This one can just churn out a lot more code, and has occasional flashes of brilliance, and occasional flashes of inanity.
> Except here it's made abundantly clear, up front, who has responsibility.
By the people who are disclaiming it, yes.
Idk it’s hard to say it’s called “Full Self Driving” and then the CEO says as much.
It's amazing that their metrics exactly match the mythical "10x engineer" in productivity boost.
This is the first time I see "steering rules" mentioned. I do something similar with Claude, curious how it looks for them and how they integrate it with Q/Kiro.
Those rules are often ignored by agents. Codex is known to be quite adhering, but it falls back to its own ideas, which run counter to rules I‘ve given it. The longer a session goes on, the more it goes off the rails.
I'm aware of the issues around rules as in a default prompt. I had hoped the author of the blog meant a different mechanism when they mentioned "steering rules". I do mean something different, where an agent will self-correct when it is seen going against rules in the initial prompt. I have a different setup myself for Claude Code, and would call parts of that "steering"; adjusting the trajectory of the agent as it goes.
With Claude Code, you can intercept its prompts if you start it in a wrapper and mock fetch (someone with github user handle „badlogic“ did this, but I can’t find the repo now). For all other things (and codex, Cursor) you‘d need to proxy/isolate all comms with the system heavily.
Everything related to LLMs is probabilistic, but those rules are also often followed well by agents.
Yes they do, most of the time. Then they don’t. Yesterday, I told codex that it must always run tests by invoking a make target. That target is even configurable w/ parameters, eg to filter by test name. But always, at some point in the session, codex started disregarding that rule and fell back to using the platform native test tool directly. I used strong language to steer it back, but 20% or so of context later, it did that again.
"steering rules" is a core feature baked into Kiro. It's similar to the spec files use in most agentic workflows but you can use exclusion and inclusion rules to avoid wasting context.
There's currently not an official workflow on how to manage these steering files across repos if you want to have organisation-wide standards, which is probably my main criticism.
I'd assume it's related to this Amazon "Socratic Human Feedback (SoHF): Expert Steering Strategies for LLM Code Generation" paper: https://assets.amazon.science/bf/d7/04e34cc14e11b03e798dfec5...
This article is right, but I think it may underplay the changes that could be coming soon. For instance, as the top comment here about TDD points out, the actual code does not matter anymore. This is an astounding claim! And it has naturally received a lot of objections in the replies.
But I think the objections can mostly be overcome with a minor adjustment: You only need to couple TDD with a functional programming style. Functional programming lets you tightly control the context of each coding task, which makes AI models ridiculously good at generating the right code.
Given that, if most of your code is tightly-scoped, well-tested components implementing orthogonal functionality, the actual code within those components will not matter. Only glue code becomes important and that too could become much more amenable to extensive integration testing.
At that point, even the test code may not matter much, just the test-cases. So as a developer you would only really need to review and tweak the test cases. I call this "Test-Case-Only Development" (TCOD?)
The actual code can be completely abstracted away, and your main task becomes design and architecture.
It's not obvious this could work, largely because it violates every professional instinct we have. But apparently somebody has even already tried it with some success: https://www.linkedin.com/feed/update/urn:li:activity:7196786...
All the downsides that have been mentioned will be true, but also may not matter anymore. E.g. in a large team and large codebase, this will lead to a lot of duplicate code with low cohesion. However, if that code does what it is supposed to and is well-tested, does the duplication matter? DRY was an important principle when the cost of code was high, and so you wanted to have as much leverage as possible via reuse. You also wanted to minimize code because it is a liability (bugs, tech debt, etc.) and testing, which required even more code that still didn't guarantee lack of bugs, was also very expensive.
But now that the cost of code is plummeting, that calculus is shifting too. You can churn out code and tests (including even performance tests, which are always an afterthought, if thought of at all) at unimaginable rates.
And all this while reducing the dependencies of developers on libraries and frameworks and each other. Fewer dependencies means higher velocity. The overall code "goodput" will likely vastly outweight inefficiences like duplication.
Unfortunately, as TFA indicates, there is a huge impedance mismatch with this and the architectures (e.g. most code is OO, not functional), frameworks, and processes we have today. Companies will have to make tough decisions about where they are and where they want to get.
I suspect AI-assisted coding taken to its logical conclusion is going to look very different from what we're used to.
The biggest thing that stood out to me was that they suddenly started working nonstop, even on weekends…? If AI is so great, why can’t they get a single day off in two months?
But here's the critical part: the quality of what you are creating is way lower than you think, just like AI-written blog posts.
Upvoted for dig that is also an accurate and insightful metaphor.
Absolutely none of that article has ever even so much as brushed past the colloquial definition of "calculus".
These guys actually seem rattled now.
Well, 'calculus' is the kind of marketing word that sounds more impressive than 'arithmetic' and I think 'quantum logic' has gone a bit stale, and 'AI-based' might give more hope to the anxious investor class, as 'AI-assisted' is a bit weak as it means the core developer team isn't going to be cut from the labor costs on the balance sheet, they're just going to be 'assisted' (things like AI-written unit tests that still need some checking).
"The Arithmetic of AI-Assisted Coding Looks Marginal" would be the more honest article title.
Yes, unfortunately a phrase that's used in an attempt to lend gravitas and/or intimidate people. It sort of vaguely indicates "a complex process you wouldn't be interested in and couldn't possibly understand". At the same time it attempts to disarm any accusation of bias in advance by hinting at purely mechanistic procedures.
Could be the other way around, but I think marketing-speak is taking cues here from legal-ese and especially the US supreme court, where it's frequently used by the justices. They love to talk about "ethical calculus" and the "calculus of stare decisis" as if they were following any rigorous process or believed in precedent if it's not convenient. New translation from original Latin: "we do what we want and do not intend to explain". Calculus, huh? Show your work and point to a real procedure or STFU
"Galaxy-brain pair programming with the next superintelligence"
Classic LLM article:
1) Abstract data showing an increase in "productivity" ... CHECK
2) Completely lacking in any information on what was built with that "productivity" ... CHECK
Hilarious to read this on the backend of the most widely publicized AWS failure.
Please don't post shallow, snarky dismissals on HN. The guidelines ask us to be more thoughtful in the way we respond to things:
https://news.ycombinator.com/newsguidelines.html
Yep. The problem is then leadership sees this and says "oh, we too can expect 10x productivity if everyone uses these tools. We'll force people to use them or else."
And guess what happens? Reality doesn't match expectations and everyone ends up miserable.
Good engineering orgs should have engineers deciding what tools are appropriate based on what they're trying to do.
Another day, and another smart person finally discovers the benefits of leveraging AI to write code.
Correct TDD involves solving all the hard problems in the process. What gain does AI give you then?
If you are producing real results at 10x then you should be able to show that you are a year ahead of schedule in 5 weeks.
Waiting to see anyone show even a month ahead of schedule after 6 months.
I've never worked anywhere that knew where they were going well enough that it was even possible to be a month ahead of schedule. By the time a month has elapsed the plan is entirely different.
AI can't keep up because its context window is full of yesteryear's wrong ideas about what next month will look like.
Yeah, this is the main problem. Writing of code just isn't the bottle neck. It's the discovery of the business case that is the hard part. And if you don't know what it is, you can't prompt your way out of it.
We've been having a go around with corporate leadership at my company about "AI is going to solve our problems". Dude, you don't even know what our problems are. How are you going to prompt the AI to analyze a 300 page PDF on budget policy when you can't even tell me how you read a 300 page PDF with your eyes to analyze the budget policy.
I'm tempted to give them what they want: just a chatter box they can ask, "analyze this budget policy for me", just so I can see the looks on their faces when it spits out five poorly written paragraphs full of niceties that talk its way around ever doing any analysis.
I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
> I'm tempted to give them what they want: just a chatter box they can ask, "analyze this budget policy for me", just so I can see the looks on their faces when it spits out five poorly written paragraphs full of niceties that talk its way around ever doing any analysis.
Ah, but they'll love it.
> I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
We're already doing all the same stuff, except today it's not the AI that's doing that, it's people. One overworked and stressed person somewhere makes for a poorly designed, buggy library, and then millions of other overworked and stressed people spend most of their time at work finding out how to cobble dozens of such poorly designed and buggy pieces of code together into something that kinda sorta works.
This is why the top management is so bullshit on AI. It's because it's a perfect fit for a model that they have already established.
I've got my own gripes about leadership, but I'm finding that even when its a goal I've set for myself, watching an AI fail at it represents a refinement of what I thought I wanted: I'm not much better than they are.
That, or its a discovery of why what I wanted is impossible and it's back to the drawing board.
It's nice to not be throwing away code that I'd otherwise have been a perfectionist about (and still thrown away).
Looking at the “metrics” they shared, going from committing just about zero code over the last two years to more than zero in the past two months may be a 10x improvement. I haven’t seen any evidence more experienced developers see anywhere near that speedup.
The "metrics" is hilarious. The "before AI" graph looks like those meme about FAANG engineers who sit around and basically do nothing.
when copilot auto-completes 10 lines with 99% accuracy with a press of tab, does that not save you time typing those lines?
Sure, but probably your pre-copilot IDE was autocompleting 7-8 of those lines anyway, just by playing type tetris, and typing the code out was never the slow part?
These are the kind of people that create two letter aliases to avoid typing “git pull” or whatever. Congrats, very efficient, saving 10 seconds per day.
So, yeah, they probably think typing is a huge bottle neck and it’s a huge time saver.
What if I told you the hard part is not the typing.
Is typing 10 lines of code your bottleneck?
Why would that save me a significant amount of time versus writing the code myself means I don't have to spend a bunch of time analyzing it to figure it what it does?
"We have real mock versions of all our dependencies!"
Congratulations, you invented end-to-end testing.
"We have yellow flags when the build breaks!"
Congratulations! You invented backpressure.
Every team has different needs and path dependencies, so settles on a different interpretation of CI/CD and software eng process. Productizing anything in this space is going to be an uphill battle to yank away teams' hard-earned processes.
Productizing process is hard but it's been done before! When paired with a LOT of spruiking it can really progress the field. It's how we got the first CI/CD tools (eg. https://en.wikipedia.org/wiki/CruiseControl) and testing libraries (eg. pytest)
So I wish you luck!
"For me, roughly 80% of the code I commit these days is written by the AI agent" Therefore, it is not commited by you, but by you in the name of AI agent and the holy slop. What to say, I hope that 100x productivity is worth it and you are making tons of money. If this stuff becomes mainstream, I suggest open source developers stop doing the grind part, stop writing and maintaining cool libraries and just leave all to the productivity guys, let's see how far they get. Maybe I've seen too many 1000x hacker news..
Just need the feedback to follow suit to be 100x as effective. Tests, docs and rapid loops of guidance with human in the loop. Split your tasks, find the structure that works.
I think it's fine. For example, "I" made this library https://github.com/anchpop/weblocks . It might be more accurate to say that I directed AI to make it, because I didn't write a line of code myself. (And I looked at the code and it is truly terrible.) But I tested that it works, and it does, and it solves my problem perfectly. Yes, it is slop, but this is a leaf node in the abstraction graph, and no one needs to look at it again now that it it written
Most code, though, is not write once and ignore. So it does matter if its crap, because every piece of software is only as good as its least dependency.
Fine for just you. Not fine for others, not fine for business, not fine the moment you star count starts moving.
Interesting enough to me though I only skimmed.
I switched back to Rails for my side project a month ago and ai coding when doing not too complex stuff has been great. While the old NextJS code base was in shambles.
Before I was still doing a good chunk of the NextJS coding. I’m probably going to be directly coding less than 10% of the code base from here on out. I’m now spending time trying to automate things as much as possible, make my workflow better, and see what things can be coded without me in the loop. The stuff I’m talking about is basic CRUD and scraping/crawling.
For serious coding, I’d think coding yourself and having ai as your pair programmer is still the way to go.
Lots of reasonable criticisms being down-voted here. Are we being AstroTurfed? Is HN falling victim to the AI hype train money too now?
I'm downvoting most of the criticisms because in general they can be summarized as "all AI is slop". In my experience that simply isn't true.
This article attempted to outline a fairly reasonable approach to using AI tooling, and the criticisms hardly seem related to it at all.
first the Microsoft guy touting agents
now AWS guy doing it !
"My team is no different—we are producing code at 10x of typical high-velocity team. That's not hyperbole - we've actually collected and analyzed the metrics."
Rofl
"The Cost-Benefit Rebalance"
In here he basically just talks about setting up mock dependencies and introducing intermittent failures into them. Mock dependencies have been around for decades, nothing new here.
It sounds like this test system you set up is as time consuming as solving the actual problems you're trying to solve, so what time are you saving?
"Driving Fast Requires Tighter Feedback Loop"
Yes if you're code-vomiting with agents and your test infrastructure isn't rock solid things will fall apart fast, that's obvious. But setting up a rock solid test infrastructure for your system involves basically solving most of the hard problems in the first place. So again, what? What value are you gaining here?
"The communication bottleneck"
Amazon was doing this when I worked there 12 years ago. We all sat in the same room.
"The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable."
Show the data and proof. Doubt.
Yeah I don't know. This reads like complete nonsense honestly.
Paraphrasing: "AI will give us huge gains, and we're already seeing it. But our pipelines and testing will need to be way stronger to withstand the massive increase in velocity!"
Velocity to do what? What are you guys even doing?
Amazon is firing 30,000 people by the way.
We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast. Personal experience I had a colleague use Claude Code top create a PR consisting of a dozen files and thousands of line of code for something that could have been done in a couple hundred LOC in a single file.
> We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast.
Can you point me to anyone who knows what they're talking about declaring that LOC is the best productivity metric for AI-assisted software development?
Are you implying that the author of this article doesn't know what they are talking about? Because they basically declared it in the article we just read.
Can you point me to where the author of this article gives any proof to the claim of 10x increased productivity other than the screenshot of their git commits, which shows more squares in recent weeks? I know git commits could be net deleting code rather than adding code, but that's still using LOC, or number of commits as a proxy to it, as a metric.
> I know git commits could be net deleting code rather than adding code…
Yes, I'm also reading that the author believes commit velocity is one reflection of the productivity increases they're seeing, but I assume they're not a moron and has access to many other signals they're not sharing with us. Probably stuff like: https://www.amazon.science/blog/measuring-the-effectiveness-...
I had a coworker use Copilot to implement tab indexing through a Material UI DataGrid. The code was a few hundred lines. I showed them a way to do it in literally one line passed in the slot properties.
"Our testing needs to be better to handle all this increased velocity" reads to me like a euphemistic way of saying "we've 10x'ed the amount of broken garbage we're producing".
if you've ever had a friend that you knew before, then they went to work at amazon, it's like watching someone get indoctrinated into a cult
and this guy didn't survive there for a decade by challenging it
TLDR: ai changes the economic calculus of software development. It makes automated testing more beneficial in comparison to costs.
I think he is right.
This reads like "Hey, we're not vibe coding, but when we do, we're careful!" with hints of "AI coding changes the costs associated with writing code, designing features, and refactoring" sprinkles in to stand out.