Having used several external orchestrators I can see the appeal of the simplicity of this approach, especially for smaller teams wanting to limit the amount of infrastructure to maintain. Postgres is a proven tool, and as long as you design `@step`s to each perform one side effect, I can see this scaling very well both in terms of performance and maintainability.
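A rough sketch of what "one side effect per step" can look like, assuming the DBOS-style `@DBOS.workflow()`/`@DBOS.step()` decorators the post describes (setup/config boilerplate omitted, endpoints hypothetical; check the docs for exact names):

```python
# Sketch only: decorator names and setup assume the DBOS-style API from the post
# (config/launch boilerplate omitted); endpoints are hypothetical.
import requests
from dbos import DBOS

@DBOS.step()  # one side effect: create the contact in the CRM
def create_crm_contact(email: str) -> str:
    resp = requests.post(
        "https://crm.example.com/contacts",           # hypothetical endpoint
        json={"email": email},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["contact_id"]

@DBOS.step()  # one side effect: send the welcome email
def send_welcome_email(email: str, contact_id: str) -> None:
    requests.post(
        "https://mail.example.com/send",              # hypothetical endpoint
        json={"to": email, "template": "welcome", "contact": contact_id},
        timeout=10,
    ).raise_for_status()

@DBOS.workflow()
def onboard_user(email: str) -> None:
    contact_id = create_crm_contact(email)   # checkpointed once it succeeds
    send_welcome_email(email, contact_id)    # a crash here never re-creates the contact
```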
Durable execution is best done at the level of a language implementation, not as a library.
A workflow engine I recently built provided an interpreter for a Scheme-based language that, for each blocking operation, took a snapshot of the interpreter state (heap + stack) and persisted that to a database. Each time an operation completes (which could be after hours/days/weeks), the interpreter state is restored from the database and execution proceeds from the point at which it was previously suspended. The interpreter supports concurrency, allowing multiple blocking operations to be in progress at the same time, so the work to be done after the completion of one can proceed even while others remain blocked.
The advantage of doing this at the language level is that persistence becomes transparent to the programmer. No decorators are needed; every function and expression inherently has all the properties of a "step" as described here. Deterministic execution can be provided if needed. And if there's a need to call out to external code, it is possible to expose Python functions as Scheme built-ins that can be invoked from the interpreter either synchronously or asynchronously.
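Not the engine described above, obviously, but a toy Python sketch of the suspend/restore idea, with a file standing in for the database and a program counter plus environment standing in for the real heap/stack:

```python
# Toy illustration only: persist "interpreter state" at a blocking operation
# and resume later from the snapshot. All names here are invented for the sketch.
import pickle

class Await:
    """Marks a blocking external operation; 'target' names the variable to bind on resume."""
    def __init__(self, target):
        self.target = target

def run(program, pc=0, env=None, snapshot_path="state.pkl"):
    env = env if env is not None else {}
    while pc < len(program):
        instr = program[pc]
        if isinstance(instr, Await):
            # "Snapshot" the interpreter state: just a program counter and environment
            # written to a file, standing in for heap/stack persisted to a database.
            with open(snapshot_path, "wb") as f:
                pickle.dump({"pc": pc + 1, "env": env}, f)
            return None                      # suspended; no thread or memory is held
        instr(env)                           # ordinary instruction: a function mutating env
        pc += 1
    return env

def resume(program, result, target, snapshot_path="state.pkl"):
    with open(snapshot_path, "rb") as f:
        state = pickle.load(f)
    state["env"][target] = result            # deliver the completed operation's result
    return run(program, state["pc"], state["env"], snapshot_path)

# Hours or days can pass between these two calls; only the snapshot file survives.
program = [
    lambda env: env.update(greeting="hello"),
    Await(target="reply"),
    lambda env: print(env["greeting"], env["reply"]),
]
run(program)                                  # suspends at the Await
resume(program, "world", target="reply")      # prints: hello world
```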
I see a lot of workflow engines released that almost get to the point of being like a traditional programming language interpreter but not quite, exposing the structure of the workflow using a DAG with explicit nodes/edges, or (in the case of DBOS) as decorators. While I think this is ok for some applications, I really believe the "workflow as a programming language" perspective deserves more attention.
There's a lot of really interesting work that's been done over the years on persistent systems, and especially orthogonal persistence, but sadly this has remained confined to the research literature. Two real-world systems that do implement persistence at the language level are Ethereum and Smalltalk; also some of the older Lisp-based systems provided similar functionality. I think there's a lot more value waiting to be mined from these past efforts.
This seems like Temporal, only without as much server infrastructure and complexity. Maybe they're ignoring something, or maybe it really is that simple.
Overall really cool! Some of the scalability concerns that have been raised seem valid, but maybe you run a Postgres server backing every few application servers that need this kind of execution. Also, every function shouldn't be its own step; the work needs to be divided into larger chunks so that each request only generates <10 steps (a sketch of this is below).
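A sketch of that chunking idea, again assuming a DBOS-style decorator API (names may differ from the real library):

```python
# Sketch of chunking: one step per batch rather than one step per item, so a
# 1,000-item request produces ~10 checkpoints instead of 1,000.
from dbos import DBOS

@DBOS.step()
def process_batch(items: list[str]) -> int:
    for item in items:
        ...                                   # real per-item work goes here
    return len(items)                         # the whole batch retries together on failure

@DBOS.workflow()
def process_request(items: list[str], batch_size: int = 100) -> int:
    done = 0
    for start in range(0, len(items), batch_size):
        done += process_batch(items[start:start + batch_size])  # one checkpoint per batch
    return done
```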
The example is overly simplified. It glosses over many of the subtle-but-important aspects of durable execution.
For example:
- Steps should be small but fallible operations - eg. sending a request to an external service. You generally want to tailor the retry logic on steps to the specific task they are doing. Doing too much in a step can increase failure rates or cause other problems due to the at-least-once behaviour of steps.
- The article makes a big deal of "Hello" being printed 5 times in the event of a crash, but durable execution doesn't guarantee this! You can never get exactly-once guarantees for side-effectful functions like this without cooperation from the other side. For example, if the external service supports idempotency via request IDs, then you can generate an ID in a separate step and use it in your request to get exactly-once behaviour (see the sketch after this list). However, most services don't offer this. Crashes during a step will cause the step to re-run, so durable execution only gives you at-least-once behaviour.
- Triggering the workflow itself is a point of failure. In the example, the workflow decorator generates an ID for the workflow internally, but for triggering a workflow exactly once the workflow ID needs to be externally generated.
- The solution is lightweight in terms of infrastructure, but not at all lightweight in terms of performance.
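A sketch of the idempotency-key pattern from the second bullet, assuming a DBOS-style `@step` decorator and an external service that honours an `Idempotency-Key` header (both are assumptions, and the endpoint is hypothetical):

```python
# Sketch only: exactly-once here depends on the external service deduplicating by the
# Idempotency-Key header; decorator names and the endpoint are assumed, not real APIs.
import uuid
import requests
from dbos import DBOS

@DBOS.step()
def make_idempotency_key() -> str:
    # Generated in its own step so that a crash later replays the *same* key.
    return str(uuid.uuid4())

@DBOS.step()
def send_payment(key: str, amount_cents: int) -> None:
    requests.post(
        "https://payments.example.com/charges",    # hypothetical endpoint
        headers={"Idempotency-Key": key},          # the service dedupes retries by this key
        json={"amount": amount_cents},
        timeout=10,
    ).raise_for_status()

@DBOS.workflow()
def pay(amount_cents: int) -> None:
    key = make_idempotency_key()
    send_payment(key, amount_cents)   # at-least-once attempts, at-most-once effect
```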
My reaction is "No way, not again!!". I have personally done this kind of internal orchestration at scale at a large enterprise, spanning millions of executions, and it had scalability problems. We eventually externalized it to bring back sanity.
Could you describe the scalability issue? Was it due to maintaining a bespoke execution engine in-house, the limits of RDBMS vertical scaling, or something else?
I don’t think what you are describing as heavy is that big of a deal if an external orchestration system is required only for deployment, while the workflow can be developed and tested without a server on a laptop or notebook.
Bringing in orchestration logic in the app layer means there is more code being bundled with the app, which has its own set of tradeoffs - like bringing in a different set of code dependencies which might conflict with application code.
In 2025, I would be surprised if a good workflow engine didn’t have a completely server-less development mode :)
My immediate reaction is hell no.
> In some sense, external orchestration turns individual applications into distributed microservices, with all the complexity that implies.
I'd argue that durable execution is intrinsically complex and external orchestrators give you tools to manage that complexity, whereas this attempts to brush the complexity under the rug in a way that does not inspire confidence.
What value does an external orchestrator add for managing complexity that an “embedded” solution could not?
I think the example given in this blog post might need a "health warning" that steps should, generally, be doing more than just printing "hello".
I can imagine that the reads and writes to Postgres for a large number of workflows, each with a large number of small steps called in a tight loop, would cause some significant performance problems.
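To put rough, invented numbers on it: if every step costs at least one checkpoint write, then 1,000 concurrent workflows each looping over 1,000 tiny steps is on the order of a million Postgres writes for very little logical work, before counting workflow-status updates or the reads needed on recovery.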
The examples given on their main site are a little more meaningful.