Use Go: it has built-in goroutines and plenty of libraries that let you implement your own workers.
If you're running a single instance, you don't even need any synchronization. If you're running multiple instances of your app, implement locking (this actually works in any language, not just Go; Go just helps with the multiple long-running workers part, while with other languages you just run multiple instances).
Process:
1. Each worker starts up with its own ID; a random UUID works.
2. When you need to create a task, insert it into the tasks table, do nothing else, and exit.
3. Each worker, running on a loop or cron, sets a lock on a subset of the tasks, e.g.:
update tasks set workerId = :myUUID, lockUntil = now() + interval '10 minutes' where (workerId is null or lockUntil < now()) and completed = false
Or use SELECT ... FOR UPDATE, or whatever else keeps other workers from setting their IDs on the same rows at the same time.
4. Once that's done, pull all tasks assigned to your worker, execute them, clear the lock, and mark them completed.
5. If your worker crashes, another worker can pick up its tasks after the lock expires.
No Redis, no additional libraries, still distributed.
Yeah, this is pretty much what I end up doing as well.
It works, but I keep rewriting the same task table / locking / retry logic in every project, which is why I'm wondering if it makes sense to move this out into a separate service.
Not sure if it's actually a real problem or just my workflow though.
I would create a library, make the logic more generic, create a generic table (task id, taskType, workerId, etc.), and store task metadata as jsonb so it can be pulled and unmarshalled into typed data by users.
Import it into your projects.
Make the library work standalone. But also build a task manager service that people can deploy if they want to run it outside their code.
Then offer a hosted solution that does the webhooks.
I’m sure someone will want to pay for it.
That makes sense, and this is actually close to what I keep ending up with in different projects.
I usually start with something simple, then add a task table, then locking, retries, then some kind of worker process, and eventually it turns into a small job system anyway.
At some point it starts feeling like I'm rebuilding the same queue/worker setup over and over, which is why I'm wondering if this should live outside the app entirely.
Thanks, this discussion is really helpful.
The Postgres-as-queue pattern works well past most small SaaS traffic levels. SELECT ... FOR UPDATE SKIP LOCKED (Postgres 9.5+) is the key primitive -- it lets multiple workers poll safely without deadlocks. Oban (Elixir), River (Go), GoodJob (Rails), and pg_boss (Node) are solid implementations.
Where Redis genuinely makes more sense: fan-out pub/sub to many subscribers, rate-limiting across distributed nodes, or burst volumes where Postgres lock contention shows up. For most indie SaaS those thresholds are rarely hit.
The HTTP callback model you are exploring is roughly what Inngest and Trigger.dev offer. It works, but you give up transactional job creation -- inserting into your jobs table in the same transaction as your domain write. Without that, an app crash between row saved and job enqueued creates silent failures.
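In SQL terms, the two ideas above look roughly like this (table and column names are illustrative):

```sql
-- Claim a batch with SKIP LOCKED: concurrent workers skip each other's
-- locked rows instead of blocking or deadlocking (Postgres 9.5+).
UPDATE jobs SET locked_by = 'worker-1', locked_at = now()
WHERE id IN (
  SELECT id FROM jobs
  WHERE locked_by IS NULL AND completed = false
  ORDER BY created_at
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
RETURNING id;

-- Transactional job creation: the domain write and the job row commit
-- or roll back together, so there's no "row saved but job never
-- enqueued" gap to fall into.
BEGIN;
INSERT INTO orders (id, status) VALUES (42, 'paid');
INSERT INTO jobs (task_type, payload) VALUES ('send_receipt', '{"order_id": 42}');
COMMIT;
```

An external runner can't participate in that transaction, which is exactly the trade-off being discussed.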
The transactional enqueue issue is exactly what makes me unsure about the callback model.
With in-app queues you get the nice property that the job can be created in the same DB transaction, which is harder to keep once the runner lives outside the app.
I'm trying to understand if the simplicity of an external runner is worth that trade-off for smaller apps.
Inngest's Durable Endpoints aim to solve the durable API problem without messy DB txns, all within some tolerance: https://www.inngest.com/docs/learn/durable-endpoints.
Other than that, yes, durable execution does all of this for you.
TLDR on Durable Endpoints: you can use steps directly in API endpoints, which checkpoint state in the background and then retry on failure. This means you can run jobs in the background _somewhat_ transactionally (somewhat, because there's a delay between checkpoints) to minimize the tradeoffs here. And if you want full transactionality, don't buffer checkpoints in the background; do them synchronously instead.
Also, Redis is good for medium-scale load. We're hitting millions of RPS (aggregated) on our services (I work at Inngest), and it doesn't scale well at this load at all. We had to invest in other infra.
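The step/checkpoint idea can be sketched generically (this is a toy illustration of durable execution, not Inngest's actual API): each step's result is persisted under a step name, and a retried run returns saved results instead of re-executing completed steps.

```go
package main

import "fmt"

// checkpoints is a stand-in for durable storage (a DB table in practice).
type checkpoints map[string]string

// step runs fn at most once per name: if a checkpoint exists, it returns
// the saved result instead of re-executing, which is what lets a retried
// run resume after the last completed step.
func step(cp checkpoints, name string, fn func() string) string {
	if v, ok := cp[name]; ok {
		return v // already checkpointed; skip re-execution
	}
	v := fn()
	cp[name] = v // persisting synchronously here; buffering it trades safety for speed
	return v
}

// run is a workflow of two steps; calls counts actual executions.
func run(cp checkpoints, calls *int) {
	step(cp, "charge", func() string { *calls++; return "charged" })
	step(cp, "email", func() string { *calls++; return "sent" })
}

func main() {
	cp := checkpoints{}
	calls := 0
	run(cp, &calls) // first attempt executes both steps
	run(cp, &calls) // a "retry" replays from checkpoints and executes nothing
	fmt.Println(calls) // 2
}
```

The "somewhat transactional" caveat above corresponds to whether the `cp[name] = v` write is synchronous or buffered.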
It’s a real problem you’re solving but the good news is that it’s already solved! You don’t have to build it yourself.
You’re looking for durable execution to solve your problem.
If you’re already running Postgres, check out DBOS[0]. It turns your app into its own durable executor using your database for coordination.
[0] https://github.com/dbos-inc/dbos-transact-golang
Thanks for the link, DBOS looks interesting.
Durable execution seems to be exactly the problem space here.
Most approaches keep the executor inside the app and use the DB for coordination, which works but still means rebuilding similar infrastructure in every project.
I'm trying to understand if there's room for a simpler external runner for smaller apps.
> Most approaches keep the executor inside the app and use the DB for coordination
Actually the opposite. DBOS is the only one that does it that way. All the other durable execution solutions require an external coordinator.
I think your assessment is incorrect though. Using the app and DB does not mean building new infra for every project (your project already has an app and a DB). You simply import the library and it does the rest, and then you write your code normally. Every other solution requires building new, similar infrastructure for every project (the external coordinator).
The language-native library approach is the far simpler and more repeatable way to do it.
As another example, an AI with the right context can build a DBOS app, often in one shot. An AI will take a ton of work to build your terraform or YAML to set up the infra you need to use an external coordinator.
The beauty of the DBOS solution is that it works great for tiny apps and massive apps with billions of checkpoints, and everything in between.
What if it takes a long time to process the callback? Some servers don't handle this well by default and you have to customise to make it work.
I use Django with Procrastinate which uses Postgres for the task backend. Took a while to find the right Django setup, but it works like a dream.
Yeah, that's usually when async becomes unavoidable for me too — long tasks, retries, scheduling.
Even with Postgres queues I still end up doing a fair bit of setup, which makes me wonder if this should live outside the app.
All the apps I've worked on lately in Rails use GoodJob, which is a Postgres NOTIFY/LISTEN-based queue system.
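For reference, the NOTIFY/LISTEN mechanism looks roughly like this (a generic sketch with illustrative names, not GoodJob's actual trigger): workers LISTEN on a channel and a trigger NOTIFYs on insert, so workers wake immediately instead of polling.

```sql
-- Each worker connection subscribes to a channel:
LISTEN jobs_channel;

-- A trigger fires a notification whenever a job row is inserted:
CREATE FUNCTION notify_new_job() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('jobs_channel', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER jobs_notify AFTER INSERT ON jobs
FOR EACH ROW EXECUTE FUNCTION notify_new_job();
```

Notifications aren't durable (a disconnected worker misses them), so implementations typically pair this with a fallback poll.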
Yes, I've seen a lot of Postgres-based queues lately too.
Even without Redis I still end up rebuilding some kind of job system on top of the DB, which is why I'm wondering if this should live outside the app entirely.
One approach that sidesteps the whole problem: design for fully synchronous, stateless requests from the start so there's nothing to queue.
I did this for a financial calculator API — every request is pure computation, inputs in, result out, nothing persisted. No Redis, no workers, no task table, no locking. The response is ready before a user would notice a queue anyway (sub-50ms).
Obviously only works when tasks complete in milliseconds. But figassis's pattern of "starts simple, then incrementally grows into a small job system anyway" often happens because the initial scope could have been fully synchronous — the async complexity creeps in before it's actually needed.
Worth asking first: does this task genuinely have to be async, or is it just easier to model it that way?
Yeah, that makes sense too. I also try to keep things synchronous as long as possible.
In practice async usually shows up once there are external APIs, retries, scheduling, or anything that shouldn't block the request, and that's where I end up building some kind of job system again.
I'm trying to figure out if that point happens often enough to justify moving this outside the app entirely.