I like to use uuid5 for this. It produces unique keys within a given namespace (itself defined by a UUID), but it also takes an input key and produces an identical output ID for the same input key.
This has a number of nice properties:
1. You don’t need to store keys in any special way. Just make them a unique column of your db and the db will detect duplicates for you (and you can provide handling logic as required, e.g. ignoring the message if the other input fields are the same, or raising an error if a message has the same idempotency key but different fields).
2. You can reliably generate new downstream keys from an incoming key without the need for coordination between consumers, getting an identical output key for a given input key regardless of consumer.
3. In the event of a replayed message it’s fine to republish downstream events because the system is now deterministic for a given input, so you’ll get identical output (including generated messages) for identical input, and generating duplicate outputs is not an issue because this will be detected and ignored by downstream consumers.
4. This parallelises well because consumers are deterministic and don’t require any coordination except by db transaction.
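A minimal sketch of this scheme in Python; the namespace UUID and key names here are hypothetical illustrations, not from the comment:

```python
import uuid

# Hypothetical fixed namespace for one system; any constant UUID works
# (e.g. generate one with uuid.uuid4() once and hard-code it).
PAYMENTS_NS = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

def idempotency_key(input_key: str) -> uuid.UUID:
    """Deterministic: the same input key always maps to the same UUID."""
    return uuid.uuid5(PAYMENTS_NS, input_key)

# Any consumer, on any machine, derives the same key with no coordination:
k1 = idempotency_key("payment-12345")
k2 = idempotency_key("payment-12345")
assert k1 == k2

# Downstream keys can be derived from incoming keys just as deterministically,
# so replaying an input regenerates identical output messages:
downstream = uuid.uuid5(PAYMENTS_NS, f"settlement:{k1}")
```

Storing the key in a column with a UNIQUE constraint then makes the database itself the duplicate detector, as described in point 1.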
This was my exact solution in the late 1990s, built on a UID algorithm I created when confronted with a growing payment-processing load that the centralized hardware of the time could not handle. MS SQL Server could not keep up with the ever-increasing load, yet the firehose of real-time payment transaction volume could not be turned off, so we devised an interim parallel solution, using microservices built on this technique, to walk everything over to Oracle. Everything old is new again as the patterns and cycles ebb and flow.
Failure-resistant systems end up containing a bespoke implementation of a project-management workflow, treating each task like a project to be managed from start to finish, with milestones along the way.
Another POV is that solutions requiring no long-term "durable workflow"-style storage provide exponentially more value. If you are making something that requires durable workflows, you ought to spend a little time in product development so that it does not require durable workflows, instead of a ton of time making something that isn't very useful durable.
For example, you can conceive of a software vendor that does the end-to-end of a real estate transaction: escrow, banking, signature, etc. The IT required to support the model of such a thing would be staggering. Does it make sense to do that kind of product development? That is reinventing all of SAP, on top of solving your actual problem. Or making the mistake of adopting Temporal, Trigger, etc., whose vendors think they have a smaller problem than building all of SAP and spend considerable resources convincing you that they do.
The status quo is that everyone focuses on their own little part and does it as quickly as possible. The need for durable workflows is a bad sign. You should look at the problem as: make buying and selling homes much faster and simpler, or even change the order of operations so that less durability is required; don't re-enact the status quo as an IT-driven workflow.
Chesterton's Fence, no?
Why are real-estate transactions complex and full of paperwork? Because there are history books filled with fraud. There are other types of large transactions that also involve a lot of paperwork too, for the same reason.
Why does a company have extensive internal tracing of the progress of their business processes, and those of their customers? Same reason, usually. People want accountability and they want to discourage embezzlement and such things.
Durable workflows are just distributed state machines. The complexity is there because guaranteeing a machine will always be available is impossible.
Interesting thought but how do you sell an idea that sounds like...
"How we've been doing things is wrong and I am going to redesign it in a way that no one else knows about so I don't have to implement the thing that's asked of me"
Haha, another way of describing what you are saying is enterprise sales: “give people exactly what they ask for, not what makes the most sense.”
Businesses that require enterprise sales are probably the worst-performing category of seed investing. They encompass all of ed tech and health tech, the two worst industry verticals for VC; and Y Combinator has to focus on an index of B2B services for other programmers, because without that constraint nearly every "do what you are asked for" business would fail. Most of the IT projects businesses run internally fail!
In fact I think the idea you are selling is even harder: it is much harder to know whether the thing you are making makes sense and is good than it is to do B2B enterprise sales.
These strategies only really work for stream processing. You also want idempotent APIs, which won't really work with these approaches. There you'd probably go for the strategy the article passes over: an arbitrary string key that you just write down with some TTL.
I like the uuid v7 approach - being able to reject messages that have aged past the idempotency key retention period is a nice safeguard.
Here's what I don't understand about distributed systems: TCP works amazing, so why not use the same ideas? Every message increments a counter, so the receiver can tell the ordering and whether some message is missing. Why is this complicated?
Not trying to be snarky, but you should read the article and come back to discuss. This specific point is addressed.
Can't be bothered, I don't think it's that interesting.
TCP exists and it's amazing.
Multiple cores within a CPU also communicate perfectly.
So this is a solved problem. My suspicion is that the people who write articles on "distributed systems" aren't aware of what already exists.
TCP is a one to one relation, distributed systems are many to many.
You mean like UDP which also works amazing?
UDP gives you practically no guarantees about anything. Forget exactly-once processing: UDP doesn't give you any guarantees about delivery to begin with (whether delivery happens at all, the order of delivery, the absence of duplicates), nothing. These things are so far from comparable that this idea makes no sense even after trying really hard to steelman it.
UDP doesn’t guarantee exactly once processing.
It needs a single consumer to be that simple.
And a single producer! i.e. it breaks down if you add support for fault tolerance
> The more messages you need to process overall, the more attractive a solution centered around monotonically increasing sequences becomes, as it allows for space-efficient duplicate detection and exclusion, no matter how many messages you have.
It should be the opposite: with more messages you want to scale with independent consumers, and a monotonic counter is a disaster for that.
You also don’t need to worry about dropping old messages if you implement your processing to respect the commutative property.
You only need monotonicity per producer here, and even with independent producer and consumer scaling you can make that tracking tractable, as long as you can avoid every consumer needing to know about every producer while also having a truly huge cardinality of producers.
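The per-producer tracking described here can be sketched as a high-water mark per producer. This assumes each producer's sequence is monotonic and delivered in order (out-of-order gaps would need a more elaborate structure); the names are illustrative:

```python
class PerProducerDedup:
    """Space-efficient duplicate exclusion: one integer per producer,
    regardless of how many messages each producer has sent."""

    def __init__(self):
        self.high_water: dict[str, int] = {}  # producer_id -> last seq accepted

    def accept(self, producer_id: str, seq: int) -> bool:
        """Return True exactly once per (producer, seq); replays return False."""
        last = self.high_water.get(producer_id, -1)
        if seq <= last:
            return False  # duplicate or replay of an already-processed message
        self.high_water[producer_id] = seq
        return True
```

The state grows with the number of producers, not the number of messages, which is exactly why huge producer cardinality (with every consumer tracking every producer) is the failure mode to avoid.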
> It should be the opposite: with more messages you want to scale with independent consumers, and a monotonic counter is a disaster for that.
Is there any method for uniqueness testing that works after fan-out?
> You also don’t need to worry about dropping old messages if you implement your processing to respect the commutative property.
Commutative property protects if messages are received out of order. Duplicates require idempotency.
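The distinction shows up with two toy fold operations: addition is commutative but not idempotent, so a replayed message corrupts the result, while set union is both, so neither reordering nor replay matters:

```python
# Addition: commutative (order-safe) but NOT idempotent (replay-unsafe).
total = 0
for amount in [10, 5, 10]:  # the "10" message was delivered twice
    total += amount
# total is 25, not the intended 15

# Set union: commutative AND idempotent, so duplicates are harmless.
seen: set[str] = set()
for msg_id in ["a", "b", "a"]:  # "a" delivered twice
    seen |= {msg_id}
# seen is {"a", "b"} regardless of order or duplicates
```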