To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!
https://turbopuffer.com/blog/turbopuffer
Interesting that what’s basically an ad is the top comment - it’s not like this is open source or anything - can’t even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, to say nothing of OpenSearch, which also supports use of S3.
Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)
Worked out how? There’s no implementation. It’s just conjecture.
Pretty much all other S3 implementations (including open source ones) support this or equivalent primitives, so this is great for interoperability with existing implementations.
It's also possible to enforce the use of conditional writes: https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...
My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressable storage.
S3 has supported SHA-256 as a checksum algo since 2022. You can calculate the hash locally and then specify that hash in the PutObject call. S3 will calculate the hash and compare it with the hash in the PutObject call and reject the Put if they differ. The hash and algo are then stored in the object's metadata. You simply also use the SHA-256 hash as the key for the object.
https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
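A minimal sketch of the approach described above, in Python. It only builds the request arguments, so it runs without AWS credentials. The helper name `content_addressed_put_args` and the bucket name are hypothetical; `Bucket`, `Key`, `Body` and `ChecksumSHA256` are the parameter names boto3's `put_object` accepts, and the checksum is passed base64-encoded rather than as hex.

```python
import base64
import hashlib


def content_addressed_put_args(bucket: str, data: bytes) -> dict:
    """Build PutObject arguments whose key is the SHA-256 of the body."""
    digest = hashlib.sha256(data).digest()
    return {
        "Bucket": bucket,
        # Hex digest used as the object name (the content address).
        "Key": digest.hex(),
        "Body": data,
        # Base64 digest; S3 recomputes it and rejects the PUT on mismatch.
        "ChecksumSHA256": base64.b64encode(digest).decode("ascii"),
    }


args = content_addressed_put_args("example-bucket", b"hello")
print(args["Key"])
# With boto3 installed and configured, this could be sent as:
#   boto3.client("s3").put_object(**args)
```

Note that S3 checks the body against the supplied checksum, not against the key, so the "name matches hash" rule is still something the uploader has to get right.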
Unfortunately, for a multi-part upload it isn't a hash of the total object, it is a hash of the hashes for each part, which is a lot less useful. Especially if you don't know how the file was partitioned during upload.
And even if it was for the whole file, it isn't used for the ETag, so it can't be used for conditional PUTs.
I had a use case where this looked really promising, then I ran into the multipart upload limitations, and ended up using my own custom metadata for the sha256sum.
That's interesting. Would you want it to be something like a bucket setting, like "any time an object is uploaded, don't let an object write complete unless S3 verifies that a pre-defined hash function (like SHA256) is called to verify that the object's name matches the object's contents?"
You can already put with a sha256 hash. If it fails it just returns an error.
Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?
> Is there any reason you can't enforce that restriction on your side?
I'd like to set IAM permissions for a role, so that the role can add objects to the content-addressable store, but only if their name matches the hash of their content.
> Or are you saying you want S3 to automatically set the name for you based on the hash?
I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.
I think you can presign PutObject calls that validate a particular SHA-256 checksum. An API endpoint, e.g. in a Lambda, can effectively enforce this rule. It unfortunately won’t work on multipart uploads except on individual parts.
The hash of multipart uploads is simply the hash of all the part hashes. I've been able to replicate it.
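A rough sketch of that "hash of the part hashes" calculation, in Python. It assumes the composite value is the base64 SHA-256 of the concatenated raw per-part digests followed by `-<part count>`, which is my understanding of how S3 reports multipart checksums; the function names are made up for illustration, and the tiny part sizes are only for the demo (real multipart parts are much larger).

```python
import base64
import hashlib


def split_parts(data: bytes, part_size: int) -> list[bytes]:
    """Cut data into consecutive parts of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def composite_sha256(parts: list[bytes]) -> str:
    """SHA-256 over the concatenated per-part digests, plus the part count."""
    part_digests = b"".join(hashlib.sha256(p).digest() for p in parts)
    combined = hashlib.sha256(part_digests).digest()
    return f"{base64.b64encode(combined).decode('ascii')}-{len(parts)}"


data = bytes(range(256)) * 64  # 16 KiB of sample data
whole = base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")
print(whole)                                       # hash of the full object
print(composite_sha256(split_parts(data, 4096)))   # 4 parts
print(composite_sha256(split_parts(data, 8192)))   # 2 parts, a different value
```

The three printed values all differ, which is the practical problem: without knowing the part boundaries you cannot reproduce the checksum from the file alone.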
Could you use a meta field from the object and save the hash in it, running a compare from it?
That will probably never happen because of the fundamental nature of blob storage.
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each can see its own block, but not any other block.
Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.
What could work would be to use a Merkle tree hash construction where some of the subdivision boundaries match the block sizes.
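To make the Merkle idea concrete, here is a small Python sketch. It is not anything S3 does; the function name, the binary tree shape, and the `0x00`/`0x01` leaf and node prefixes (borrowed from the style used in certificate transparency logs) are all assumptions for illustration. The point is that each leaf hash needs only its own block, and only the short digests have to be combined centrally.

```python
import hashlib


def merkle_root(blocks: list[bytes]) -> bytes:
    """Root of a binary Merkle tree over SHA-256 leaf hashes."""
    if not blocks:
        return hashlib.sha256(b"").digest()
    # Each leaf hash depends on one block only, so block servers could
    # compute these independently and in parallel.
    level = [hashlib.sha256(b"\x00" + block).digest() for block in blocks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                pair = b"\x01" + level[i] + level[i + 1]
                next_level.append(hashlib.sha256(pair).digest())
            else:
                next_level.append(level[i])  # odd node is promoted unchanged
        level = next_level
    return level[0]


blocks = [b"block-0", b"block-1", b"block-2"]
print(merkle_root(blocks).hex())
```

The resulting root is not the plain SHA-256 of the object, so a client would have to compute the same tree with the same block boundaries to compare.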
Why would you PUT an object, then download it again to a central server in the first place? If a service is accepting an upload of the bytes, it is already doing a pass over all the bytes anyway. It doesn't seem like a ton of overhead to calculate SHA256 in 4096-byte chunks as the upload progresses. I suspect that sort of calculation would happen anyways.
You're right, and in fact S3 does this with the `ETag:` header… in the simple case.
S3 also supports more complicated cases where the entire object may not be visible to any single component while it is being written, and in those cases, `ETag:` works differently.
> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
> * If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption. If an object is larger than 16 MB, the AWS Management Console will upload or copy that object as a Multipart Upload, and therefore the ETag will not be an MD5 digest.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.h...
S3 supports multipart uploads which don’t necessarily send all the parts to the same server.
Why does it matter where the bytes are stored at rest? Isn't everything you need for SHA-256 just the results of the SHA-256 algorithm on every 4096-byte block? I think you could just calculate that as the data is streamed in.
The data is not necessarily "streamed" in! That's a significant design feature to allow parallel uploads of a single object using many parts ("blocks"). See: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMu...
You have just re-invented IPFS! https://en.m.wikipedia.org/wiki/InterPlanetary_File_System
Why does the architecture of blob storage matter? The hash can be calculated as data streams in for the first write, before data gets dispersed into multiple physically stored blocks.
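For the single-stream case this is easy to show. The Python sketch below (the function name and chunk size are arbitrary choices) feeds a stream through one `hashlib` object chunk by chunk and gets the same digest as hashing the whole payload at once. It relies on every byte passing through the same hasher in order, which is exactly what parallel multipart uploads do not guarantee.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 4096) -> str:
    """Hash a file-like object incrementally, one chunk at a time."""
    hasher = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    return hasher.hexdigest()


payload = b"example upload body " * 1000
streamed = sha256_of_stream(io.BytesIO(payload))
one_shot = hashlib.sha256(payload).hexdigest()
print(streamed == one_shot)
```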
It is common to use multipart uploads for large objects, since this both increases throughput and decreases latency. Individual part uploads can happen in parallel and complete in any sequence. There's no architectural requirement that an entire object pass through a single system on either S3's side or on the client's side.
Isn't that the point of the metadata? Calculate the hash ahead of time and store it in the metadata as part of the atomic commit for the blob (at least for S3).
Be still my beating heart. I have lived to see this day.
Genuinely, we've wanted this for ages and we got half way there with strong consistency.
Might finally be possible to do this on S3: https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex
So... given CAP, which one did they give up?
I thought they had now implemented optimistic locking to coordinate concurrent writes. How does that change anything in CAP?
I’d wager that the algorithm is slightly eager to throw a consistency error if it’s unable to verify across partitions. Since the caller is naturally ready for this error, it’s likely not a problem. So in short it’s the P :)
Shouldn't that be the A then? Since the network partition is still there but availability is non-guaranteed.
Yes, definitely. Good point (I was knee jerk assuming the A is always chosen and the real “choice” is between C and P).
https://tqdev.com/2024-the-p-in-cap-is-for-performance is a really interesting take on this as a response to https://blog.dtornow.com/the-cap-theorem.-the-bad-the-bad-th... - essentially, the only way to get CA is if you're willing to say that every request will succeed eventually, but it might take an unbounded amount of time for partitions to heal, and you have to be willing to wait indefinitely for that to happen. Which can indeed make sense for asynchronous messaging, but not for real-time applications as we think about them in the modern day. In practice, if you're talking about CAP for high-performance systems, you're choosing either CP or AP.
Well, P isn't really much of a choice, I don't think you can opt out of acts of god.
You can design to minimize P, though. For instance, if you have all the services running on the same physical box, and make people enter the room to use it instead of over the Internet, "partition" becomes much less likely. (This example is a bit silly.)
But you're right, if you take a broad view of P, the choice is really between consistency and availability.
A tiny bit of availability, unnoticeable at web scale.
Noting that Azure Blob storage supports e-tag / optimistic controls as well (via If-Match conditions)[1], how does this differ? Or is it the same feature?
[1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...
It's the same feature. Google Cloud Storage has it too: https://cloud.google.com/storage/docs/request-preconditions#...
s3fs's https://github.com/fsspec/s3fs/pull/917 was in response to the IfNoneMatch feature from the summer. How would people imagine this new feature being surfaced in a filesystem abstraction?
This combined with the read-after-write consistency guarantee is a perfect building block (pun intended) for incremental append only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
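One way to picture the multi-writer log: each record is its own object, and a writer claims the next sequence number with a create-if-absent PUT. The Python sketch below uses an in-memory stand-in (`FakeBucket`, `put_if_absent`, and the `wal/` key layout are all hypothetical) where a real client would send `If-None-Match: *`. It is a toy under those assumptions, not a production WAL.

```python
import threading


class FakeBucket:
    """In-memory stand-in for an object store with create-if-absent writes."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, body: bytes) -> bool:
        # Models a PUT with `If-None-Match: *`: succeeds only for a new key.
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = body
            return True

    def keys(self) -> list[str]:
        with self._lock:
            return sorted(self._objects)


def append(bucket: FakeBucket, record: bytes, start: int = 0) -> int:
    """Claim the first free sequence number at or after start; return it."""
    seq = start
    while not bucket.put_if_absent(f"wal/{seq:010d}", record):
        seq += 1  # slot already taken by another writer; try the next one
    return seq


bucket = FakeBucket()
writers = [
    threading.Thread(target=append, args=(bucket, f"writer-{i}".encode()))
    for i in range(8)
]
for t in writers:
    t.start()
for t in writers:
    t.join()
print(bucket.keys())
```

No two writers can end up with the same slot, and the log has no gaps, because a writer only moves past a sequence number once it has seen that number taken.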
Rename for objects and “directories” also. Atomic.
Ironically, with this and Lambda you could make a serverless SQLite by mapping pages to objects, using HTTP range reads to read the DB and Lambda to translate queries into writes to the appropriate pages via CAS. Prior to this it would require a server to handle concurrent writers, making the whole thing a nonstarter for “serverless”.
Too bad performance would be terrible without a caching layer (EBS).
For read-heavy workloads, you could cache the results at CloudFront. Maybe we will someday see WordPress-on-Lambda-to-SQLite-over-S3.
I feel dumb for asking this, but can someone explain why this is such a big deal? I’m not quite sure I am grokking it yet.
If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.
As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:
- Download the current database copy
- Perform your write locally
- Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
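The steps above can be sketched in Python against an in-memory stand-in for the bucket. Everything here is hypothetical scaffolding: `FakeObjectStore`, `put_if_match` and `PreconditionFailed` model a conditional PUT and its 412 response, and the ETag is a simple version counter rather than the content hash S3 would return. The "database" is just a counter stored as bytes.

```python
import threading


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 Precondition Failed response."""


class FakeObjectStore:
    """In-memory, single-object stand-in for a bucket with conditional writes."""

    def __init__(self, body: bytes):
        self._body = body
        self._version = 0
        self._lock = threading.Lock()

    def get(self) -> tuple[bytes, str]:
        with self._lock:
            return self._body, str(self._version)

    def put_if_match(self, body: bytes, etag: str) -> str:
        # Compare and swap happen under one lock, so they are atomic.
        with self._lock:
            if str(self._version) != etag:
                raise PreconditionFailed(etag)
            self._body = body
            self._version += 1
            return str(self._version)


def update(store: FakeObjectStore, mutate) -> int:
    """Apply mutate() with optimistic retries; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        body, etag = store.get()                # download the current copy
        new_body = mutate(body)                 # perform the write locally
        try:
            store.put_if_match(new_body, etag)  # conditional upload
            return attempts                     # success: transaction done
        except PreconditionFailed:
            continue                            # failure: start over


def increment(body: bytes) -> bytes:
    return str(int(body) + 1).encode()


store = FakeObjectStore(b"0")


def worker() -> None:
    for _ in range(50):
        update(store, increment)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.get()[0])  # b'400': none of the 8 x 50 increments is lost
```

Under contention the retries pile up, which is the "horribly inefficient" part, but no write is ever silently overwritten.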
The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, if I have a 4tb file in a multi part upload, would I only know it would fail as soon as the whole file is uploaded?)
(I assume) it will fail if the ETag doesn't match, the instant it gets the header.
The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update in memory and upload it to S3 with the ETag of the version I have and tell it to only commit if that is the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.
I imagine, for it to make sense, that the comparison is done at the last possible moment, before atomically swapping the file contents.
Practically, they could do both: Do an early reject of a given POST in case the ETag does not match, but re-validate this just before swapping out the objects (and committing to considering the given request as the successful one globally).
That said, I'm not sure if common HTTP libraries look at response headers before they're done posting a request body, or if that's even allowed/possible in HTTP? It seems feasible at first glance with chunked encoding, at least.
Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with Expect 100-continue in the requests, could enable just that and avoid an extra GET with If-Match.
I can imagine it might be useful to make this a choice for databases with high frequency small swaps and occasional large ones.
1) default, load-compare-&-swap for small fast load/swaps.
2) optional, compare-load-&-swap to allow a large load to pass its compare, and cut in front of all the fast small swaps that would otherwise create an un-hittable moving target during its long loads for its own compare.
3) If the load itself was stable relative to the compare, then it could be pre-loaded and swapped into a holding location, followed by as many fast compare-&-swaps as needed to get it into the right location.
When you upload a change you can know you're not clobbering changes you never saw.
If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
MD5 hash collisions are unlikely to happen at random. The defect was that you can make it happen purposefully, making it useless for security.
Sure, but theoretically you could have a system where a distributed log of user generated content is built via this CAS/MD5 primitive. A malicious actor could craft the data such that entries are dropped.
The default ETag is used to detect bit errors and MD5 is fine for that. S3 does support using SHA256 instead.
I can't wait to see what abomination Corey Quinn can come up with now given this new primitive! (see previous work abusing Route53 as a database: https://www.lastweekinaws.com/blog/route-53-amazons-premier-...)
So...are we closer to getting to use S3 as a...you guessed it...a database? With CAS, we are probably able to get a basic level of atomicity, and S3 itself is pretty durable, now we have to deal with consistency and isolation...although S3 branded itself as "eventually consistent"...
People who want all those features use something like Delta Lake on top of object storage.
There was a great deal of interest in gossip protocols, eventual consistency, and such at Amazon in the mid oughts. So much so that they hired a certain Cornell professor along with the better part of his grad students to build out those technologies.
Edit: Fun fact: the engineer that was instrumental to making it happen caught some heat for collecting so many referral bonuses. Rumor was that some folks on the recruiting team were mad an engineer was making more from recruiting than they were.
Finally. GCP has had this for a long time. Years ago I was surprised S3 didn’t.
GCS is just missing x-amz-copy-source-range in my book.
Can we have this Google?
…
Please?
GCP still doesn't have triggers out of beta last time I checked (which was a while ago).
Gmail was in beta for five years, I don't think that label really means anything.
It means that Google doesn't want to offer an SLA
Does this mean, in theory we will be able to manage multiple concurrent writes/updates to s3 without having to use new solutions like Regatta[1] that was recently launched?
https://news.ycombinator.com/item?id=42174204
First thing I thought when I saw the headline was "oh! I should tell Sirupsen"
[rejected] error: failed to push some refs to remote repository
Finally we can have this with s3 :)
I implemented that extension in R2 at launch IIRC. Thanks for catching up & helping move distributed storage applications a meaningful step forward. Intended sincerely. I'm sure adding this was non-trivial for a complex legacy codebase like that.
Could anybody explain for the uninitiated?
It ensures that when you try to upload (or “put”) a new version of a file, the operation only succeeds if the file on the server still has the exact version (ETag) you specify. If someone else has updated the file in the meantime, your upload is blocked to prevent overwriting their changes.
This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.
This uses the same mechanism as HTTP's `If-None-Match` header, so it's easier to implement and learn.
good example of how a simple feature on the surface (a header comparison) requires tremendous complexity and capacity on the backend.
S3 is rated as "durable" as opposed to "best effort." It has lots of interesting guarantees as a result.
Also they are faithful to their consistency commitments
bender_neat.gif