> Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.
> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?
> why is the file API so hard to use that even experts make mistakes?
I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.
I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make it extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.
Not only that, but the POSIX file API also assumes that NFS is a thing but NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby and bath water situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.
The whole software ecosystem is built on bubblegum, tape, and prayers.
What aspects of NFS do you think break half of the important guarantees of a file system?
Well, at least O_APPEND, O_EXCL, O_SYNC, and flock() aren't guaranteed to work (although they can with recent versions as I understand it).
UID mapping causing read() to return -EACCES after open() succeeds breaks a lot of userland code.
Lack of inotify support is one that has annoyed me in the past. It not only breaks some desktop software, but it also should be possible for NFS to support (after all, the server sees the changes and could notify clients).
Thanks for this, it's helpful. Totally heard about O_APPEND and read() returning -EACCES. The other ones, I agree, should be fixed in later versions of the Linux kernel/NFS client.
Just ran into this one recently trying to replace Docker w/ Podman for a CI/CD runner. Before anyone protests: we have very strong, abnormal requirements on my project preventing most saner architectures. It wasn't the root cause, but the failure behavior was weird due to the behavior you just described.
POSIX is also so old and essential that it's hard to imagine an alternative.
Not really, there's been lots of APIs that have improved on the POSIX model.
The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).
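A minimal sketch of that writeFileAtomic pattern on top of plain POSIX, just for concreteness (the function name is made up, error handling is simplified, and the file is assumed to live in the current directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: readers of `path` see either the old contents or
   the new contents, never a partial write. */
int write_file_atomic(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* data durable before the rename */
    if (close(fd) != 0) return -1;

    if (rename(tmp, path) != 0) return -1;         /* the atomic switch-over */

    int dirfd = open(".", O_RDONLY);               /* persist the rename itself */
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}

The fsync on the parent directory is the step that's easy to forget; without it the rename itself may not survive a crash.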
I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
> Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly.
int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0666);
// effects of calls to write(2)/etc. are invisible through any other file description
// until the close(2) is called on all descriptors to this file description.
close(fd);
So now you can watch for e.g. either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN), it doesn't matter, you'll never see partial updates... would be nice!
Surely this can't always be true? What happens when a lot of data is written and exceeds the dirty threshold?
It gets written on the disk but into different inodes, I imagine.
It's not hard to design a less bug-prone API that would enable you to do everything the POSIX file API permits and admits equally-high-performance implementations. But making that new API a replacement for the POSIX API would require rewriting essentially all of the software that somebody cares about to use your new, better API instead of the POSIX API. This is probably only feasible in practice for small embedded systems with a fairly small universe of software.
You could do a phased transition, where both the legacy posix api and the new api are available. This has already happened with a lot of the old C standard library. Old, unsafe functions like strcpy were gradually replaced by safer alternatives like strncpy.
Database developers don’t want the complexity or poor performance of posix. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.
There are two serious factual errors in your comment:
- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.
- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null, and it can be slower by orders of magnitude. This has been true since it was introduced in the 1970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)
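A small illustration of that missing-terminator problem (buffer sizes invented for the example; snprintf() shown only as one standard way to get a guaranteed terminator):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[8];
    const char *src = "0123456789";        /* longer than dst */

    strncpy(dst, src, sizeof dst);         /* copies 8 bytes and writes no '\0' */
    /* printf("%s", dst) here would read past the end of dst */

    snprintf(dst, sizeof dst, "%s", src);  /* truncates and always terminates */
    printf("%s\n", dst);                   /* prints "0123456" */
    return 0;
}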
You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.
The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.
Yeah, I understand that those methods are still available. But their use is heavily discouraged in new software and a lot of validators & sanitisers will warn if your programs use them. Software itself has largely slowly moved to using the newer, safer methods even though the old methods were never taken away.
I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?
Another commenter says NVMe doesn’t support it natively but I bet hardware vendors would add hardware support if Linux supported it and adding barrier support to their hardware would measurably improve the performance of their devices.
Sure, adding that functionality to NVMe would be easy; there are sufficient provisions around for adding such support. For example, a global flag whose support is communicated and which can then be turned on by the host to cause the very same normal flush opcode to now also guarantee a pipelined write barrier (while retaining the flush-write-back-caches-before-reporting-completion-of-this-submitted-IO-operation behavior).
The reason it hasn't yet been supported, btw, is that they explicitly wanted to allow fully parallel processing of commands in a queue, at least for submissions that concurrently exist in the command queue. In practice I don't see why this would have to be enforced to such an extent, as the only reason for out-of-order processing I can think of is that the auxiliary data of a command is physically located in host memory and the DMA reads across PCIe from the NVMe controller to the host memory happen to complete out-of-order for host DRAM controller/pattern reasons.
Thus it might be something you'd not want to turn on without using a controller memory buffer (where you can mmap some of the DRAM on the NVMe device into host memory, write your full-detail commands directly to this across PCIe, and keep the NVMe controller from having to first send a read request across PCIe in response to you ringing its doorbell: instead it can directly read from its local DRAM when you ring the doorbell).
Here is your barrier syscall:
https://www.man7.org/linux/man-pages/man2/sync.2.html
That flushes, just like the fsync() system call mentioned in Luu's post.
It actually writes dirty data and then flushes.
That said, IO barriers in storage are typically synonymous with flushes. For example, the ext4 nobarrier mount option disables flushes.
It sounds like you aren't very familiar with C; for example, C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
Unless you mean memcpy(), there is in fact no safer alternative function in the C standard for strcpy(); software has not largely moved to not using strcpy() (plenty of new C code uses it); and most validators and sanitizers do not emit warnings for strcpy(). There is a more extensive explanation of this at https://software.codidact.com/posts/281518. GCC has warnings for some uses of strcpy(), but only those that can be statically guaranteed to be incorrect: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
Newer, safer alternatives to strcpy() include strlcpy() and strscpy() (see https://lwn.net/Articles/659818/), neither of which is in Standard C yet. Presumably OpenBSD has some sort of validator that recommends replacing strcpy() with strlcpy(), which is licensed such that you can bundle it with your program. Visual C++ will invite you to replace your strcpy() calls with the nonstandard Microsoft extension strcpy_s(), thus making your code nonportable and, as it happens, also buggy. An incompatible version of strcpy_s() has been added as an optional annex to the C11 standard. https://nullprogram.com/blog/2021/07/30/ gives extensive details, summarized as "there are no correct or practical implementations". The Linux kernel's checkpatch.pl will invite you to replace calls to strcpy() with calls to the nonstandard Linux/BSD extension strscpy(), but it's a kernel-specific linter.
So there are not literally zero validators and sanitizers that will warn on all uses of strcpy() in C, but most of them don't.
— ⁂ —
I don't know enough about the barrier()/osync() proposal to know why it hasn't been adopted, and obviously neither do you, since you can't know anything significant about Linux kernel internals if you think that C has methods or that strncpy() is a safer alternative to strcpy().
But I can speculate! I think we can exclude the following possibilities:
- That the paper, which I haven't read much of, just went unnoticed and nobody thought of the barrier() idea again. Luu points out that it's a sort of obvious idea for kernel developers; Chidambaram et al. ("Optimistic Crash Consistency") weren't even the first ones to propose it (and it wasn't even the main topic of their paper); and their paper has been cited in hundreds of other papers, largely in systems software research on SSDs: https://scholar.google.com/scholar?cites=1238063331053768604...
- That it's a good idea in theory, but implementing even a research prototype is too much work. Chidambaram et al.'s code is available at https://github.com/utsaslab/optfs, and it is of course GPLv2, so that work is already done for you. You can download a VM image from https://research.cs.wisc.edu/adsl/Software/optfs/ for testing.
- That authors of databases don't care about performance. The authors of SQLite, which is what Chidambaram et al. used in their paper, dedicate a lot of effort to continuously improving its performance: https://www.sqlite.org/cpu.html and it's also a major consideration for MariaDB and PostgreSQL.
- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.
- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.
So what does that leave?
Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.
Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.
In particular, maybe it uses too much CPU.
Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)
Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.
If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.
Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".
As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.
> C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
This is an overly pedantic, ungenerous interpretation of what I wrote.
First, fine - you can argue that C has functions, not methods. But eh.
Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.
Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.
I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.
The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
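A toy sketch of that ordering rule, just to make the proposal concrete (hypothetical types, not kernel code; and whether a device "completion" actually means the data is durable is exactly the objection raised below):

#include <stdbool.h>
#include <stddef.h>

/* Hold back requests submitted after a barrier until every request
   submitted before it has been reported complete by the device. */
struct queue {
    size_t inflight_before_barrier;  /* pre-barrier requests still in flight */
    bool   barrier_pending;
};

/* Decide whether a newly submitted request may be sent to the device. */
bool can_dispatch(const struct queue *q, bool request_is_after_barrier)
{
    if (request_is_after_barrier && q->barrier_pending)
        return q->inflight_before_barrier == 0;  /* wait for older I/O */
    return true;
}

/* Called on each completion of a pre-barrier request. */
void on_pre_barrier_completion(struct queue *q)
{
    if (q->inflight_before_barrier > 0 && --q->inflight_before_barrier == 0)
        q->barrier_pending = false;              /* barrier has drained */
}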
That would not work as you describe it. The device will return completion upon the writes reaching its cache. You need a flush to ensure that the data reaches stable storage.
You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:
What you seem to want is FreeBSD's UFS2 Softupdates, which uses force unit access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 softupdates does not actually do anything to protect data when fsync(2) is called, if this mailing list email is accurate:
> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.
That said, avoiding flushes for a fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy because the device will keep reordering them to the end of the queue).
Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).
Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:
> make whole file read/writes atomic with a copy-on-write model,
I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?
> eliminate whole classes of filesystem bugs pretty quickly.
Block level deduplication is notoriously difficult.
> where the only kind of write allowed is to atomically append a chunk of data to the file
Which sounds good until you think about the complications involved in block oriented storage medium. You're stuck with RMW whether you think you're strictly appending or not.
It doesn’t have to be one or the other. Developers could decide by passing flags to open.
But even then, doing atomic writes of multi gigabyte files doesn’t sound that hard to implement efficiently. Just write to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.
The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.
It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.
That is not what an atomic write() function does and we are talking about APT, not databases.
If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.
> Databases implemented atomic transactions in the 70s.
And they have deadlocks as a result, which there is no good easy solution to (generally we work around by having only one program access a given database at a time, and even that is not 100% reliable).
Eh. Deadlocks can be avoided if you don’t use sql’s exact semantics. For example, foundationdb uses mvcc such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.
It works great in practice, even with a lot of concurrent clients. (iCloud is all built on foundationdb).
Hold & lock is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.
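A bare-bones sketch of that optimistic, retry-on-conflict style (an in-memory toy, nothing like FoundationDB's real API; a real version would need atomics or a CAS for the commit step):

#include <stdbool.h>

/* Toy optimistic concurrency: no locks are held while the transaction
   runs, so there is nothing to deadlock on; a conflicting commit just
   forces a retry. */
struct record { long version; int value; };

static bool try_commit(struct record *r, long seen_version, int new_value)
{
    if (r->version != seen_version)  /* someone else committed first */
        return false;                /* caller retries with fresh data */
    r->value = new_value;
    r->version++;
    return true;
}

void increment(struct record *r)
{
    for (;;) {
        long snapshot = r->version;  /* read a consistent snapshot */
        int  next     = r->value + 1;
        if (try_commit(r, snapshot, next))
            return;
        /* conflict: loop and recompute against the new version */
    }
}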
This is kind of an interesting thought that more mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
> Developers could decide by passing flags to open.
Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.
> you’ll need enough free space to store both the old and new versions of your data.
The sacrifice is increased write wear on solid state devices.
> It would allow all sorts of useful programs to be written easily
Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too, in that, every FS on a multi user system is effectively a "distributed system." It's not distributed for _redundancy_ but it doesn't eliminate the attendant challenges.
They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just to be able to close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support and their actual code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce that. I do not use Dropbox, so someone else would need to test to see if they actually enforce that if curious enough.
Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?
As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2gb write would still write 2gb (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?
And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.
Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.
Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.
This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.
Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.
> why is the file API so hard to use that even experts make mistakes?
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
Jeremy Allison tracked down why POSIX standardized this behavior[0].
The reason is historical and reflects a flaw in the POSIX standards process, in my opinion, one that hopefully won't be repeated in the future. I finally tracked down why this insane behavior was standardized by the POSIX committee by talking to long-time BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork). As he recalls, AT&T brought the current behavior to the standards committee as a proposal for byte-range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large database vendors such as Oracle, Sybase and Informix (at the time). All of these companies did their own byte range locking within their own applications, none of them depended on or needed the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care". In the absence of any strong negative feedback on a proposal, the committee added it "as-is", and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.
The most egregious part of it for me is that if I open and close a file I might be canceling some other library's lock that I'm completely unaware of.
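Roughly this, assuming I have the semantics right (error handling omitted, path made up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0666);

    /* Take a POSIX byte-range lock on the whole file. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    fcntl(fd1, F_SETLK, &fl);

    /* Some library deep inside the same process opens and closes
       the same file for its own reasons... */
    int fd2 = open("/tmp/lockdemo", O_RDONLY);
    close(fd2);  /* ...and this close silently releases the lock taken on fd1. */

    /* Linux's F_OFD_SETLK ties the lock to the open file description
       rather than the process, which avoids this particular trap. */
    return 0;
}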
I resisted using them in my SQLite VFS, until I partially relented for WAL locks.
I wish more platforms embraced OFD locks. macOS has them, but hidden. illumos fakes them with BSD locks (which is worse, actually). The BSDs don't add them. So it's just Linux, and Windows with sane locking. In some ways Windows is actually better (supports timeouts).
> Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
By the way, LMDB's main developer Howard Chu responded to the paper. He said,
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
This assumption was wrong for Intel Optane memory. Power loss could cut the data stream anywhere in the middle. (Note: the DIMM nonvolatile memory version)
Consumer Optane was not "power loss protected"; that is very different from not honoring a requested synchronous write.
The crash-consistency problem is very different from the durability of real synchronous writes. There are some storage devices which will lie about sync writes, sometimes hoping that a backup battery will allow them to complete those writes.
System crashes are inevitable; use things like write-ahead logs depending on need, etc. No storage API will get rid of all system crashes, and yes, even Apple games the system by disabling real sync writes, so that will always be a battle.
You're missing the point. GP was mentioning the common assumption that all systems in the last 30 years are sector-atomic under power loss condition. Either the sector is fully written or fully not written. Optane was a rare counter example, where sector can become partially written, thus not sector-atomic.
There are known cases where power loss during a write can corrupt previously written data (data at rest). This is not some rare occurrence. This is why enterprise flash storage devices have power loss protection.
I wish someone would sell an SSD that was at most a firmware update away between regular NVMe drive and ZNS NVMe drive.
The latter just doesn't leave much room for the firmware to be clever and just swallow data.
Maybe also add a pSLC formatting mode for a namespace so one can be explicit about that capability...
It just has to be a drive that's useable as a generic gaming SSD so people can just buy it and have casual fun with it, like they did with Nvidia GTX GPUs and CUDA.
Unfortunately manufacturers almost always prefer price gouging on features that "CuStOmErS aRe NoT GoInG tO nEeD". Is there even a ZNS device available for someone who isn't a hyperscale datacenter operator nowadays?
Either you ask a manufacturer like WD, or you go to ebay AFAIK.
That said, ZNS is actually something specifically about being able to extract more value out of the same hardware (as the firmware no longer causes write amplification behind your back), which in turn means that the value for such a ZNS-capable drive ought to be strictly higher than for the traditional-only version with the same hardware.
And given that enterprise SSDs seem to only really get value from an OEM's holographic sticker on them (compare almost-new-grade used prices for those with the sticker on them vs. the just plain SSD/HDD original model number, missing the premium sticker), besides the common write-back-emergency capacitors that allow a physical write-back cache in the drive to ("safely") claim write-through semantics to the host, it should IMO be in the interest of the manufacturers to push ZNS:
ZNS makes, for ZNS-appropriate applications, the exact same hardware perform better despite requiring less fancy firmware.
Also, especially, there's much less need for write-back cache as the drive doesn't sort individual random writes into something less prone to write amplification: the host software is responsible for sorting data together for minimizing write amplification (usually, arranging for data that will likely be deleted together to be physically in the same erasure block).
Also, I'm not sure how exactly "bad" bins of flash behave, but I'd not be surprised if ZNS's support for zones having less usable space than LBA/address range occupied (which can btw. change upon recycling/erasing the zone!) would allow rather poor quality flash to still be effectively utilized, as even rather unpredictable degradation can be handled this way.
Basically, due to Copy-on-Write storage systems (like, Btrfs or many modern database backends (specifically, LSM-Tree ones)) inherently needing some slack/empty space, it's rather easy to cope with this space decreasing as a result of write operations, regardless of if the application/user data has actually grown from the writes: you just buy and add another drive/cluster-node when you run out of space, and until then, you can use 100% of the SSDs flash capacity, instead of up-front wasting capacity just to never have to decrease the drive's usable capacity over the warranty period.
Give me, say, a Samsung 990 Pro 2 TB for 250 EUR but with firmware for ZNS-reformatting, instead of the 200 EUR MSRP/173 EUR Amazon.de price for the normal version.
Oh, and please let me use a decent portion of that 2 GB LPDDR4 as controller memory buffer at least if I'm in a ZNS-only formatting situation. It's after all not needed for keeping large block translation tables around, as ZNS only needs to track where physically a logical zone is currently located (wear leveling), and which individual blocks are marked dead in that physical zone (easy linear mapping between the non-contiguous usable physical blocks and the contiguous usable logical blocks). Beyond that, I guess technically it needs to keep track of open/closed zones and write pointers and filled/valid lengths.
Furthermore, I don't even need them to warranty the device lifespan in ZNS, only that it isn't bricked from activating ZNS mode. It would be nice to get as many drive-writes warranty as the non-ZNS version gets, though.
ZNS (Zoned Namespace) technology seems to offer significant benefits by reducing write amplification and improving hardware efficiency. It makes sense that manufacturers would push for ZNS adoption, as it can enhance performance without needing complex firmware. The potential for using lower-quality flash effectively is also intriguing. However, the market dynamics, like the value added by OEM stickers and the need for write-back capacitors, complicate things. Overall, ZNS appears to be a promising advancement for specific applications.
Really? A 512-byte sector could get partially written? Did anyone actually observe this, or was it just a case of Intel CYA saying they didn't guarantee anything?
Yes, really. "Crash-consistent data structures were proposed by enforcing cacheline-level failure-atomicity" see references in: https://doi.org/10.1145/3492321.3519556
> the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Yeah, sounds about right for quite a lot of C programmers, except for the "they commit to checking this in the future" part. I've gotten responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
And yet all of these systems basically work for day-to-day operations, and fail only under obscure error conditions.
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file halfway through a read? Sorry, don't support that. Remove power to the storage device in the middle of a sync operation? Sorry, don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.
> they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug.
This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.
Of course, it doesn't always make sense to do that, like the dropbox use case.
Before becoming too overconfident in SQLite note that Rebello et al. (https://ramalagappan.github.io/pdfs/papers/cuttlefs.pdf) tested SQLite (along with Redis, LMDB, LevelDB, and PostgreSQL) using a proxy file system to simulate fsync errors and found that none of them handled all failure conditions safely.
In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:
1. The device powering off during the middle of a write, and
2. The device running out of space during the middle of a write.
I'm pretty sure that's not where I originally saw his comments. I remember his criticisms being a little more pointed. Although I guess "This is a bunch of academic speculation, with a total absence of real world modeling to validate the failure scenarios they presented" is pretty pointed.
I believe it is impossible to prevent dataloss if the device powers off during a write. The point about corruption still stands and appears to be used correctly from what I skimmed in the paper. Nice reference.
> I believe it is impossible to prevent dataloss if the device powers off during a write.
Most devices write sectors atomically, and so you can build a system on top of that that does not lose committed data. (Of course if the device powers off during a write then you can lose the uncommitted data you were trying to write, but the point is you don't ever have corruption, you get either the data that was there before the write attempt or the data that is there after).
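As a sketch of that construction (layout and names invented for the example): put the new version in unused space, make it durable, and only then flip a single-sector commit record that points at it.

#include <stdint.h>
#include <unistd.h>

#define SECTOR 512

/* One-sector commit record: assuming the device writes a sector atomically,
   a reader always sees either the old pointer or the new one. */
struct commit_record {
    uint64_t current_offset;
    uint64_t length;
    char     pad[SECTOR - 2 * sizeof(uint64_t)];
};

int commit_new_version(int fd, const void *data, uint64_t len, uint64_t new_off)
{
    /* 1. Write the new version into free space; the old version is untouched. */
    if (pwrite(fd, data, len, (off_t)new_off) != (ssize_t)len) return -1;
    if (fsync(fd) != 0) return -1;                 /* new data is durable */

    /* 2. Flip the pointer with one sector-sized write at a fixed offset. */
    struct commit_record c = { .current_offset = new_off, .length = len };
    if (pwrite(fd, &c, SECTOR, 0) != SECTOR) return -1;
    return fsync(fd);                              /* commit record is durable */
}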
Only way I know of is if you have e.g. a RAID controller with a battery-backed write cache. Even that may not be 100% reliable but it's the closest I know of. Of course that's not a software solution at all.
That's uh, not running out of power in the middle of the write. That's having extra special backup power to finish the write. If your battery dies mid cache-write-out, you're still screwed.
I wonder if, in the Pillai paper, they tested the SQLite Rollback option with the default synchronous [1] (`NORMAL`, I believe) or with `EXTRA`. I'm thinking that it was probably the default.
I kinda think, and I could be wrong, that SQLite rollback would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=F_FULLFSYNC` on macOS [2]).
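For anyone who wants to try it, those settings are just pragmas; something like this (filename made up) selects rollback-journal mode with the strictest sync level:

#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("state.db", &db) != SQLITE_OK) return 1;

    /* Rollback journal (the non-WAL mode) with the strictest durability;
       fullfsync only has an effect on macOS. */
    sqlite3_exec(db, "PRAGMA journal_mode=DELETE;", 0, 0, 0);
    sqlite3_exec(db, "PRAGMA synchronous=EXTRA;",   0, 0, 0);
    sqlite3_exec(db, "PRAGMA fullfsync=ON;",        0, 0, 0);

    sqlite3_close(db);
    return 0;
}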
Although the conference this was presented at is platform-agnostic, the author is an expert on Linux, and the motivation for the talk is Linux-specific. (Dropbox dropping support for non-ext4 file systems)
The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.
"Getting windows source code under NDA" doesn't necessarily mean "can do research on it".
If you can't publish it, it's not research. If the source code is under NDA, then Microsoft gets the final say about whether you can publish or not, and if the result is embarrassing to Microsoft, I'm guessing it's "or not".
Certainly depends on which APIs you ultimately use as a developer, right? If it is .NET, they're super simple, and you can get IOCP for "free" and non-blocking async I/O is quite easy to implement.
I can't say the Win32 File API is "pretty", but it's also an abstraction, like the .NET File Class is. And if you touch the NT API, you're naughty.
On Linux and macOS you use the same API, just the backends are different if you want async (epoll [blocking async] on Linux, kqueue on macOS).
The windows APIs are certainly slower. Apart from IOCP I don't think they're that much different? Oh, and mandatory locking on executable images which are loaded, which has .. advantages and disadvantages (it's why Windows keeps demanding restarts)
ZFS on Linux unfortunately has a long standing bug which makes it unusable under load: https://github.com/openzfs/zfs/issues/9130. 5.5 years old, nobody knows the root cause. Symptoms: under load (such as what one or two large concurrent rsyncs may generate over a fast network - that's how I encountered it) the pool begins to crap out and shows integrity errors and in some cases loses data (for some users - it never lost data for me). So if you do any high rate copies you _must_ hash-compare source and destination. This needs to be done after all the writes are completed to the zpool, because concurrent high rate reads seem to exacerbate the issue. Once the data is at rest, things seem to be fine. Low levels of load are also fine.
That said, there are many others who stress ZFS on a regular basis and ZFS handles the stress fine. I do not doubt that there are bugs in the code, but I feel like there are other things at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly due to reasons external to ZFS (and sometimes, reasons internal to ZFS, such as data deduplication).
I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.
Regarding your comment - seems unlikely that it "affects Ubuntu less". I don't see why that would be the case - it's not like Ubuntu runs a heavily customized kernel or anything. And thanks for taking a look - ZFS is just the way things should be in filesystems and logical volume management, I do wish I could stop doing hash compares after large, high throughput copies and just trust it to do what it was designed to do.
Ubuntu kernels might have a different default IO elevator than proxmox kernels. If the issue is in the IO elevator (e.g. it is reordering in such a way that some IOs are delayed indefinitely before being sent to the underlying device) and the two use different IO elevators by default, then it would make sense why Ubuntu is not affected and proxmox is. There is some evidence for this in the comments as people suggest that the issue is lessened by switching to mq-deadline. That is why one of my questions asks what Linux IO elevator people’s disks are using.
The correct IO elevator to use for disks given to ZFS is none/noop as ZFS has its own IO elevator. ZFS will set the Linux IO elevator to that automatically on disks where it controls the partitioning. However, when the partitioning was done externally from ZFS, the default Linux elevator is used underneath ZFS, and that is never none/noop in practice since other Linux filesystems benefit from other elevators. If proxmox is doing partitioning itself, then it is almost certainly using the wrong IO elevator with ZFS, unless it sets the elevator to noop when ZFS is using the device. That ordinarily should not cause such severe problems, but it is within the realm of possibility that the Linux IO elevator being set by proxmox has a bug.
I suspect there are multiple disparate issues causing the txg_sync thread to hang for people, rather than just one issue. Historically, things that cause the txg_sync thread to hang are external to ZFS (with the notable exception of data deduplication), so it is quite likely that the issues are external here too. I will watch the thread and see what feedback I get from people who are having the txg_sync thread hang.
Thanks a lot for elaborating. I'm traveling at the moment, but I'm going to try reproducing this issue once I'm back in town. IIRC I did do partitioning myself, using GPT partition table and default partition settings in fdisk.
Upd mq-deadline for all drives seems to be `none` for me. OS is Ubuntu 22.04
> Upd mq-deadline for all drives seems to be `none` for me.
I am not sure what you mean by that. One possibility is that the ones who reported mq-deadline did better were on either kyber or bfq, rather than none. The none elevator should be best for ZFS.
The mq-deadline part is what confused me. That is a competing option for the Linux IO elevator that runs under ZFS. Anyway, I understand now and thanks for the data point. I added a list of questions to GitHub that could help narrow things down if you take time to answer them. I will be watching the GitHub thread regularly, so I will see it if you post the answers after you return from your travels.
The article is about the hardware and kernel level APIs used for interacting with storage. Everything else is by necessity built on top of that interface.
"fopen"? That is outdated stuff from a shitty ecosystem, and how do you think it's implemented?
I don't get it. The only times I've had problems with filesystem corruption in the past few decades was with a hardware problem, and said hardware was quickly replaced. FAT family has been perfectly fine while I've encountered corruption on every other FS including NTFS, exFAT, and the ext* family.
Meanwhile you can read plenty of stories of others having the exact opposite experience.
If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
> If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
Ponder this notion for a moment: there are problems within one's control and problems outside of one's control.
For example, we can't control the weather. If it snows three feet overnight you simply have to deal with the fact that you're not getting to work today.
Since we can't simply stop hardware from failing, we have to deal with the fact that hardware fails. Your seventeen redundant UPSes might experience a one in a trillion cascade failure. It might take the utility ten minutes longer to restore your power than you have onsite generation.
This is not a class of problem we can control or prevent. We fix these problems by building systems which withstand failures. You can't just will electrons out of the wall socket, but you can build a better disk or FS that corrupts less data when the electrons stop.
There was that time (2009 or so?) I wrote 2 million files to a single directory on NTFS and that filesystem was never the same again. It didn't seem to be a hardware problem. I used to be really careful to not put a crazy number of files in a directory on Linux and Windows storing them in subdirs like
b7/b74a/b74a56
where the digits are derived from a hash of the file name but lately I've had some NTFS volumes with a 1M file directory that seem to be OK.
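Something like this fan-out scheme, for concreteness (the hash choice and the two/four/six-digit split are arbitrary here):

#include <stdint.h>
#include <stdio.h>

/* FNV-1a, just as an example; any stable hash of the file name works. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* Turn a file name into a path like "b7/b74a/b74a56" so no single
   directory ends up with millions of entries. */
void fanout_path(const char *name, char *out, size_t outlen)
{
    uint32_t h = fnv1a(name);
    snprintf(out, outlen, "%02x/%04x/%06x",
             (unsigned)(h >> 24), (unsigned)(h >> 16), (unsigned)(h >> 8));
}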
Hardware problems also manifest in mysterious ways. On both Windows and MacOS I had computers that seemed to be OK until I did an OS update which caused enough IO that a failing HDD was pushed over the edge and the update failed; in one case I was able to roll back the update but not apply the update, in another case the machine was trashed. Careful investigation (like taking the disk out and inspecting it on another computer) revealed a hard drive error, although there was no clear indication of this in the UI, and the average person would blame the software update.
- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.
- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.
So what does that leave?
Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.
Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.
In particular, maybe it uses too much CPU.
Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)
Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.
If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.
Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".
As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.
> C doesn't have any methods at all, neither old methods nor newer methods. There's no such thing as a method in C. This would explain your mistakes in, for example, thinking that strncpy() was safer, or that strcpy() had been removed from the C standard.
This is an overly pedantic, ungenerous interpretation of what I wrote.
First, fine - you can argue that C has functions, not methods. But eh.
Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.
Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.
I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.
NVMe has no barrier that doesn't flush the pipeline/ringbuffer of IO requests submitted to it :(
The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
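A minimal sketch of the host-side ordering being described here, with hypothetical names (submit_to_device() and the queue structure are made up for illustration; this is not the Linux block layer's actual interface):

```c
/* Requests after a barrier are simply not sent to the device until
 * everything before the barrier has completed. */

#include <stdbool.h>
#include <stddef.h>

struct request {
    struct request *next;
    bool is_barrier;          /* ordering marker, carries no data */
    /* ... payload (sector, buffer, length) ... */
};

struct queue {
    struct request *head;     /* requests not yet sent to the device */
    size_t inflight;          /* sent to the device, completion not yet seen */
};

extern void submit_to_device(struct request *r);   /* assumed driver hook */

static void dispatch(struct queue *q)
{
    while (q->head != NULL) {
        if (q->head->is_barrier) {
            if (q->inflight > 0)
                return;                   /* earlier writes still in flight */
            q->head = q->head->next;      /* barrier satisfied, discard it  */
            continue;
        }
        struct request *r = q->head;
        q->head = r->next;
        q->inflight++;
        submit_to_device(r);
    }
}

/* Called from the device's completion callback for each finished request. */
static void on_completion(struct queue *q)
{
    q->inflight--;
    dispatch(q);              /* a pending barrier may now be crossable */
}
```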
That would not work as you describe it. The device will return completion upon the writes reaching its cache. You need a flush to ensure that the data reaches stable storage.
You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:
https://learn.microsoft.com/en-us/windows/win32/fileio/deplo...
What you seem to want is FreeBSD's UFS2 Softupdates, which uses force unit access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 softupdates does not actually do anything to protect data when fsync(2) is called, if this mailing list email is accurate:
https://lists.freebsd.org/pipermail/freebsd-fs/2011-November...
As pjd said:
> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.
That said, avoiding flushes for a fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy because the device will keep reordering them to the end of the queue).
Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).
Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:
https://github.com/openzfs/zfs/blob/34205715e1544d343f9a6414...
Writes on ZFS cease to be atomic around approximately 32MB in size if I read the code correctly.
> make whole file read/writes atomic with a copy-on-write model,
I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?
> eliminate whole classes of filesystem bugs pretty quickly.
Block level deduplication is notoriously difficult.
> where the only kind of write allowed is to atomically append a chunk of data to the file
Which sounds good until you think about the complications involved in block oriented storage medium. You're stuck with RMW whether you think you're strictly appending or not.
It doesn’t have to be one or the other. Developers could decide by passing flags to open.
But even then, doing atomic writes of multi-gigabyte files doesn't sound that hard to implement efficiently. Just write the data to disk first and update the metadata atomically at the end, or whenever you choose to as a programmer (a rough sketch follows after this comment).
The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.
It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.
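For whole files, that flow is already expressible with today's POSIX primitives; a hedged sketch (error handling trimmed, and write_file_atomic is my own name, not a standard API):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` atomically: readers see either the old contents or the new
 * contents, never a mix.  Assumes tmp_path is on the same filesystem and in
 * the same directory, so rename() can atomically swap the directory entry. */
int write_file_atomic(const char *path, const char *tmp_path,
                      const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

    /* Persist the data before publishing the name, otherwise a crash can
     * leave a zero-length or partial file behind the new name. */
    if (fsync(fd) != 0) { close(fd); return -1; }
    if (close(fd) != 0)
        return -1;

    if (rename(tmp_path, path) != 0)
        return -1;

    /* To survive a crash right after rename(), the directory entry itself
     * should also be fsync()ed via a descriptor on the containing directory. */
    return 0;
}
```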
Packages consist of multiple files. An atomic file write would not allow packages to be either installed or not installed by APT.
Atomicity could encompass a whole bunch of writes at once.
Databases implemented atomic transactions in the 70s. Let's stop pretending this is an unsolvable CS problem. It's not.
That is not what an atomic write() function does and we are talking about APT, not databases.
If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.
> Databases implemented atomic transactions in the 70s.
And they have deadlocks as a result, which there is no good easy solution to (generally we work around by having only one program access a given database at a time, and even that is not 100% reliable).
Eh. Deadlocks can be avoided if you don’t use sql’s exact semantics. For example, foundationdb uses mvcc such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.
It works great in practice, even with a lot of concurrent clients. (iCloud is all built on foundationdb).
Hold & lock is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.
This is kind of an interesting thought that more mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
It can also use ZFS to do this.
> Developers could decide by passing flags to open.
Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.
> you’ll need enough free space to store both the old and new versions of your data.
The sacrifice is increased write wear on solid state devices.
> It would allow all sorts of useful programs to be written easily
Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point too: every FS on a multi-user system is effectively a "distributed system." It's not distributed for _redundancy_, but it doesn't eliminate the attendant challenges.
Dropbox reversed its stance on this. It added support for ZFS, XFS, ecryptfs and btrfs:
https://help.dropbox.com/installs/system-requirements
They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just to be able to close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support and their actual code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce that. I do not use Dropbox, so someone else would need to test to see if they actually enforce that if curious enough.
Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?
As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2gb write would still write 2gb (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?
And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.
Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.
Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.
This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.
Just wait till he has to deal with raid controllers.
I use Plan 9 regularly and while its Unix heritage is there, it most certainly isn't Unix and completely does away with POSIX.
> POSIX fs APIs and associated semantics
Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.
> why the file API so hard to use that even experts make mistakes?
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
POSIX file locking is clearly modeled around whatever was simplest to implement, although it makes no sense at all.
Jeremy Allison tracked down why POSIX standardized this behavior[0].
> The reason is historical and reflects a flaw in the POSIX standards process, in my opinion, one that hopefully won't be repeated in the future. I finally tracked down why this insane behavior was standardized by the POSIX committee by talking to long-time BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork). As he recalls, AT&T brought the current behavior to the standards committee as a proposal for byte-range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large database vendors such as Oracle, Sybase and Informix (at the time). All of these companies did their own byte range locking within their own applications, none of them depended on or needed the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care". In the absence of any strong negative feedback on a proposal, the committee added it "as-is", and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.
[0] https://www.samba.org/samba/news/articles/low_point/tale_two...
The most egregious part of it for me is that if I open and close a file I might be canceling some other library's lock that I'm completely unaware of.
I resisted using them in my SQLite VFS, until I partially relented for WAL locks.
I wish more platforms embraced OFD locks. macOS has them, but hidden. illumos fakes them with BSD locks (which is worse, actually). The BSDs don't add them. So it's just Linux, and Windows with sane locking. In some ways Windows is actually better (supports timeouts).
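For reference, the Linux OFD variant looks roughly like this; a minimal sketch assuming a file named data.db already exists:

```c
/* OFD locks are tied to the open file description rather than the process,
 * so an unrelated close() elsewhere in the process can't silently drop them. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.db", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive lock            */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file   */
        .l_pid    = 0,         /* must be 0 for OFD locks   */
    };

    /* Unlike plain F_SETLK, this lock survives other descriptors to the
     * same file being closed by libraries in the same process. */
    if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
        perror("fcntl(F_OFD_SETLK)");
        return 1;
    }
    /* ... use the file ... */
    close(fd);   /* releases the lock, since this fd owns the description */
    return 0;
}
```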
> Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
Why does that seem more likely than the file system API simply not having been a major factor in the success or failure of OSes?
By the way, LMDB's main developer Howard Chu responded to the paper. He said,
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
This assumption was wrong for Intel Optane memory. Power loss could cut the data stream anywhere in the middle. (Note: the DIMM nonvolatile memory version)
Consumer Optane was not "power loss protected"; that is very different from not honoring a requested synchronous write.
The crash-consistency problem is very different from the durability problem of real synchronous writes. There are some storage devices which will lie about sync writes, sometimes hoping that a backup battery will allow them to complete those writes.
System crashes are inevitable; use things like write-ahead logs depending on need, etc. No storage API will get rid of all system crashes, and yes, even Apple games the system by disabling real sync writes, so that will always be a battle.
You're missing the point. GP was mentioning the common assumption that all systems in the last 30 years are sector-atomic under power loss condition. Either the sector is fully written or fully not written. Optane was a rare counter example, where sector can become partially written, thus not sector-atomic.
It is not rare for flash storage devices to lose data on power loss, even data that is FLUSH'd. See https://news.ycombinator.com/item?id=38371307
There are known cases where power loss during a write can corrupt previously written data (data at rest). This is not some rare occurrence. This is why enterprise flash storage devices have power loss protection.
See also: https://serverfault.com/questions/923971/is-there-a-way-to-p...
I wish someone would sell an SSD that was at most a firmware update away from switching between a regular NVMe drive and a ZNS NVMe drive. The latter just doesn't leave much room for the firmware to be clever and swallow data.
Maybe also add a pSLC formatting mode for a namespace so one can be explicit about that capability...
It just has to be a drive that's useable as a generic gaming SSD so people can just buy it and have casual fun with it, like they did with Nvidia GTX GPUs and CUDA.
Unfortunately manufacturers almost always prefer price gouging on features that "CuStOmErS aRe NoT GoInG tO nEeD". Is there even a ZNS device available for someone who isn't a hyperscale datacenter operator nowadays?
Either you ask a manufacturer like WD, or you go to ebay AFAIK.
That said, ZNS is actually something specifically about being able to extract more value out of the same hardware (as the firmware no longer causes write amplification behind your back), which in turn means that the value of such a ZNS-capable drive ought to be strictly higher than for the traditional-only version with the same hardware.
And given that enterprise SSDs seem to get most of their market value from an OEM's holographic sticker (compare almost-new-grade used prices for drives with the sticker against the same plain SSD/HDD model number without it), plus the common write-back-emergency capacitors that allow a physical write-back cache in the drive to ("safely") claim write-through semantics to the host, it should IMO be in the manufacturers' interest to push ZNS:
ZNS makes, for ZNS-appropriate applications, the exact same hardware perform better despite requiring less fancy firmware. Also, especially, there's much less need for write-back cache as the drive doesn't sort individual random writes into something less prone to write amplification: the host software is responsible for sorting data together for minimizing write amplification (usually, arranging for data that will likely be deleted together to be physically in the same erasure block).
Also, I'm not sure how exactly "bad" bins of flash behave, but I'd not be surprised if ZNS could let rather poor-quality flash still be used effectively. ZNS allows a zone to expose less usable space than the LBA/address range it occupies (and that amount can change when the zone is recycled/erased), so even fairly unpredictable degradation can be handled. Copy-on-write storage systems (like Btrfs, or many modern database backends, specifically LSM-tree ones) inherently need some slack/empty space anyway, so it's easy to cope with that space shrinking as a result of write operations, regardless of whether the application/user data has actually grown. You just buy and add another drive/cluster node when you run out of space, and until then you can use 100% of the SSD's flash capacity, instead of wasting capacity up front just so the drive's usable capacity never has to decrease over the warranty period.
That said: https://priceblaze.com/0TS2109-WesternDigital-Solid-State-Dr... claims (by part number) to be this model: https://www.westerndigital.com/en-ae/products/internal-drive... . That's about 150 $/TB. Refurbished; doesn't say how much life has been sucked out of them.
Give me, say, a Samsung 990 Pro 2 TB for 250 EUR but with firmware for ZNS-reformatting, instead of the 200 EUR MSRP/173 EUR Amazon.de price for the normal version.
Oh, and please let me use a decent portion of that 2 GB LPDDR4 as controller memory buffer at least if I'm in a ZNS-only formatting situation. It's after all not needed for keeping large block translation tables around, as ZNS only needs to track where physically a logical zone is currently located (wear leveling), and which individual blocks are marked dead in that physical zone (easy linear mapping between the non-contiguous usable physical blocks and the contiguous usable logical blocks). Beyond that, I guess technically it needs to keep track of open/closed zones and write pointers and filled/valid lengths.
Furthermore, I don't even need them to warranty the device lifespan in ZNS, only that it isn't bricked from activating ZNS mode. It would be nice to get as many drive-writes warranty as the non-ZNS version gets, though.
ZNS (Zoned Namespace) technology seems to offer significant benefits by reducing write amplification and improving hardware efficiency. It makes sense that manufacturers would push for ZNS adoption, as it can enhance performance without needing complex firmware. The potential for using lower-quality flash effectively is also intriguing. However, the market dynamics, like the value added by OEM stickers and the need for write-back capacitors, complicate things. Overall, ZNS appears to be a promising advancement for specific applications.
Really? A 512-byte sector could get partially written? Did anyone actually observe this, or was it just a case of Intel CYA saying they didn't guarantee anything?
Yes, really. "Crash-consistent data structures were proposed by enforcing cacheline-level failure-atomicity" see references in: https://doi.org/10.1145/3492321.3519556
That reference appears to link to a DOI that doesn't actually exist.
This is called “Atomic Write Unit Power Failure” (AWUPF).
> the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Yeah, that sounds about right for quite a lot of C programmers, except for the "they commit to checking this in the future" part. I've gotten responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
> why the file API so hard to use that even experts make mistakes?
Because it was poorly designed, and there is a high resistance to change, so those design mistakes from decades ago continue to bite
Something this misses is that all programs make assumptions, for example: "my process is the only one writing this file because it created it."
Evaluating correctness without that consideration is too high of a bar.
Safety and correctness cannot mean "impossible to misuse".
And yet all of these systems basically work for day-to-day operations, and fail only under obscure error conditions.
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file halfway through a read? Sorry, don't support that. Remove power to the storage device in the middle of a sync operation? Sorry, don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.
"PostgreSQL vs. fsync - How is it possible that PostgreSQL used fsync incorrectly for 20 years" - https://youtu.be/1VWIGBQLtxo
Ext4 actually special-handles the rename trick so that it works even if it should not:
"If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and [basically save your ass]"[0]
[0]https://docs.kernel.org/admin-guide/ext4.html
> they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug.
This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.
Of course, it doesn't always make sense to do that, like the dropbox use case.
Before becoming too overconfident in SQLite note that Rebello et al. (https://ramalagappan.github.io/pdfs/papers/cuttlefs.pdf) tested SQLite (along with Redis, LMDB, LevelDB, and PostgreSQL) using a proxy file system to simulate fsync errors and found that none of them handled all failure conditions safely.
In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:
1. The device powering off during the middle of a write, and
2. The device running out of space during the middle of a write.
I remembered Howard Chu commenting on that paper...
https://lists.openldap.org/hyperkitty/list/openldap-devel@op...
I'm pretty sure that's not where I originally saw his comments. I remember his criticisms being a little more pointed. Although I guess "This is a bunch of academic speculation, with a total absence of real world modeling to validate the failure scenarios they presented" is pretty pointed.
I believe it is impossible to prevent dataloss if the device powers off during a write. The point about corruption still stands and appears to be used correctly from what I skimmed in the paper. Nice reference.
> I believe it is impossible to prevent dataloss if the device powers off during a write.
Most devices write sectors atomically, and so you can build a system on top of that that does not lose committed data. (Of course if the device powers off during a write then you can lose the uncommitted data you were trying to write, but the point is you don't ever have corruption, you get either the data that was there before the write attempt or the data that is there after).
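A rough sketch of what building on sector atomicity can look like (a generic illustration of the technique, not LMDB's or Postgres's actual on-disk layout): keep two commit records, each fitting in one 512-byte sector, and on recovery take the newest one that validates.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR       512
#define COMMIT_MAGIC 0x434f4d4d49544d41ull   /* arbitrary constant */

/* One commit record per 512-byte sector; two slots are alternated on disk. */
struct commit_rec {
    uint64_t magic;          /* rejects blank or foreign sectors          */
    uint64_t txn_id;         /* monotonically increasing                  */
    uint64_t root_offset;    /* where the committed state lives           */
    uint64_t crc;            /* checksum of the three fields above        */
    uint8_t  pad[SECTOR - 32];
};

/* A real implementation would use CRC32/CRC64; this keeps the sketch short. */
static uint64_t checksum(const struct commit_rec *r)
{
    return (r->magic ^ r->txn_id) * 0x9E3779B97F4A7C15ull + r->root_offset;
}

static int slot_valid(const struct commit_rec *r)
{
    return r->magic == COMMIT_MAGIC && r->crc == checksum(r);
}

/* Recovery: a torn or never-written slot fails validation and is ignored,
 * so we always land on the last fully committed transaction. */
static const struct commit_rec *pick_valid(const struct commit_rec *a,
                                           const struct commit_rec *b)
{
    if (slot_valid(a) && slot_valid(b))
        return a->txn_id > b->txn_id ? a : b;
    if (slot_valid(a)) return a;
    if (slot_valid(b)) return b;
    return NULL;   /* no committed state at all */
}
```

The commit itself is then a single-sector write plus an fsync(); whether that sector write really is atomic under power loss is exactly the assumption being debated elsewhere in this thread.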
Only way I know of is if you have e.g. a RAID controller with a battery-backed write cache. Even that may not be 100% reliable but it's the closest I know of. Of course that's not a software solution at all.
That's uh, not running out of power in the middle of the write. That's having extra special backup power to finish the write. If your battery dies mid cache-write-out, you're still screwed.
I file that under hardware failure, not mundane power loss.
If the file system uses strict COW it should survive that situation.
>SQLite is the first tool I reach for.
Hopefully in whichever particular mode is referenced!
WAL mode, yes!
Do you turn on SQLite checksumming, or how do you feel comfortable that data on disk keeps its integrity?
As per HN headlines, files are hard, git is hard, regex is hard, time zones are hard, money as data type is hard, hiring is hard, people is hard.
I wonder what is easy.
Complaining :)
Selection error. The stuff that always works doesn't get posted here.
To reuse another HN headline, all this is probably because no one really cares X-)
I wonder if, in the Pillai paper, they tested the SQLite rollback option with the default synchronous setting [1] (`NORMAL`, I believe) or with `EXTRA`. I'm thinking that it was probably the default.
I kinda think, and I could be wrong, that SQLite rollback would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=F_FULLFSYNC` on macOS [2]).
[1]: https://www.sqlite.org/pragma.html#pragma_synchronous
[2]: https://www.sqlite.org/pragma.html#pragma_fullfsync
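To experiment with those pragmas from C, a minimal sketch using SQLite's documented sqlite3_exec() (whether `EXTRA` actually closes every vulnerability in rollback mode is exactly the open question):

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("test.db", &db) != SQLITE_OK)
        return 1;

    /* Rollback journal with the most paranoid sync settings; on macOS,
     * fullfsync asks SQLite to use F_FULLFSYNC instead of a plain fsync(). */
    char *err = NULL;
    sqlite3_exec(db, "PRAGMA journal_mode=DELETE;", NULL, NULL, &err);
    sqlite3_exec(db, "PRAGMA synchronous=EXTRA;",   NULL, NULL, &err);
    sqlite3_exec(db, "PRAGMA fullfsync=1;",         NULL, NULL, &err);

    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv(k TEXT PRIMARY KEY, v TEXT);",
                 NULL, NULL, &err);
    if (err) { fprintf(stderr, "%s\n", err); sqlite3_free(err); }

    sqlite3_close(db);
    return 0;
}
```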
No mention of NTFS or Windows in the article, for those interested.
Although the conference this was presented at is platform-agnostic, the author is an expert on Linux, and the motivation for the talk is Linux-specific. (Dropbox dropping support for non-ext4 file systems)
The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.
Universities can get Windows source code under NDA and do research on it but nobody really cares about such work.
"Getting windows source code under NDA" doesn't necessarily mean "can do research on it".
If you can't publish it, it's not research. If the source code is under NDA, then Microsoft gets the final say about whether you can publish or not, and if the result is embarrassing to Microsoft, I'm guessing it's "or not".
Is that because the windows APIs are better? Or because businesses build their embedded systems/servers with Windows?
Certainly depends on which APIs you ultimately use as a developer, right? If it is .NET, they're super simple, and you can get IOCP for "free" and non-blocking async I/O is quite easy to implement.
I can't say the Win32 File API is "pretty", but it's also an abstraction, like the .NET File Class is. And if you touch the NT API, you're naughty.
On Linux and macOS you use the same API, just the backends are different if you want async (epoll [blocking async] on Linux, kqueue on macOS).
The Windows APIs are certainly slower. Apart from IOCP I don't think they're that much different? Oh, and mandatory locking on executable images which are loaded, which has its advantages and disadvantages (it's why Windows keeps demanding restarts).
I doubt that, was just curious how it might compare in the article.
> On Linux ZFS, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.
ZFS fsync will not fail, although it could end up waiting forever when a pool faults due to hardware failures:
https://papers.freebsd.org/2024/asiabsdcon/norris_openzfs-fs...
ZFS on Linux unfortunately has a long standing bug which makes it unusable under load: https://github.com/openzfs/zfs/issues/9130. 5.5 years old, nobody knows the root cause. Symptoms: under load (such as what one or two large concurrent rsyncs may generate over a fast network - that's how I encountered it) the pool begins to crap out and shows integrity errors and in some cases loses data (for some users - it never lost data for me). So if you do any high rate copies you _must_ hash-compare source and destination. This needs to be done after all the writes are completed to the zpool, because concurrent high rate reads seem to exacerbate the issue. Once the data is at rest, things seem to be fine. Low levels of load are also fine.
There are actually several distinct issues being reported there. I replied responding to everyone who posted backtraces and a few who did not:
https://github.com/openzfs/zfs/issues/9130#issuecomment-2614...
That said, there are many others who stress ZFS on a regular basis and ZFS handles the stress fine. I do not doubt that there are bugs in the code, but I feel like there are other things at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly due to reasons external to ZFS (and sometimes, reasons internal to ZFS, such as data deduplication).
I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.
Regarding your comment - seems unlikely that it "affects Ubuntu less". I don't see why that would be the case - it's not like Ubuntu runs a heavily customized kernel or anything. And thanks for taking a look - ZFS is just the way things should be in filesystems and logical volume management, I do wish I could stop doing hash compares after large, high throughput copies and just trust it to do what it was designed to do.
Ubuntu kernels might have a different default IO elevator than proxmox kernels. If the issue is in the IO elevator (e.g. it is reordering in such a way that some IOs are delayed indefinitely before being sent to the underlying device) and the two use different IO elevators by default, then it would make sense why Ubuntu is not affected and proxmox is. There is some evidence for this in the comments as people suggest that the issue is lessened by switching to mq-deadline. That is why one of my questions asks what Linux IO elevator people’s disks are using.
The correct IO elevator to use for disks given to ZFS is none/noop as ZFS has its own IO elevator. ZFS will set the Linux IO elevator to that automatically on disks where it controls the partitioning. However, when the partitioning was done externally from ZFS, the default Linux elevator is used underneath ZFS, and that is never none/noop in practice since other Linux filesystems benefit from other elevators. If proxmox is doing partitioning itself, then it is almost certainly using the wrong IO elevator with ZFS, unless it sets the elevator to noop when ZFS is using the device. That ordinarily should not cause such severe problems, but it is within the realm of possibility that the Linux IO elevator being set by proxmox has a bug.
I suspect there are multiple disparate issues causing the txg_sync thread to hang for people, rather than just one issue. Historically, things that cause the txg_sync thread to hang are external to ZFS (with the notable exception of data deduplication), so it is quite likely that the issues are external here too. I will watch the thread and see what feedback I get from people who are having the txg_sync thread hang.
Thanks a lot for elaborating. I'm traveling at the moment, but I'm going to try reproducing this issue once I'm back in town. IIRC I did do partitioning myself, using GPT partition table and default partition settings in fdisk.
Upd mq-deadline for all drives seems to be `none` for me. OS is Ubuntu 22.04
> Upd mq-deadline for all drives seems to be `none` for me.
I am not sure what you mean by that. One possibility is that the ones who reported mq-deadline did better were on either kyber or bfq, rather than none. The none elevator should be best for ZFS.
I mean that it's already "none" on the machine where I encountered this bug. "Upd" was merely to signal that I've made an edit to my post.
The mq-deadline part is what confused me. That is a competing option for the Linux IO elevator that runs under ZFS. Anyway, I understand now and thanks for the data point. I added a list of questions to GitHub that could help narrow things down if you take time to answer them. I will be watching the GitHub thread regularly, so I will see it if you post the answers after you return from your travels.
I was confused, actually, not you. The output was:

    cat /sys/dev/block/8:176/queue/scheduler
    [mq-deadline] none

However, this output does not mean what I thought it did - it means that mq-deadline is in use. If I do

    echo "none" | sudo tee /sys/dev/block/8:176/queue/scheduler

this changes to

    cat /sys/dev/block/8:176/queue/scheduler
    [none] mq-deadline
The article wraps up with this salient point:
> In conclusion, computers don't work (but I guess you already know this...
They work.
Just not all the time.
No Javascript or SNI:
https://archive.wikiwix.com/cache/index2.php?rev_t=&url=http...
it's a good thing I'm a Web developer.
closest I come to working with files is localStorage, but that's thread safe.
this whole thing is a story about using outdated stuff in a shitty ecosystem.
it's not a real problem for most modern developers.
pwrite? wtf?
not one mention of fopen.
granted some of the fine detail discussion is interesting, but it doesn't make practical sense since about 1990.
The article is about the hardware and kernel level APIs used for interacting with storage. Everything else is by necessity built on top of that interface.
"fopen"? That is outdated stuff from a shitty ecosystem, and how do you think it's implemented?
I don't get it. The only times I've had problems with filesystem corruption in the past few decades was with a hardware problem, and said hardware was quickly replaced. FAT family has been perfectly fine while I've encountered corruption on every other FS including NTFS, exFAT, and the ext* family.
Meanwhile you can read plenty of stories of others having the exact opposite experience.
If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
> If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
Ponder this notion for a moment: there are problems within one's control and problems outside of one's control.
For example, we can't control the weather. If it snows three feet overnight you simply have to deal with the fact that you're not getting to work today.
Since we can't simply stop hardware from failing, we have to deal with the fact that hardware fails. Your seventeen redundant UPSes might experience a one in a trillion cascade failure. It might take the utility ten minutes longer to restore your power than you have onsite generation.
This is not a class of problem we can control or prevent. We fix these problems by building systems which withstand failures. You can't just will electrons out of the wall socket, but you can build a better disk or FS that corrupts less data when the electrons stop.
There was that time (2009 or so?) I wrote 2 million files to a single directory on NTFS and that filesystem was never the same again. It didn't seem to be a hardware problem. I used to be really careful not to put a crazy number of files in a directory on Linux and Windows, storing them instead in subdirectories whose names are digits derived from a hash of the file name (sketched below), but lately I've had some NTFS volumes with a 1M-file directory that seem to be OK.
Hardware problems also manifest in mysterious ways. On both Windows and MacOS I had computers that seemed to be OK until I did an OS update which caused enough IO that a failing HDD was pushed over the edge and the update failed; in one case I was able to roll back the update but not apply it, in another case the machine was trashed. Careful investigation (like taking the disk out and inspecting it on another computer) revealed a hard drive error, although there was no clear indication of this in the UI and the average person would blame the software update.
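The sharding scheme, in sketch form (the two-level fan-out and the FNV hash are arbitrary choices for illustration, not what any particular system uses):

```c
#include <stdio.h>

/* FNV-1a, just to get stable digits out of a file name. */
static unsigned long hash_name(const char *name)
{
    unsigned long h = 2166136261ul;
    for (; *name; name++)
        h = (h ^ (unsigned char)*name) * 16777619ul;
    return h;
}

/* Builds something like "storage/3a/7f/photo.jpg" so no single directory
 * ever holds millions of entries. */
static void shard_path(char *out, size_t outlen,
                       const char *root, const char *name)
{
    unsigned long h = hash_name(name);
    snprintf(out, outlen, "%s/%02lx/%02lx/%s",
             root, (h >> 8) & 0xff, h & 0xff, name);
}

int main(void)
{
    char path[4096];
    shard_path(path, sizeof path, "storage", "photo.jpg");
    puts(path);
    return 0;
}
```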
> If you keep losing data to power losses or crashes, perhaps fix the cause of that?
I keep telling my users to make sure to plug their phones in before the battery dies, but for some reason they keep forgetting...
Phones shut down when they get close to empty, but before they hit zero battery.
Then that's entirely their fault. They deserve all the corruption they get.
Seems like I hit a nerve. Apparently teaching users responsibility is a bad thing?
No wonder things are "hard". Because otherwise many in this godforsaken industry wouldn't need to be employed.