I've been patiently waiting to convert my ZFS array to bcachefs. I'm very excited about better SSD promotion logic. But I'm not willing to spend any time on an experimental filesystem on my main systems.
> But you can expect to get flamed about running Debian, or perhaps more accurately, not being willing to spearhead Kent's crusade against Debian's Rust packaging policies.
It is quite unfortunate that Kent couldn't have just said "Debian isn't supported, we will revisit this when bcachefs is more stable" and stopped talking after that. Debian and experimental software just don't work well together.
Oh, the author's completely misrepresenting what happened here.
We had a major snafu with Debian, where a maintainer volunteered to package bcachefs-tools (this was not something I ever asked for!), and I explained that Debian policy on Rust dependencies would cause problems, and asked him not to do that.
But he did debundle, and then down the road he broke the build (by debundling bindgen and ignoring the minimum version we'd specified), and then _sat on it for months_, without even reporting the issue.
So Debian users weren't getting updates, and that meant they didn't get a critical bugfix that fixed passing of mount options.
Then a bunch of Debian users weren't able to mount in degraded mode when a drive died. And guess who was fielding bug reports?
After that is when I insisted that if bcachefs-tools is packaged for Debian, dependencies don't get debundled.
If you're going to create a mess and drop it in my lap, and it results in users not able to access their filesystem, don't go around bitching about being asked to not repeat that.
Yeah just typical Debian stuff. jwz has been ranting about this for years. It's not worth spending any time on it.
Some suggestions:
- Only "supporting" the latest mainline kernel and latest tools. I prefer to point to CI system configurations to show exactly what it "supported"
- Make this clear via your website and a pinned issue on Github.
- Force users to report the versions they use via an issue template: https://docs.github.com/en/communities/using-templates-to-en.... Immediately close any issues not meeting your version/system requirements without further discussion or thought.
That last one’s great advice. I don’t remember if you can use checkboxes there and I’m too lazy to look at the moment, but I could imagine the first question being:
and auto-closing if set.

Do you ever admit you're wrong?
I think I did once back in 2002.
I seem to recall a previous fs creator with ego problems was tried and convicted of murder, and then his work unceremoniously disappeared into an oubliette.
I’m 99% sure you’re joking but as an outsider I have… concerns.
It does help to have a sense of humor :)
That was a good one! Keep up your humor. It's a tough environment out there.
Genuinely curious: it seems like you are making a remark on his character, right? But why did you do so? Just fed up? Or did he actually state something wrong in the parent comment?
I've been running bcachefs on my spare dedicated SteamOS gaming machine for fun. Especially for the SSD promotion logic. It's a spare computer with an old 128GB SSD and 3TB HDD that I've got as a single filesystem. I love not having to manage games between the SSD/HDD. Too bad it's a mITX build with no space for more old drives I could stick in.
I need to write up my experience. But I'm trying it out. Linux needs something like this. I've had issues, posted traces and had them fixed in seconds. Pretty damn amazing. I'd love to see a bigger team involved though.
My experience also. Kent is obviously very committed to the project.
A change to a filesystem should never be made in seconds.
Confidence intervals don’t have precise timelines associated with them. Sometimes you know exactly what the problem is when you hear the symptoms.
We always balance new work versus cleanup. I always have a laundry list of beefs with my own work. You often have a sneaking suspicion that this piece of code is wrong in a manner you can’t quite put your finger on. Or you do but a new customer is trumping a hypothetical error. Or ten people are blocked waiting for you to merge some other feature. And then someone gives you a test case and you know exactly what is wrong with it. Sometimes that’s fixing a null check, an off by one error, or sorting results. And sometimes the repro case basically writes itself.
Bcachefs is experimental, and last I heard, the authors hope to be able to declare it not experimental in ~6 months. To me, that's 'try on a cache server/build server' territory, not on anything where you even think about backing it up.
Kent Overstreet does have a problem working with people; I can well believe that interactions around bugs were painful. He likely should try to hire someone else to deal with users and distributions and use his limited patience to deal with the kernel maintainers. But it sounds like the OP was a bit naive about what it means to run an experimental FS.
> last I heard, the authors hope to be able to declare it not experimental in ~6 months
How long have they been saying that? I feel like that's been the case for years.
It's only been in the upstream kernel 6 months or so?
I know the FS has actually been around longer, but I've not heard claims about its stability before that. It's only now that it's getting a lot more testing
Searching bcachefs on LWN (https://lwn.net/Search/DoTextSearch) does make for a somewhat interesting read e.g. https://lwn.net/Articles/717379/ (from 2017) which suggests the disc format would have one last change (cf. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... which is for the upcoming 6.14 release, which notes the format will have at least one more change after the one included), and the numerous articles on its upstreaming (e.g. https://lwn.net/Articles/755276/ from 2018), and the stability claims do seem a bit, well, optimistic?
Given that it's a DB under the hood, we really need some new terminology to understand what's happening when there's a disc format change. I think that some of these format changes are in fact schema changes, which makes them a bit less risky (but not entirely risk free). I get the impression that after it is considered stable there may still be schema changes.
Yes, the low level format for btree nodes/btree keys isn't changing, and the fact that it's a database does ease a lot of backwards and forwards compatibility issues - for example, most key types can have new fields appended and old versions will just ignore the new field.
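To make that concrete, here is a toy Python sketch of the append-only idea (the field names and encodings here are invented for illustration, not bcachefs's actual key format): an old reader unpacks only the fields it knows about and ignores any trailing bytes a newer writer appended.

    import struct

    # "v1" of a key's value: (inode, size). "v2" appended an mtime field.
    V1_FMT = "<QQ"     # inode, size
    V2_FMT = "<QQQ"    # inode, size, mtime_ns (new field, appended at the end)

    def encode_v2(inode, size, mtime_ns):
        return struct.pack(V2_FMT, inode, size, mtime_ns)

    def decode_v1(buf):
        # Old code: unpack only the fields it knows; trailing bytes are ignored.
        inode, size = struct.unpack_from(V1_FMT, buf, 0)
        return {"inode": inode, "size": size}

    def decode_v2(buf):
        if len(buf) >= struct.calcsize(V2_FMT):
            inode, size, mtime_ns = struct.unpack_from(V2_FMT, buf, 0)
            return {"inode": inode, "size": size, "mtime_ns": mtime_ns}
        # New code reading an old key: the missing field gets a default.
        out = decode_v1(buf)
        out["mtime_ns"] = 0
        return out

    key = encode_v2(42, 4096, 1_700_000_000)
    assert decode_v1(key) == {"inode": 42, "size": 4096}  # old reader still works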
But schema changes can still be pretty impactful.
The big on disk format changes were:

- accounting rewrite in 6.11, which switched from a high-performance but complex and difficult-to-extend mechanism, where accounting was stored in memory with percpu counters and then bundled up in each journal entry, to storing them in btree keys and journalling them as deltas (a toy sketch of the delta idea follows this list).
The runtime code was so vastly different that it just wasn't possible to keep writing out the old accounting without retaining all of the old code side by side with the new code - it would have been really gross.
- backpointers change in 6.14, for fsck scalability improvements

This one would have been possible to do without rewriting all the backpointers, but that still would've had severe downsides. The main option would've meant permanently inflating the size of the backpointers btree by ~20%, IIRC, and backpointers are almost half of all metadata on a typical filesystem. There were ways to get around that, but they would've added a lot of complexity that we would've had to carry around forever.
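For the accounting change, here is a toy sketch of what "journalling them as deltas" means (the names are invented for illustration; this is not the bcachefs implementation): each journal entry carries increments rather than absolute counter values, and recovery folds any un-checkpointed deltas back into the persistent counters.

    from collections import Counter

    persistent = Counter()   # stand-in for counters persisted as btree keys
    journal = []             # stand-in for the on-disk journal

    def journal_delta(counter_id, delta):
        # Fast path: append a delta; no need to serialize every counter.
        journal.append((counter_id, delta))

    def checkpoint():
        # Fold journalled deltas into the persistent copy, then trim the journal.
        for counter_id, delta in journal:
            persistent[counter_id] += delta
        journal.clear()

    def replay_after_crash():
        # Recovery re-applies whatever deltas landed after the last checkpoint.
        checkpoint()

    journal_delta("dirty_sectors", +128)
    journal_delta("dirty_sectors", -16)
    replay_after_crash()
    assert persistent["dirty_sectors"] == 112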
But those are exactly the sorts of things the experimental phase is for - getting it on more and bigger systems and seeing what shakes out and finding out what we still need.
Honestly the ability to do those major changes in place is pretty exciting and might be a draw towards bcachefs for me. As long as it doesn't happen too often (with the experimental phase as an exception) it's really nice to be able to benefit from major features and improvements.
I really dislike the attitude some FS devs have that if you want some improvement made after you set up a system you should just reformat and restore from backup.
Sure a 5 hour metadata rewrite is disruptive, but it's nothing compared to a 5 day backup, reprovision, restore dance. The former can run overnight on a weekend; the latter takes basically the entire work week. Therefore it won't get done until all the infrastructure is EOL and replaced with something else 3-7 years down the line.
It was merged in 6.7, 12 months ago...
FWIW, I've done a few bcachefs bug reports and thought Kent's responses were great.
I should not have implied that every bug report would be a bad interaction - and to be fair, where I've seen it degenerate on the mailing list, it's often that the other person wasn't exactly civil to start with (which unfortunately includes Linus Torvalds). It's just that Kent seems to feel the need to fight back.
A file system is an emotional investment of a decade+ so I can understand Kent's desire to defend it the way he does.
That said, the kernel is the bigger picture: if he wants it to be successful and remain in mainline he needs to adapt to how Linus expects maintainers to operate. That probably means delaying getting fixes to mainline users until he can show they have been adequately tested. Until that trust is built bcachefs is always a few bad cycles away from being dropped from mainline.
Linus can be a bit of a dick sometimes (and hey, so can I), but he's not prone to being rash or a petty tyrant. He just rides people a bit hard.
Following process to the letter isn't what we should be doing right now; this isn't the time for slowing down to cross every t and dot every i, this is the time for getting fixes pushed out quickly so we can move on to the next thing.
But things have also been stabilizing dramatically, it's already looking like I'll have a lot less out of merge window stuff to send this cycle (there wasn't much that I would've sent last cycle, either).
So things should be pretty well calmed down.
Another thing is that I'm half expecting bcachefs to be pulled from mainline. That's my prediction for 2025, I guess? Not that I want that to happen, but Kent's behaviour has been ... difficult, and Linus already hinted this is a possibility.
I appreciate that Kent means well, but as an outsider reading some of his communications ... yeah, I'd struggle working with that too.
Is XFS well-regarded by others?
I was benchmarking filesystems by generating a billion empty files on them, and while ext2, ext4, and btrfs could finish in a day or two, xfs hit a wall in the first 4 million files, and was on track to take weeks. After hitting ctrl-c, it took 20+ minutes to get a responsive shell, and unmounting took a few minutes.
This surprised me, because it's been around for decades, and I expected scalability, performance, and robustness. A one-user accidental DoS was a surprising rake to trip over.
XFS is well regarded. It is kind of a second-generation UNIX filesystem that was designed from the ground up with journaling and extents and btrees and xattrs.
EXT4 is well regarded too, it has similar goodies now but it evolved from a first-generation style without journaling and with block (not extent) tracking, primitive kinds of trees implemented as N-indirect arrays, etc. Whether that really matters is debatable, these days I think it's well acknowledged that you have to evolve and adapt to new features and clean redesigns aren't necessarily better (XFS has gone through a bunch of new changes too, e.g., metadata checksums).
XFS came from SGI IRIX and there it was really focused on datapath throughput and scalability (think, HPC systems or graphics clusters working through enormous source and result data files). And they were much less focused on metadata performance.
XFS certainly was much slower and more heavyweight than EXT doing things like creating files, but there has been a lot of work on that over many years and it's a lot better than it was. That said, a billion files is a fairly extreme corner and it's not necessarily what's required for most users.
This paper I found is similar, although probably has many differences from exactly what you were doing:
https://arxiv.org/html/2408.01805v1
XFS creation performance was around par with EXT4 up to 100M files, but took 2x as long to create 1B. Maybe yours did much worse because you aren't splitting files into subdirectories but creating them all in one? That could suggest some XFS structure has a higher order of complexity and is starting to blow out a bit, directory index for example.
XFS often comes out pretty high up there on other benchmarks
https://www.phoronix.com/review/linux-611-filesystems/3
So as always, the better performing one is going to depend on what your exact workload is, especially if you are doing something unusual.
There's not really any reason to use ext4 over xfs anymore, unless you need fscrypt or shrinking.
Reflink support is super useful.
> Maybe yours did much worse because you aren't splitting files into subdirectories but creating them all in one?
No, and also, I'd expect that to be awful. 1000 folders, each with 1000 folders, each with 1000 files.
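For anyone wanting to reproduce that layout, here is a minimal Python sketch (the path and fanout are placeholders; fanout=1000 with depth=3 gives the billion-file tree described above, scaled down here so it finishes quickly):

    import os

    def make_tree(root, fanout=10, depth=3):
        def recurse(path, level):
            for i in range(fanout):
                child = os.path.join(path, str(i))
                if level == depth - 1:
                    open(child, "w").close()   # leaf level: empty files
                else:
                    os.mkdir(child)
                    recurse(child, level + 1)
        os.makedirs(root, exist_ok=True)
        recurse(root, 0)

    make_tree("/tmp/fstree")  # 10 + 100 directories, 1,000 empty files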
Those Arxiv and Phoronix links are great!
>https://www.phoronix.com/review/linux-611-filesystems/3
It's a shame that ZFS was not included.
Probably because it's benchmarking Linux filesystems.
I don't know what caused your experience, but I've had the opposite experience using XFS with many small files. It's my filesystem of choice for simple (single device) use cases.
The main reason is that XFS automatically allocates inode space. With ext4, I would quickly run out of the default inode quota while the volume was nowhere near full, and then manually tune the inode to block ratio to accommodate more files. XFS took care of that automatically. Performance was otherwise identical, and I've never seen any data loss bugs or even crashes from either one.
How many decades ago was that? Sounds more like a partition converted from ext3. Every ext4 partition I've seen in the last 15 years has had a ridiculous number of inodes. I do support for several hundred Linux systems.
Zero decades ago? I run EC2 instances that process hundreds of millions of small files. I always use the latest Ubuntu LTS.
I'm also tired of trying to share my experience and having to choose between answering snide ignorant low-brow dismissals, and leaving them unanswered so they can misinform people. Ext4 does not have a dynamic inode reservation adjustment feature. XFS does. So with ext4, it's possible to run out of inodes while there are blocks available. With XFS, it's not.
From this paper https://arxiv.org/html/2408.01805v1 (2024)
> EXT4 is the general-purpose filesystem for a majority of Linux distributions and gets installed as the default. However, it is unable to accommodate one billion files by default due to a small number of inodes available, typically set at 240 million files or folders, created during installation.
Which is interesting. I knew EXT2/3/4 had inode bitmaps, but I haven't been paying them much attention for the past decade. Slightly surprised they haven't added an option for dynamic allocation, OTOH inodes are small compared with storage and most people don't need billions of files.
That person is being extremely silly when they call that "small".
Ext2/3/4 reserves so many inodes by default. One per 16KB of drive space. You don't hit that with normal use. Almost everyone should be reducing their inode count so it doesn't take up 1.6% of the drive.
> That person is being extremely silly when they call that "small".
Explain.
> Ext2/3/4 reserves so many inodes by default. One per 16KB of drive space. You don't hit that with normal use. Almost everyone should be reducing their inode count so it doesn't take up 1.6% of the drive.
Well almost, but not the OP who runs out of inode space with the default format.
> Explain.
Hundreds of millions of inodes is not a small number. I'm not sure how I can explain that much better. There are multiple orders of magnitude between "240 million inodes" and "a small number of inodes".
And on a 14TB hard drive, the default would be more like 850 million.
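The arithmetic behind these figures, assuming the ext4 defaults quoted in this thread (one 256-byte inode per 16 KiB of device size):

    TB = 10**12
    INODE_SIZE = 256             # bytes per inode (default since 2008)
    BYTES_PER_INODE = 16 * 1024  # one inode per 16 KiB of device size

    for size_tb in (4, 14):
        inodes = size_tb * TB // BYTES_PER_INODE
        print(f"{size_tb} TB -> {inodes / 1e6:.0f} M inodes")
    # 4 TB  -> 244 M inodes (roughly the paper's "240 million" default)
    # 14 TB -> 854 M inodes (the ~850 million figure above)

    print(f"inode table overhead: {INODE_SIZE / BYTES_PER_INODE:.2%}")
    # 1.56%, i.e. the "1.6% of the drive" mentioned above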
> Well almost, but not the OP who runs out of inode space with the default format.
I said "almost" for a reason. It's a bad idea for quite small drives or some rare use cases.
> Hundreds of millions of inodes is not a small number. I'm not sure how I can explain that much better. There are multiple orders of magnitude between "240 million inodes" and "a small number of inodes".
The poster with the inode problem in the subthread I replied to said "many small files" and "millions of small files", not "a small number of inodes".
> I said "almost" for a reason. It's a bad idea for quite small drives or some rare use cases.
Though source trees often are smaller than 16kB/inode, so the advice to decrease inode allocation just to save a fraction of 1.6% of space may not be too good. I would leave it as default unless there is good reason to change. And that's of course the trouble with fixed inode allocation.
> The poster with the inode problem in the subthread I replied to said "many small files" and "millions of small files", not "a small number of inodes".
I quoted the paper. I said nothing about OP.
> Though source trees often are smaller than 16kB/inode, so the advice to decrease inode allocation just to save a fraction of 1.6% of space may not be too good.
For drives that aren't going to have tons of uncompiled source trees, more than a percent is a lot. That could be a hundred gigabytes wasted. Almost any drive that isn't a small OS partition can do just fine at .4 or .2 percent.
> I quoted the paper. I said nothing about OP.
That passage is in context of a billion inodes though, where it is a small number by comparison. It's obviously not calling it a small number by some absolute or objective standard.
> For drives that aren't going to have tons of uncompiled source trees, more than a percent is a lot. That could be a hundred gigabytes wasted. Almost any drive that isn't a small OS partition can do just fine at .4 or .2 percent.
If you have 10TB then saving 80GB isn't really a lot at all. It's about 0.8%. You really wouldn't recommend an average user change that at all. If they were desperate for a few more GB you might drop the reserved % down a couple of points first, which is already more than the entire inode allocation.
That paper is not saying it's small "in comparison", it's saying it's small for a filesystem. It's silly.
> You really wouldn't recommend an average user change that at all.
Yes I would. If they were stuck with ext4.
In general I'd recommend something with checksums.
> reserved %
Oh god definitely change that, it should cap at 10GB or something.
> That paper is not saying it's small "in comparison", it's saying it's small for a filesystem. It's silly.
It's not, it's clearly saying the inode count is too small for their 1 billion file test.
> Yes I would. If they were stuck with ext4.
I meant good advice. There are enough stories of people running out of inodes, even in the tiny sample in this thread, that it's not good advice to cut the inode count 4x.
I ran a very small personal webserver with limited storage on Ubuntu on EC2 for a while.
The EC2 instance, likely the smallest configuration available at the time, hit an inode limit just running updates over time.
Super small drives are a special case for sure. You don't want to go below a certain number of inodes even on a 5GB root, but you also don't want to scale that up 50x on a 250GB root.
Why are super small drives a special case? It's still the same data to inode ratio.
There's a lot of small files that come with your typical OS install, going into the first handfuls of gigabytes.
When you add on another terabyte, the distribution is totally different. The files are much bigger.
Right, so it's entirely about usage rather than filesystem size.
Different sizes have different uses.
With gentoo, if you allocate let's say 20G to / on ext4, then you can quite easily run into this issue.
/usr/src/linux will use about 30% of the space and 10% of the inodes.
/var/db/repos/gentoo will use about 4% of the space and 10% of the inodes.
Next you clone the firefox-source hg repo, which will use about 15% of the space and 80% of the inodes.
> Next you clone the firefox-source hg repo, which will use about 15% of the space and 80% of the inodes.
Looking at my mozilla checkout the source and repo average 6KB per file, which would eat lots of inodes.
But once I compile it, it's more like 20KB per file, which is just fine on default settings. So I'm not sure if the inodes are actually the limiting factor in this scenario?
And now that they're moving to git, the file count will be about 70% smaller for the same amount of data.
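A quick way to frame the question of whether inodes are the limiting factor: with ext4's default of one inode per 16 KiB, a tree whose average file is smaller than 16 KiB exhausts inodes before blocks, and vice versa. A sketch:

    # Which resource runs out first on default ext4 (one inode per 16 KiB)?
    def first_exhausted(avg_file_bytes, bytes_per_inode=16 * 1024):
        return "inodes" if avg_file_bytes < bytes_per_inode else "blocks"

    print(first_exhausted(6 * 1024))    # source checkout (~6KB/file)  -> inodes
    print(first_exhausted(20 * 1024))   # after compiling (~20KB/file) -> blocks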
Based on the mke2fs git history, the default has been a 256 byte inode per 16KB of drive size since 2008, and a 128 byte inode per 8KB of drive size before that.
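Interestingly, both defaults reserve the same fixed fraction of the device for the inode table:

    # 128-byte inodes at 8 KiB each (pre-2008) and 256-byte inodes at 16 KiB
    # each (since 2008) both dedicate 1/64 of the disk to inodes.
    for inode_size, bytes_per_inode in ((128, 8 * 1024), (256, 16 * 1024)):
        print(inode_size / bytes_per_inode)   # 0.015625 in both cases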
i had ext4 telling me it was out of space with 52% used in df lol
i just converted it in place to btrfs
What's all that about the bcachefs author's complaints with Rust and Debian? I'm far out of the loop on this stuff.
His side is that basically it doesn't make sense to package the tools for an experimental FS in an OS that's going to get very far behind, as debian stable will do, since the tools have to iterate rapidly to fix problems with releases of the FS. Debian-stable had an old set of some rust libraries and the packager relaxed the dependency constraints in order to package it, which doesn't make a lot of sense for something you are hoping will fix your filesystem.
Basically it shouldn't be packaged for LTS-class OS releases until it's not experimental.
Kent's issue IIRC was that they loosened the dependencies, full-stop. Debian presumably isn't replicating his entire testing regime when they change all the dependencies being compiled in. The potential exists for a bug to be introduced that is genuinely consequential to the user, more so than for, say, a broken calendar app. Combine that with a rapidly changing FS and the fact that any issues would likely be blamed on his FS and I can see why he might feel that way.
If a distro chose to build some program with a different set of dependencies than what is specified by its author, then arguably it is not the same piece of software anymore. If Debian wants to do that - which I think they 100% have the rights to do, this is free software after all - they should make it clear that they will also take over the maintenance responsibility for it as well.
Distros generally do try to do this, but it doesn't matter much in practice, not enough users go to the distro by default (in part because the distro maintainers, quite understandably, can't actually solve most of the problems, even the ones they caused). So a distro that releases a broken package inevitably causes a support burden for upstream.
(I personally agree that debian's "let's just relax version dependencies on rust projects and hope that it works" policy is insane: a recipe for all kinds of subtle breakage that no-one working upstream is even trying to avoid. And this kind of fiddling is not unique to rust, either, nor is bcachefs the first project to say "don't use debian's packages")
> they should make it clear that they will also take over the maintenance responsibility for it as well.
Don't they already? I was always told that if you hit a bug in a distro package, you report it to the maintainer, and then if applicable they can pass it upstream. The whole point of a distro (at least the kind Debian is) is to be its own thing.
Oh, if only.
A large part of the issue here was the snafu resulted in Debian users not getting updates for months for a critical fix, and then not being able to mount in degraded mode, and I was the one fielding those bug reports.
Okay, so ask if they're using your packages or the distro packages, and if it's the latter tell them to talk to the person who maintains those packages. If it was me I would have that be part of the bug report form up front, but I don't know what your process is.
That's just passing the buck, it doesn't do anything for the users who were affected by the very real screw up
It also doesn't work in practice because I have to do most of the diagnosis before I find out if it's my bug or Debian's.
The only real solution is for Debian to be better at working with upstream, and not do things they've been told are going to cause problems, and not drop the ball when they do.
> It also doesn't work in practice because I have to do most of the diagnosis before I find out if it's my bug or Debian's.
No, you ask that first and if it's a downstream package you stop working on it. If the downstream maintainer determines that it is on your end and not theirs, then you can pick it back up.
> The only real solution is for Debian to be better at working with upstream, and not do things they've been told are going to cause problems, and not drop the ball when they do.
Assuming that "working with upstream" means "adopting upstream code in direct contravention of their own policies": If your solution depends on Debian not being Debian, then it's unlikely to work.
> No, you ask that first and if it's a downstream package you stop working on it. If the downstream maintainer determines that it is on your end and not theirs, then you can pick it back up.
They're not going to devote that kind of time. That just means bugs not getting looked at or fixed.
> Assuming that "working with upstream" means "adopting upstream code in direct contravention of their own policies": If your solution depends on Debian not being Debian, then it's unlikely to work.
I'm not sure why you think that policies that are causing problems shouldn't change.
Again: they volunteered to package it, they did it badly and users were affected. Until they can get their act together, bcachefs-tools won't be in Debian.
That's ok. There are other distros, and there's no rush.
Distros do this all the time with C libraries. Until newer languages added support for lock files and stuff, it was pretty normal to use slightly different dependencies.
Lock files _are_ code. Same for meson.build or CMakeLists.txt. We didn't use to have the ability to precisely specify dependencies in code; now we do.
Building with a patched codebase is effectively building a fork.
> [bcachefs author's] side is that basically it doesn't make sense to package the tools for an experimental FS in an OS that's going to get very far behind
If the bcachefs author believes the tool is too unstable for Debian stable and Debian developers believe bcachefs is too unstable for Debian stable [1], it sounds as if they agree.
[1] https://jonathancarter.org/2024/08/29/orphaning-bcachefs-too...
Ah, I see. So "the Rust team" here means the team of Debian maintainers who package Rust for it, not the Rust developers? That makes sense.
Yeah I assume that's what "the Rust team" meant
This makes perfect sense to me. Why would Debian-stable include something obviously not stable. They have an experimental branch for a reason
To my knowledge, Debian changed deps in bcachefs-tools to synchronize with Debian's Rust repo, breaking it. It's partly that it's a green fs, and partly clashing expectations between Rust and Debian about how to treat dependencies.
Debian would compile dependencies with versions lower than specified in the project, reintroducing bugs that users would blame upstream for, https://www.phoronix.com/news/Debian-Orphans-Bcachefs-Tools
Why is Kent not providing his own deb packages that users can install to override the Debian provided ones to get updates?
Perhaps down the road, after experimental is lifted. For now I've generally been telling distros to slow down.
For anything this big staging the release is important, we don't need or want to be in every distro right away (the Fedora people were also _extremely_ gung ho in the past and I told them the same thing).
Until every outstanding critical bug report is fixed (and there are still a few), I want power users testing it who know how to deal with things breaking, I don't want unsuspecting users getting bit.
Devil's advocate: You may find that it better achieves your goal to provide packages and say "For power users testing only". You might also be able to petition Debian to have it removed if it's leading to people foot-gunning.
If I can help with deb/rpm packaging, let me know. I'm a power user here, excited about bcachefs, but haven't tried it because I don't have the time to give it a worthy test (remembering how much time I put into getting ZFS stable on Linux). Thanks for your hard work on it!
I'd love for someone to pick up the deb packaging. My only requirement is no unbundling :)
If we can't get it into the official debian repo like that, we can just ship it as a ppa for now.
Already using it on a 3 drive gaming PC on NixOS and it's been great so far. The caching algo massively speeds up interactions and load times.
Hope you continue the good work and best of luck.
Because he isn't a Debian packager. That isn't his job.
Got it. I could see that being an issue.
It seems honestly odd to me not to just vendor what you need as standalone packages if your dependencies are that specific and you're a filesystem, i.e. you use the bcachefs-errno package, not errno.
Debian tends to put their principles above pragmatism (for better or worse), so would they even agree with such vendoring when it's entirely meant as a way to bypass their vision/requirements/process for how dependencies should be handled?
That particular principle is born of pragmatism; Debian long ago learned the lesson that other distros are determined to relearn ( https://thenewstack.io/vendoring-why-you-still-have-overlook...) - vendoring is not good for security. In fact, I have come to view Debian's commitment to principles as almost always a practical matter, because those principles (almost?) always trade short-term pain for long-term quality and stability.
It's also effectively what the big cloud vendors do with their monorepos. This makes sense when you have upstreams which are slow at upgrading (e.g. it looks like Debian is upgrading packages using older bootstrap to bootstrap v5 across the board, and such fixes get pushed upstream; there's also tooling to watch new releases, so Debian's tooling effectively acts like a system-wide dependabot).
This isn't a Debian problem though, that's the point: if bcachefs-tools has such specific dependencies, then why doesn't it vendor its own dependencies so it's clear they are packaged and used independently?
A bunch of the packages in the release at hand were actually upgraded, not downgraded, for example.
A counterargument would be that Rust+Cargo pins specific versions already. If you’re writing Rust, you should rarely need to vendor anything unless you’re maintaining a patched version or something.
Vendoring also bypasses the package cache and build cache. If 2 apps depend on foo-1.2.3, they can normally share the cached package and its build artifacts.
Basically, Cargo goes a long way toward ensuring you rarely need to bother with adding someone else’s code to your repo.
Cargo does a per-project build cache, not a shared one.
Oh, guess it does. I've been using sccache so long that I'd forgotten that.
Do you know off-hand why it doesn't, though? If 2 packages use foo 1.2 with the same features and, say, the default build settings, why not share them by default?
I think at the time Cargo was made it was just far simpler to implement. It's not just that, it's also rustc version, sometimes environment variables... much less likely to cause problems by keeping it per project. Of course that stuff still needs to be kept track of, but like, "to get a clean build, kill this directory" seems easier. Not sure if there is an explicit justification written down anywhere from 11 years ago.
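A toy sketch of why a shared cache needs a richer key (an illustration of the idea, not sccache's actual scheme): the key must cover everything that can change the build output, so two builds can share artifacts only when all of it matches.

    import hashlib, json

    def cache_key(crate, version, features, rustc_version, env):
        blob = json.dumps([crate, version, sorted(features),
                           rustc_version, sorted(env.items())])
        return hashlib.sha256(blob.encode()).hexdigest()

    a = cache_key("foo", "1.2.3", ["default"], "rustc 1.84.0", {})
    b = cache_key("foo", "1.2.3", ["default"], "rustc 1.84.0",
                  {"RUSTFLAGS": "-O"})
    assert a != b  # same crate, different build settings: can't share artifacts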
I don't expect ZFS to be dethroned this decade, nor the next one.
I used to run zfs with bcache blocks lol
Some of the zpool configurations I have set up seem "odd" to others, but they do have a purpose and are well thought out.
What motivated you to combine ZFS and bcache? What did that configuration look like? What was the thinking behind it? I would prefer to inquire further rather than make and present wrong assumptions as to why the configuration is wrong.
User uses experimental fs. It breaks.
Surprised Pikachu face
Yes and no—said user ran this as an experiment (relevant quote: "But with a filesystem this young, you inevitably have to accept some rough edges in return for some fun."), but eventually stopped it partially due to the people involved (relevant quotes: "But here's the main issue for me: Reporting bugs to bcachefs is not a pleasant experience."; "And at some point, the fun just stopped for me."), not just because there were bugs.
fun is not something i want to describe my fs as
If that's what you got from the article, I recommend you read it again.
I stick to OpenZFS, it's rock solid.
Filesystems are the one thing I don't want to experiment with. Too much potential trouble.
Can it still be called "rock solid" with a data corruption bug in its younger history? https://news.ycombinator.com/item?id=38770168
Every filesystem has had data corruption bugs. Unfortunately, that's just how it is with the complexity of modern filesystems and the tools we have.
(We're still coding in bloody _C_).
What you really don't want is to be losing entire filesystems. A single file here and there getting corrupted is one thing, and it's usually highly workload dependent so it'll probably only hit one application, but losing an entire filesystem is much more impactful.
ext2/3/4 has the best track record here (and note, it's also had data corruption bugs), with the relatively simple on-disk format and e2fsck being quite robust. XFS is probably next up, I've seen reports of XFS filesystems being lost but only one or two - a tiny, tiny fraction of the btrfs reports. Can't speak as much to ZFS.
> Can it still be called "rock solid" with a data corruption bug in its younger history? https://news.ycombinator.com/item?id=38770168
Ok, it is sand solid. There were a lot of voices on HN trying to downplay the issue, but a filesystem cannot be called stable if it corrupts data in 2020.
Absolutely. We can trust it because it has already gone through all this. It is battle-proven.
Unlike some new filesystem created just yesterday.
Yes.