Some real cognitive dissonance in this article…
“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.
All this for brotli… On a read-many format like PDF, zstd’s decompression speed is a much better fit.
Yup, zstd is better. Overall, use zstd for pretty much anything that can benefit from general-purpose compression. It's a beyond-excellent library, tool, and family of algorithms.
Brotli w/o a custom dictionary is a weird choice to begin with.
Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.
That said, I personally prefer zstd as well, it's been a great general use lib.
You need to crank up zstd compression level.
zstd is Pareto better than brotli - compresses better and faster
I thought the same, so I ran brotli and zstd on some PDFs I had laying around.
I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.

Brotli seemed to have a very slight edge over zstd, even on the larger PDF, which I did not expect.
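For what it's worth, one way to expand the existing Deflate streams first, so the general-purpose compressors see the raw content, is a minimal Python sketch like the one below. It assumes the qpdf command-line tool is installed and a hypothetical "pdfs" folder; pikepdf would work similarly.

    # Sketch: expand the Deflate-compressed streams inside each PDF so that
    # zstd/brotli are compared on the underlying content rather than on
    # already-compressed data. Assumes the qpdf CLI is installed and on PATH,
    # and a hypothetical "pdfs" folder.
    import subprocess
    from pathlib import Path

    for pdf in Path("pdfs").glob("*.pdf"):
        out = pdf.with_name(pdf.stem + ".uncompressed.pdf")
        subprocess.run(["qpdf", "--stream-data=uncompress", str(pdf), str(out)], check=True)
        print(f"{pdf.name}: {pdf.stat().st_size} -> {out.stat().st_size} bytes")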
What's the assumption we could point to as the reason for the counter-intuitive result?
That data in PDF files is noisy, and zstd should perform better on noisy files?
What's counter-intuitive about this outcome?
I love zstd but this isn't necessarily true.
Not with small files.
Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.
This bizarre move has all the hallmarks of embrace-extend-extinguish rather than technical excellence.
Are they using a custom dictionary with Brotli designed for PDFs? I am not sure whether it would help, but it seems like one of those cases where it might?
Something like this:
https://developer.chrome.com/blog/shared-dictionary-compress...
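To make the idea concrete, here's a rough sketch of what a trained shared dictionary looks like in practice, using zstd's dictionary training via the third-party python-zstandard package (Brotli's shared-dictionary support is the analogous mechanism). The corpus path and sizes are placeholders.

    # Sketch of the shared-dictionary idea: train a dictionary on a corpus of
    # representative samples (e.g. decompressed PDF content streams), then use
    # it for compression and decompression. Uses the third-party
    # python-zstandard package; Brotli's shared dictionaries are the
    # analogous mechanism. Paths and sizes are placeholders.
    import zstandard as zstd
    from pathlib import Path

    samples = [p.read_bytes() for p in Path("pdf_streams").glob("*.bin")]
    dictionary = zstd.train_dictionary(112_640, samples)  # ~110 KiB, arbitrary size

    cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)

    data = Path("new_stream.bin").read_bytes()
    compressed = cctx.compress(data)
    assert dctx.decompress(compressed) == data
    print(f"{len(data)} -> {len(compressed)} bytes with a trained dictionary")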
In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
The PDF Association is still running experiments on whether or not to support custom dictionaries, based on the gains seen on real-life workloads.
So it might land in the spec once it has proven to offer enough value.
How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.
Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
It's prototype-ish work to support it before it lands in the official specification. But it will indeed take some adoption time.
Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big open-source ones (pdfjs, poppler, pdfium) ship it, adoption can rise quickly.
There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF
What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).
- inside the file, the compressor can be varied according to the file content. For example, images can use JPEG, but that isn’t useful for compressing text (see the inspection sketch after this list)
- when jumping from page to page, you won’t have to decompress the entire file
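A small inspection sketch for the first point: each stream object inside a PDF carries its own /Filter entry, so the codec really is chosen per object. This assumes the third-party pikepdf package and a hypothetical example.pdf.

    # Sketch: list the compression filter of every stream object in a PDF,
    # showing that the codec is chosen per object (FlateDecode for text and
    # content streams, DCTDecode for JPEG images, etc.). Assumes the
    # third-party pikepdf package and a hypothetical example.pdf.
    import pikepdf

    with pikepdf.open("example.pdf") as pdf:
        for obj in pdf.objects:
            if isinstance(obj, pikepdf.Stream):
                print(obj.get("/Filter", "no filter (uncompressed)"))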
Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or your favourite compression format), instead of ending up with PDF.
Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.
Why not zstd?
I think this was the main reason (from the linked article) LOL:
"Brotli is a compression algorithm developed by Google."
They seem to have no idea about Zstandard or ANS/FSE, comparing it only with LZ77.
Sheer incompetence.
I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics.
I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
Here's a table with all the files:

Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.

Going by only compression ratio, Brotli is clearly better than the rest here, and Zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli here.
I wish I could share the data set for reproducibility, but I obviously can't do that with every PDF I happened to have laying around in my downloads folder :p
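If anyone wants to rerun the comparison on their own folder, here's a rough Python sketch that approximates the CLI invocations above. It assumes the third-party zstandard and brotli packages (xz and deflate come from the stdlib lzma and zlib modules), and the folder name is a placeholder.

    # Sketch: compress every PDF in a folder with zstd, brotli, xz/lzma and
    # zlib/deflate and print the totals, approximating 'zstd --ultra -22',
    # 'brotli -9', 'xz -9' and 'gzip -9'. Requires the third-party
    # 'zstandard' and 'brotli' packages; the folder name is a placeholder.
    import lzma
    import zlib
    from pathlib import Path

    import brotli
    import zstandard as zstd

    totals = {"original": 0, "zstd": 0, "brotli": 0, "xz": 0, "deflate": 0}
    cctx = zstd.ZstdCompressor(level=22)

    for pdf in Path("downloads").glob("*.pdf"):
        data = pdf.read_bytes()
        totals["original"] += len(data)
        totals["zstd"] += len(cctx.compress(data))
        totals["brotli"] += len(brotli.compress(data, quality=9))
        totals["xz"] += len(lzma.compress(data, preset=9))
        totals["deflate"] += len(zlib.compress(data, 9))

    for name, size in totals.items():
        print(f"{name}: {size / 1_048_576:.1f} MiB")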
incompetence
You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/
That mentions zstd in a weird incomplete sentence, but never compares it.
Hey, they did all the work and more, trust them!!!
> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.
They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?
Wouldn't lzma2 be better here, since a PDF is more read-heavy?
Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.
[1] https://news.ycombinator.com/item?id=46035817
This article is AI slop.
Yep.
'Your PDFs will open slower because we decided that the CDN providers are more important than you.'
If size were important to users, it wouldn't be so common for systems providers to crap out huge PDF files consisting mainly of layout-junk 'sophistication' with rounded borders and whatnot.
The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.
tl;dr Commercial entity is paying to have the ISO standard altered to "legalize" the SDK they are pushing, which is incompatible with standard PDF readers.
ISO is pay to play so :shrug:
No, this feature is coming straight from the PDF Association itself; we just added experimental support before it's officially in the spec to help with testing between different SDK processors.
So your comment is a falsehood.
It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.
https://pdfa.org/brotli-compression-coming-to-pdf/
> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.
> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.
Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.
MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.
It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).
I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. It probably can be added by AI without much difficulty; it is a simple feature. Compared to the ton of other complex features PDF has, this is an easy one.
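To give a sense of scale, handling one extra stream filter in a viewer is roughly this much code. This is a sketch only: '/BrotliDecode' is a placeholder name (the real filter name comes from the spec), and the third-party brotli package is assumed.

    # Sketch: what supporting one extra stream filter amounts to in a viewer.
    # '/BrotliDecode' is a placeholder name for illustration; the actual
    # filter name is defined by the forthcoming spec. Requires the
    # third-party 'brotli' package.
    import zlib
    import brotli

    def decode_stream(raw: bytes, filter_name: str) -> bytes:
        if filter_name == "/FlateDecode":   # existing Deflate path
            return zlib.decompress(raw)
        if filter_name == "/BrotliDecode":  # hypothetical new path
            return brotli.decompress(raw)
        raise NotImplementedError(f"unsupported filter {filter_name}")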
Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.