The Garbage Collection Handbook: The Art of Automatic Memory Management (2nd Ed) (2023)

223 points | by teleforce 20 hours ago

47 comments

pdevr 17 hours ago
I had the 2012 print edition. One of the best books available - the best book that I knew - about GC at that time.
Anti-pattern: Regarding the 2023 e-book edition, I do not see a way to buy it from the site, or even a link to buy.
- Jtsummers 17 hours ago
  I thought I was missing something but figured they just didn't have a link to purchase it. I ended up going to the publisher's site to track it down:
  https://www.routledge.com/The-Garbage-Collection-Handbook-Th...
  - LandR 9 hours ago
    How on earth is the ebook more expensive than the physical copy!
    - LtdJorge 6 hours ago
      And it's not an actual ebook. You have to create an account on the "ebook" provider's site and read it through their website or app.
      - metalliqaz 4 hours ago
        Worthless.
      - pmarreck 5 hours ago
        I hate when they do this. I'd rather just buy the physical book, donate it to a library and then obtain the liberated version of the ebook "elsewhere".
    - PaulRobinson 8 hours ago
      The physical copy will only get used by one reader (at a time). The ebook is going to be, err, "liberated" more often than not around a dorm?
      In all seriousness, this is likely a nudge to a preference they have for how they want to sell this and how you should want to buy it.
      - LtdJorge 6 hours ago
        They advertise the ebook as having interactive features and more compared to the print version. Shouldn't it be the preferred one?
- toolslive 11 hours ago
  I have the 2012 print edition too and fully concur with the assessment. The best book about garbage collection at the time (and maybe still?)
piinbinary 6 hours ago
I have this book and like it.
I also wish there was a book that guided you through the process of implementing a language with accurate garbage collection, similar to how Crafting Interpreters teaches you to implement a language. Perhaps it could start with a shadow stack + simple mark-and-sweep and then move on to stack maps + generational GC.
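A minimal sketch of the starting point suggested above, a shadow stack feeding a simple mark-and-sweep pass, might look like the following. It is a toy simulation rather than a real collector: `Obj`, `Heap`, `shadow_stack`, and `collect` are hypothetical names, and "freeing" just means dropping an object from the toy heap's list.

```python
class Obj:
    """Object in a toy managed heap; `refs` holds its outgoing pointers."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.marked = False


class Heap:
    def __init__(self):
        self.objects = []       # every object allocated so far
        self.shadow_stack = []  # explicit root set maintained by the program

    def alloc(self, name):
        obj = Obj(name)
        self.objects.append(obj)
        return obj

    def collect(self):
        # Mark phase: trace everything reachable from the shadow stack.
        worklist = list(self.shadow_stack)
        while worklist:
            obj = worklist.pop()
            if not obj.marked:
                obj.marked = True
                worklist.extend(obj.refs)
        # Sweep phase: unmarked objects are garbage; reset marks on survivors.
        freed = [o.name for o in self.objects if not o.marked]
        survivors = [o for o in self.objects if o.marked]
        for o in survivors:
            o.marked = False
        self.objects = survivors
        return freed


heap = Heap()
a, b, c, d = (heap.alloc(n) for n in "abcd")
a.refs.append(b)              # a -> b, reachable once a is a root
c.refs.append(d)
d.refs.append(c)              # c <-> d form an unreachable cycle
heap.shadow_stack.append(a)   # only a is a root
freed = heap.collect()
live = [o.name for o in heap.objects]
```

Because marking starts from the roots, the unreachable cycle between `c` and `d` is swept without any special handling. Stack maps and generations would replace the explicit `shadow_stack` and the single `objects` list respectively.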
ChrisArchitect 16 hours ago
Previously:
Dec 2025 https://news.ycombinator.com/item?id=46357870
Apr 2023 https://news.ycombinator.com/item?id=35492307
orionblastar 20 hours ago
I remember reading it before. My son threw it away when we moved houses, not knowing how important it was. I'd recommend it.
- tikhonj 19 hours ago
  Ironic that this of all books got garbage collected prematurely.
- travisgriggs 19 hours ago
  Q: what kind of collection is this real world example illustrating?
  A: Copying Garbage Collector (semi space). Chapter 4!
  Great book. I was always fascinated by Baker's treadmill. Always wanted a real world case where I could implement one with Fibonacci sized mills.
rayiner 18 hours ago
How good are AIs at coding manual memory management? Is this a sea change in automatic memory management?
- adrian_b 12 hours ago
  In my opinion, the "manual" memory management, introduced by the IBM PL/I programming language by the end of 1964, and inherited by C and other languages, i.e. where the programmer is responsible for invoking "free", was a serious mistake and it was an obsolete technique already at the date of its introduction.
  When the explicit "free" was invented, automatic memory reclamation while avoiding the non-determinism of garbage collectors had already been known for 4 years, since 1960, when another IBM employee had invented reference counting (as a reaction to the garbage collector of LISP I).
  When implemented naively, reference counting has some disadvantages, but those can be circumvented relatively easily in an optimized implementation. The book discussed in the parent article also has a chapter about reference counting.
  I have written C programs for many decades, but I have never invoked "free" directly, because I have always used reference counts. I have never encountered a circumstance when I would have wanted to invoke "free" directly.
  C has the disadvantage that the compiler will not do implicitly things like virtual function invocation, reference counts handling etc. but any such techniques that are provided by higher-level languages can still be used in a language like C, even if they require more boilerplate code.
  I do not like the "shared_ptr" implementation of reference counting in C++, because that data type is not directly usable in places where a plain reference or pointer is expected. Implementations that do not have this problem exist.
  - dcuthbertson 6 hours ago
    I too have written C programs for decades. I encountered reference counting when I learned how to write Windows kernel drivers. It was very liberating to see that there were so many fewer opportunities for memory leaks when reference counts were applied liberally.
- a-french-anon 12 hours ago
  GC's strength is not only in the ease of writing but also in reading, since you don't need to interleave allocation and business logic everywhere (be it through types or imperative code).
  GC simply is the only way to approach the clarity of pseudo-code in real code. That's one of my later realizations concerning the subject (https://world-playground-deceit.net/blog/2024/11/how-i-learn...)
  - adrian_b 11 hours ago
    You mean "interleave deallocation and business logic everywhere".
    For allocation there is no difference between automatic memory management with garbage collectors or reference counts and manual memory management, where the programmer is responsible for invoking "free".
    These alternative memory management methods differ only in how deallocation is handled.
    Allocation must always be done by defining a new object, regardless of how memory is managed. Moreover, allocation also does not depend on whether an object is allocated in static storage, in a stack or in a heap. You always must define the object, so that memory should be allocated for it at compile-time if in static storage, or at run-time if in a stack or in a heap.
    - a-french-anon 11 hours ago
      Well, it's true that most of the noise comes from deallocation but I did mention types for good reasons: having your code littered with std::shared_ptr (or worse for Rust where encoding lifetime goes much further than Rc/Arc) is a direct consequence of wanting GC without a global one.
      The source of my second revelation: GC should be opt-out (e.g. SBCL's arena system) instead of opt-in via refcounted types.
      - adrian_b 8 hours ago
        While C and C++ have ugly syntaxes for "malloc", "std::shared_ptr" and the like, it is quite easy to mask that ugliness by using macros and thus writing only clean-looking programs, without any noise words and symbols around dynamic allocations.
        In general, by using macros it is possible to transform a C or C++ program so much that it becomes unrecognizable as C/C++ and can mimic reasonably well any other programming language that you might fancy.
        The problem is when you work in a team, because even if everyone agrees that such programming languages have great deficiencies, it would be impossible to reach a consensus about what the ideal programming language should look like, so eventually the team remains stuck with writing programs in the ugly standard manner.
- mtklein 17 hours ago
  I have never seen Codex or Claude get manual memory management wrong. I used to be pretty fastidious about using leak sanitizer or other such tools to catch my own memory management issues, and while not quite useless, that sort of testing has dropped way down my list of worries the more I lean on LLMs. I am constantly surprised by how many formerly tedious or error prone tasks stopped being either of those, and I expect to see practice shift away from middle-safe languages like C++ to not just much more safe languages like Rust but surprisingly also to much less safe ones like C and platform specific assembly.
  - gf000 13 hours ago
    The hard part was never getting it correct on a local scope, that's mostly solved by a linter, or even C++'s RAII will get it right.
    The hard part is doing it correctly on a global scope with non-trivial lifetimes, possibly influenced by multiple threads.
    And in my experience LLMs are still hit or miss on these kinds of problems: they can find problems from time to time, but they can't reliably reason about more complex global state. They will come up with "hypotheses" that 'oh sure this is the root cause of the issue' only to say something completely wrong (which you may notice or not, only to fail later).
- pmarreck 5 hours ago
  In Zig, pretty good, but not perfect yet
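The explicit reference counting discussed in this sub-thread can be sketched as a toy model. This is not adrian_b's implementation or any real library's API: `RefCounted`, `retain`, and `release` are hypothetical names, and "freeing" is simulated by recording the object's name. The last few lines show one well-known limitation of the naive scheme, namely that a cycle never reaches a count of zero.

```python
class RefCounted:
    """Toy object with an explicit count, mimicking manual retain/release."""
    freed = []  # names of objects whose count reached zero, in order

    def __init__(self, name, children=()):
        self.name = name
        self.count = 1  # the creator holds the first reference
        self.children = [retain(c) for c in children]


def retain(obj):
    """Take an additional reference to obj."""
    obj.count += 1
    return obj


def release(obj):
    """Drop one reference; on the last one, release children and 'free'."""
    obj.count -= 1
    if obj.count == 0:
        for child in obj.children:
            release(child)
        RefCounted.freed.append(obj.name)


leaf = RefCounted("leaf")
tree = RefCounted("tree", children=[leaf])  # tree now also owns leaf
release(leaf)  # creator's handle dropped; tree still keeps leaf alive
release(tree)  # last reference to tree: leaf is released, then tree

# A cycle: each object keeps the other's count above zero.
p = RefCounted("p")
q = RefCounted("q")
p.children.append(retain(q))
q.children.append(retain(p))
release(p)
release(q)
leaked = (p.count, q.count)  # neither count reached zero
```

No call site ever "frees" anything directly; reclamation is a side effect of the last `release`, which is the property the comments above describe.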
hamstergene 17 hours ago
What I didn’t like about this series of books was choosing “garbage collection” as the umbrella term for both tracing GC and reference counting, without verifying whether the programming community would agree with that; it turned out they didn’t.
I’ve seen a lot of threads here and on reddit where people were arguing about terminology purely because of this book alone.
By that definition, C++ code has garbage collection if it uses std::shared_ptr, going against widespread common usage of the term “garbage collected programming language” which specifically contrasts manual languages like C++ or Rust against garbage collected ones.
“Automatic Memory Management” is a much more suitable description of what programmers have to do to manage memory; it is now in the title but still hasn’t become the primary term.
- pron 17 hours ago
  > What I didn’t like about this series of books was choosing “garbage collection” as the umbrella term for both tracing GC and reference counting, without verifying whether the programming community would agree with that; it turned out they didn’t.
  This has been the standard terminology in memory management research for many decades. The only programmers who don't like it are those who don't understand the principles of memory management.
  > By that definition, C++ code has garbage collection if it uses std::shared_ptr
  That's right.
  > going against widespread common usage of the term “garbage collected programming language” which specifically contrasts manual languages like C++ or Rust against garbage collected ones.
  Since this contrast mostly exists in the minds of people who don't understand memory management, going against this common misconception is good. That's not to say that there aren't some interesting tradeoffs that often align with the colloquial perception, but "garbage collection" isn't the interesting part. As you said, both C++ and Rust use GC; in fact, they use a GC somewhat similar to the one used by CPython.
  - gdwatson 15 hours ago
    This reminds me a bit of the way academics in programming language theory internalized the type-theoretic definition of the word “type” over and against the traditional programming definition. You sometimes see people who try to correct the term “dynamically typed language,” which makes perfect sense when types are data types, to “untyped” or “unityped,” which makes sense when types are mathematical constructs equivalent to proofs.
    The colloquial term is clear in context, and it draws its boundaries in useful places. If academia prefers other boundaries to simplify its formal definitions, that’s understandable. But the rest of us shouldn’t restrict our language in that way.
    - pron 7 hours ago
      It's not about restricting the language. It's that practising programmers often don't know a subject well enough, so they use different words to make distinctions that don't matter as much as they think (see "transpile"). "Dynamically typed" is actually not that big of an offence (because the distinction is real, it's just that the terminology is a bit muddled), and the people in PL theory who are bothered by this (most notably one person) are considered pedants even among their colleagues.
      E.g. many practising programmers don't know that tracing moving collectors are used to avoid some of the high overheads associated with memory allocators (malloc/free), which are themselves big and complex beasts that make up substantial "runtimes" (another misused and misleading word).
    - gf000 13 hours ago
      I think GC's definition is pretty clear cut. How is counting references to determine when a lifetime ends materially different from another way of doing the same thing? Like there is even a paper that shows that one is tracking liveness, while the other tracks "deadness" and they are literally going at the same thing from different ends.
      If anything, I often see a bias against tracing GCs from the people misusing the term, to "hype up" their choice of language that it must be better for not having (tracing) GC, when it usually just has ref counting which in many metrics is actually worse, given equal usage -- rust/cpp gets away from that because they only use it on a handful of objects, other lifetimes being driven by RAII, which is pretty much just compile-time decidable ref counting?
      - pron 5 hours ago
        Right, and there are differences within tracing GCs that are just as big as between refcounting (and even manual malloc/free) and tracing. For example, Go uses tracing to determine when an object lifetime ends. But the moving collectors in Java, .NET, and V8 don't know and don't care when objects die, and they have no "free" operation at all. In many ways, the performance profile (of favouring smaller footprint or higher throughput) of memory management in C++, Rust, Python, and Go share more similarities among themselves than Java, .NET, V8, and Zig, which also share a more similar profile (arenas, like moving collectors, don't need or want to know when an object's lifetime ends).
        Another distinction without a difference that is really just giving a name to a misconception is the notion of "a runtime". When I learnt C in the late 80s or early 90s, the book said something like, "C is not just the language, but a rich runtime". Indeed, modern malloc/free implementations mean that a C program ends up needing a larger and more elaborate runtime than a program in some educational language that uses a trivial implementation of a mark-and-sweep collector. Modern malloc/free allocators also sometimes come with an impressive set of tuning knobs. It's just that people who haven't had a lot of experience writing large programs in low-level languages don't know about them (or they just work to avoid allocations as much as possible, because that's what they've been told to do).
      - hayley-patton 11 hours ago
        > Like there is even a paper that shows that one is tracking liveness, while the other tracks "deadness" and they are literally going at the same thing from different ends.
        https://dl.acm.org/doi/10.1145/1035292.1028982
      - gdwatson 11 hours ago
        I think a lot of people just want to be able to discuss different areas of the automatic memory management design space separately, and maintaining the distinction between reference counting and garbage collection (meaning tracing GCs) lets them do that.
        As for me personally, I consider refcounting and GC overlapping categories. I am perfectly willing to call CPython’s reference counting plus cycle collector a form of garbage collection, because it is transparent to the programmer. Every memory management technique has tradeoffs and pathological edge cases, but since you don’t have to consider them in the ordinary course of programming I’d say it counts. If you had to break cycles manually, or to annotate which references should be counted, I’d call that refcounting but not GC – as in the C++ stdlib.
        pron 7 hours ago
        > I think a lot of people just want to be able to discuss different areas of the automatic memory management design space separately, and maintaining the distinction between reference counting and garbage collection (meaning tracing GCs) lets them do that.
        The problem is that there are many differences in memory management techniques that offer different tradeoffs, and the difference between refcounting and tracing is not necessarily the biggest of them.
        For example, one of the most important distinctions in memory management is whether it optimises for footprint or speed (or some compromise), and the line isn't where people who don't understand memory management think it is. It can matter (often a great deal) whether you determine that an object is dead dynamically (say, by counting references) or statically (by manually writing free or by having the language track lifetimes), but it doesn't matter as much as whether or not the mechanism needs to know when objects are dead in the first place. So reference counting, manual free, static lifetimes, and even non-moving mark-and-sweep tracing collectors (like Go's) generally optimise for footprint at the expense of speed (although different allocators can have some control over that tradeoff), while arenas and tracing moving collectors optimise for speed at the expense of footprint (although here, too, they have some control over the tradeoff). So the line for this super-important tradeoff is between [manual, static, refcounting] and [arenas, moving tracing]; non-moving tracing collectors are somewhere in between but may be closer to the first group.
        People who don't understand memory management and may not have a lot of experience in low-level programming sometimes think that manual or statically-determined freeing must be fast because low-level languages, which inexperienced people think are fast, use them. In fact, low-level languages have some concerns that are much more important than speed and that preclude them from optimisations such as moving pointers. To get around that performance handicap, these languages try to avoid using their heap memory management as much as possible because they're using a rather slow technique because of their constraints.
        senderista an hour ago
        "Speed" is also ambiguous between latency and throughput. You seem to be using "speed" here as a synonym for throughput. Because of Little's Law, the memory consumed by deallocated objects is directly proportional to deallocation latency, so "low footprint" also generally means "low latency", while increasing throughput by amortizing deallocation overhead at the expense of latency increases memory usage for the same reason.
        pron 36 minutes ago
        > so "low footprint" also generally means "low latency"
        Not anymore.
        You're absolutely right that one of the reasons moving collectors were not used more widely was that, while their throughput was always very impressive, their latency wasn't that great, but that changed a few years ago.
        E.g. Generational ZGC in OpenJDK (released in September '23) introduces hiccups or "pauses" that are not dependent on the size of the liveset and are no larger than latency hiccups introduced by the OS (assuming no realtime kernel), i.e. <1ms (and typically <<1ms) up to heaps of 16TB. In fact, the latency can be smoother than approaches that have an explicit free operation and require maintaining a free list, as freeing a large object graph can be quite slow and occur in surprising places.
        So modern moving GCs no longer have a latency penalty, but this is newer than even ChatGPT.
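The Little's Law relation invoked in this exchange can be written out as a rough worked example. The symbols and numbers are illustrative assumptions, not measurements from the thread: L is the average amount of memory that is dead but not yet reclaimed, λ is the rate at which memory becomes garbage, and W is the average delay between an object dying and its memory being reclaimed.

```latex
L = \lambda W
\qquad\text{e.g.}\qquad
\lambda = 200\ \text{MB/s},\; W = 0.05\ \text{s} \;\Rightarrow\; L = 10\ \text{MB};
\qquad
W = 2\ \text{s} \;\Rightarrow\; L = 400\ \text{MB}.
```

Here W is the reclamation delay, which is a different quantity from the pause times that the reply about Generational ZGC describes.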
        cb321 8 hours ago
        I don't really disagree much with what you said. My favored PLang Nim (https://nim-lang.org/ -- it has both `ref` and `ptr` styles of pointer, one auto-managed, one manually managed) even changed a while back its `nim c --gc=x` command-line language to `nim c --mm=x`, and I was in favor of said change.
        However, it does inspire me to write: the kernel of all this terminology confusion is under-exposure of industrial programmers to not just academic terminology, but also the very design space you mention (which has always been nicely covered by Jones' outstanding book). Just to take an example from the root of this thread:
        >widespread common usage of the term “garbage collected programming language” which specifically contrasts manual languages like C++ or Rust against garbage collected ones
        Boehm-Weiser conservative collection for C, among the most manual languages of all, pre-dates its very first ANSI 1989 standard.
        This underexposure itself is downstream of the kinds of oversimplifications/lies of marketing and in this particular case came from Java. The evolution I witnessed was roughly 1) linking Boehm with -lgc and deleting (or #define'ing away) all your `free()` calls is conservative - to be precise you need compiler aid and a lot of programmers are "not perfect==awful" personality types, 2) Sun Microsystems wants to leverage a lot of reliability issues with C code and become The Platform and spends gobs of money to win hearts & minds, partly succeeding, 3) part of its ad-warfare against the then Wintel hegemony and/or tutorials/introductory material for Junior Programmers (often the target of "be more reliable" material) plays fast & loose with GC terminology because marketing plays fast & loose structurally for fun but mostly profit, 4) because human language really does == language usage a la Quine, everyone in the industry re-defines what "GC" means to bind it to a programming language instead of to a specific run-time, 5) industry & academics use different language, confusion ensues and so here we are.
        This is not even the 100th time that either explicit or implicit forces of marketing have achieved confusion analogously to this. If you believe most people don't need much of what they spend on then confusion is arguably intrinsic to marketing of ideas/products. The highly misleading but suggestive metaphorical language used all over "AI" in both research and in product-lines is a more current case of this, leading anyone who knows much to have to qualify "not AGI" or other such junk just to have a conversation.
        So, what is my point? Basically just that the larger problem here will persist as long as there is money to be made/attention to be garnered by sowing confusion/having people talk past each other/think some product is more than it really is. I have no meta-strategy in my back pocket to block these successful confusions, but it does seem worth being aware of it.
      - deliciousturkey 7 hours ago
        By that definition even C has garbage collection. Automatic storage duration types have compiler-determined lifetime and automatic deallocation.
        If the definition of a word/concept does not match how the word is used in real life, the definition is wrong. After all, semantics is about common understanding of concepts. If your definition of a word doesn't match how it's used, using that definition is not beneficial to use.
        gf000 7 hours ago
        Well yeah, stack variables are automatically reclaimed. What's your issue?
        It's just that this is not the predominant way C programs are written and for everything else you do need to somehow manage the memory, malloced objects would otherwise just leak. What exactly is the issue, the real life use of C requires manually adding free calls, is it not? So it doesn't do automatic memory management for you.
        deliciousturkey 7 hours ago
        The term "garbage collection" does not mean that the language has some mechanism of automatically reclaiming some memory. If it did, C would be a garbage-collected language. The term is not used in such a way.
        Now, of course reference counting can be used as a part of a garbage collector. But that doesn't mean any language that allows you to implement reference counting as a library, is a garbage-collected language.
        gf000 6 hours ago
        Yeah, and?
        We are in agreement here, C++ is not a GCd language. What I (we) claim is that reference counting is a GC, that's it. A language that uses RC 100% would be a GCd language, like Python (okay, it does have a tracing GC to collect cycles as well). C++/Rust has the necessary language primitives to express reference counting as a library, but that's an optional thing, usually applied only to a select few objects. That's a bit like how Java can also just allocate a byte buffer and do manual memory management; neither makes a language GCd/manual in and of itself.
        pron 4 hours ago
        > But that doesn't mean any language that allows you to implement reference counting as a library, is a garbage-collected language.
        The concept of "a garbage-collected language" is not well-defined. There are languages, like Java, Rust, and Python that depend on a garbage collection mechanism, and languages like C, C++ and Zig, which don't. C++ happens to offer a GC in its standard library, however.
        That "working developers" use some other terminology is not what matters. What matters is whether the terminology they're using expresses important distinctions or not (and may, in fact, express misconceptions about distinctions). In the case of memory management (as in the case of "transpiles", although there the damage isn't as high), the colloquial terminology is misleading as it is used to hint at distinctions (such as about performance) which are simply not there. E.g. moving GCs are used to avoid the performance overheads of malloc/free, especially in large and/or concurrent programs. This performance overhead that C and C++ suffer from is well known to experienced low-level developers (which is partly why large programs that benefit from moving collectors are relatively rarely written in such languages anymore), but now the terminology is used as a cargo cult, which leads to conclusions that are sometimes the very opposite of what's really going on.
        senderista an hour ago
        I think it would be a salutary experience for every C/C++ programmer to write a decently-performing allocator so they could appreciate the complexity and overhead required to avoid excessive fragmentation (especially in long-running programs with long-lived allocations and irregular deallocations), given the constraint of address stability.
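The point made earlier in this sub-thread, that arenas do not need to know when an individual object dies, can be illustrated with a toy bump allocator. This is a sketch under stated assumptions: `Arena`, `alloc`, and `reset` are hypothetical names, and offsets into a `bytearray` stand in for pointers.

```python
class Arena:
    """Toy bump allocator: allocation advances an offset, nothing is freed
    individually, and reset() reclaims everything at once."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.top = 0  # offset of the next free byte

    def alloc(self, size):
        if self.top + size > len(self.buf):
            raise MemoryError("arena exhausted")
        offset = self.top
        self.top += size  # the whole cost of an allocation
        return offset

    def reset(self):
        # No per-object bookkeeping: every allocation dies together.
        self.top = 0


arena = Arena(64)
first = arena.alloc(16)
second = arena.alloc(16)
used_before_reset = arena.top
arena.reset()
used_after_reset = arena.top
```

The tradeoff described above shows up directly: allocation is a bounds check and an addition, but memory used by objects that are already dead stays occupied until `reset()`.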
- mschuetz 4 hours ago
  I've always considered shared_ptr to be semi-garbage collection. Allows me to code C++ almost as if it were Java so long as circular references are avoided. I'm perfectly fine with it being considered a type of garbage collection.
  - Jtsummers 3 hours ago
    > so long as circular references are avoided
    And there's always weak_ptr if a cycle makes sense for some reason but you still want it to clean up correctly. Like having a child node point up to a parent or the root in a tree structure.
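The parent-pointer pattern described above can be sketched in Python, whose `weakref.ref` plays a role analogous to `std::weak_ptr`. This is an analogue rather than C++ code: `TreeNode` is a hypothetical class, and the immediate reclamation after `del root` assumes CPython's reference counting.

```python
import weakref


class TreeNode:
    """Tree node with strong references down and a weak reference up."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []  # strong: a parent keeps its children alive
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        # Returns None for the root or once the parent has been reclaimed.
        return self._parent() if self._parent is not None else None


root = TreeNode("root")
leaf = TreeNode("leaf", parent=root)
parent_name = leaf.parent.name  # the weak reference still resolves

del root                     # the only strong reference to the root is gone
parent_after = leaf.parent   # on CPython the root was freed immediately
```

Because the upward link is weak, the child never keeps its parent alive, so no strong reference cycle forms.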
- trumpdong 17 hours ago
  The Linux kernel has garbage collection, and not just the controversial refcount kind.
  - yencabulator 12 minutes ago
    I'll go further. Linux heavily uses a form of garbage collection that cannot be implemented in typical userspace (without awkward & slower additions to the consistency algorithm).
    https://en.wikipedia.org/wiki/Read-copy-update
- kibwen 5 hours ago
  Instead of "garbage collection", you can say "dynamic lifetime determination". If code does work at runtime to answer the question "is it safe to free this piece of memory?", that's dynamic lifetime determination, and is a property shared by both reference-counting and more sophisticated GC schemes.
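Both kinds of run-time work described above can be observed in CPython, which several comments in this thread mention as combining reference counting with a cycle collector. The sketch below assumes CPython specifically; `Node` is a hypothetical class, and weak references are used only to observe whether an object has been reclaimed.

```python
import gc
import weakref


class Node:
    """Plain object that can hold a reference to another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the cycle collector from running on its own schedule
try:
    # Case 1: no cycle. Dropping the last reference reclaims the object.
    a = Node()
    a_probe = weakref.ref(a)
    del a
    freed_by_refcount = a_probe() is None

    # Case 2: two objects that reference each other.
    x, y = Node(), Node()
    x.other, y.other = y, x
    x_probe = weakref.ref(x)
    del x, y
    survived_refcount = x_probe() is not None  # counts never reach zero

    gc.collect()  # the tracing pass finds the unreachable cycle
    freed_by_tracing = x_probe() is None
finally:
    gc.enable()
```

In both cases the runtime, not the programmer, answers the question of whether the memory is safe to free; only the mechanism that answers it differs.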