I loved this clever, weird, awesome paper, so a short summary.
In many dynamic languages some values are stored on the heap ("boxed") and represented as a pointer, while others are represented as an immediate value ("unboxed" or "immediate"). Pointer tagging is a common way to do that: the low bits of the value tell you its type, and some types are immediate while others are boxed.
Naturally, the tag bits have a fixed value, so can't be used to store data. So for example your language might offer 61-bit integer immediates instead of 64-bit integers; the other three bits are used for tags. Possibly, larger integers are stored on the heap and treated as a different type (for example Python 2.X had separate int and long types for these cases).
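A minimal C sketch of this kind of low-bit tagging (an assumed 3-bit scheme for illustration, not any particular runtime's):

    #include <stdint.h>

    /* Assumed encoding for illustration: low 3 bits are the tag.
       Tag 0b000 = heap pointer (8-byte aligned, so the low bits are free),
       tag 0b001 = 61-bit integer immediate stored in the upper bits. */
    typedef uint64_t value_t;

    #define TAG_MASK 0x7ULL
    #define TAG_PTR  0x0ULL
    #define TAG_INT  0x1ULL

    static inline value_t make_int(int64_t i)  { return ((uint64_t)i << 3) | TAG_INT; }
    static inline int64_t int_value(value_t v) { return (int64_t)v >> 3; } /* arithmetic shift restores the sign */
    static inline int    is_int(value_t v)    { return (v & TAG_MASK) == TAG_INT; }
    static inline void  *ptr_value(value_t v) { return (void *)(uintptr_t)(v & ~TAG_MASK); }

Because heap objects are 8-byte aligned, a pointer's low three bits are already zero, so the pointer case needs no shifting at all.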
However, it's hard to use this strategy for floats, because floats need all 64 bits (or 32 bits for single-precision, same difference). There's a trick called "NaN boxing" which makes use of the large number of NaNs in the float representation, but read the paper if you want more on that.
The authors' insight: suppose you have a three-bit tag and 011 is the tag for floats. By totally random chance, _some_ floats will end in 011; you can represent those as immediates with those tag bits. Obviously, that's unlikely for any given float, though you can raise the chances by using, like, 010, 011, 100, and 101 all as float tags. Still, the low bits are a bad choice. But what about the high bits? Most floats have one of a couple of high bit patterns, because most floats are either 0 or between, say, 1e-100 and 1e100. Floats outside that range can be boxed, but since they're really rare it's not a big cost to box them.
So basically, we use high bits as our tag bits and map all the common float prefixes to float tags. This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.
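To make the trick concrete, here is a hedged C sketch under an assumed encoding (top 3 bits as the tag, two tag values reserved for floats; not the paper's exact scheme, and box_double is a hypothetical helper):

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    typedef uint64_t value_t;

    value_t box_double(double d); /* hypothetical: heap-allocate and return a tagged pointer */

    static inline uint64_t bits_of(double d) {
        uint64_t u;
        memcpy(&u, &d, sizeof u); /* bit-cast without aliasing violations */
        return u;
    }

    /* Assume tags 0b001 and 0b010 are reserved for immediate floats.
       A double's top 3 bits are its sign bit plus the top 2 exponent
       bits, so these two tags cover positive doubles with magnitude
       roughly in [2^-511, 2^513); zero, negatives, and extreme
       magnitudes fall through to boxing in this toy version. */
    static inline bool fits_immediate(double d) {
        uint64_t tag = bits_of(d) >> 61;
        return tag == 0x1 || tag == 0x2;
    }

    static inline value_t make_float(double d) {
        return fits_immediate(d) ? bits_of(d) : box_double(d);
    }

The paper's actual encoding differs (it rotates bits, per a comment below), but the shape of the check is the same: one mask-and-compare on the raw bits decides immediate vs. boxed.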
A personal note: I've been working in numerics and floating-point for a decade now and have had to deal with float boxing from a research point of view (lots of runtime analysis systems for floats), from a user point of view (using unboxed float vectors for significant speedup in my own software), and from a teaching point of view (discussing boxing in my compilers class, using NaN-boxing as an example of cleverness).
This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.
> This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.
Thanks for the nice summary -- looking forward to reading the paper!
The same idea of self-tagging is actually also used in the Koka language's [1] runtime system, where by default the Koka compiler only heap-allocates float64's whose absolute value falls outside the range [2^-511, 2^512) and that are not 0, infinity, or NaN (see [2]). This indeed saves many (many!) heap allocations for float-intensive programs.
Since Koka only uses 1 bit to distinguish pointers from values, another slightly faster option is to only box negative float64's, but of course negative numbers are still quite common, so in general that saves fewer allocations.
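For a flavor of the range check described above, a hedged C sketch (not Koka's actual box.c code -- see [2] for the real thing):

    #include <math.h>
    #include <stdbool.h>

    /* Keep a float64 unboxed iff it is 0, infinity, NaN, or its
       absolute value lies in [2^-511, 2^512); box everything else. */
    static bool can_unbox_f64(double d) {
        double a = fabs(d);
        return d == 0.0 || isinf(d) || isnan(d)
            || (a >= 0x1p-511 && a < 0x1p512);
    }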
[1] https://koka-lang.github.io/koka/doc/book.html#sec-value-typ...
[2] https://github.com/koka-lang/koka/blob/dev/kklib/src/box.c#L...
ps. If you enjoy reading about tagging, I recently wrote a note on efficiently supporting seamless large integer arithmetic (as used in Koka as well) that discusses how certain hardware instructions could really help speed this up [3]:
[3] https://www.microsoft.com/en-us/research/uploads/prod/2022/0... (ML workshop 2022)
Nice explanation, but it took me a while to understand the trick: they are hiding the tag in the "exponent" of the float, not in the "mantissa"!
Do all float operations need to reconfirm those bits afterwards though? I suppose if you have some sort of JIT you can end up with a bunch of unboxed floats and would only pay the cost on boundaries though
> reconfirm those bits afterwards
Thanks - I hadn't thought about that but it seems to be the main downside of this approach. The benefit of NaN-boxing is that it reassigns values that are otherwise unused - floating-point calculations will never generate NaNs with those bit patterns.
An additional wrinkle is that NaNs are a bit unstable and can have large performance penalties. You can't ever let the NaNs escape into arithmetic, and you may even have issues just storing them in a register.
Yes, but there should be some optimisation opportunities.
Off the top of my head: any multiply by a constant less than 1.0 will never overflow the unboxed range (though it might underflow), and there should be times when it's provably better to check that the inputs are inside a range rather than checking the outputs.
It's worth pointing out that these overflow/underflow checks will be very cold (on typical code). They won't waste much in the way of branch-prediction resources.
I wonder if it's worth taking advantage of floating point overflow/underflow exceptions. I think a multiplication by 2^767 will trigger an exception if the value would overflow, and the corresponding multiply by 2^-765 will catch underflows.
It's tempting to allocate two more tags for floats (001 and 010), covering the entire range from -2^257 to +2^257. It will be rare to actually see those small floats near zero, but it could be worth eliminating the possibility of underflows.
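If one went down that road, the probe might look something like this hedged C sketch (constants taken from the comment above; whether the fenv round-trip is actually faster would need measuring):

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON /* we inspect the floating-point flags */

    /* Probe whether x falls outside an assumed unboxable range
       (magnitudes in [2^-257, 2^257)) by letting the hardware raise
       the overflow/underflow flags. Caveat: exactly-representable
       subnormal results may not raise FE_UNDERFLOW, so a real
       implementation would need more care. */
    static int needs_boxing(double x) {
        feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
        volatile double probe;
        probe = x * 0x1p767;  /* overflows iff |x| >= 2^257 */
        probe = x * 0x1p-765; /* underflows iff |x| is tiny */
        (void)probe;
        return fetestexcept(FE_OVERFLOW | FE_UNDERFLOW) != 0;
    }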
You check the tag before doing float operations
Only when they have to be boxed, but yes, if that's what you're talking about.
You need to check after the floating-point operation, though, just in case -- or at the boundary where you pass the float to something else expecting this scheme.
> This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.
NaN-boxing allows all floats to be unboxed though. The main benefit of the self-tagging approach seems to be that by boxing some floats, we can make space for 64-bit pointers which are too large for NaN-boxing.
The surprising part of the paper is that "some floats" is only a small minority of values - not, say, 50% of them.
A small minority, but apparently it includes all the floats you’re likely to use. It seems the insight is that you only need 8 bits of exponent in most cases. (And single-precision floating point only has 8 bits of exponent.)
Most double-precision floats are never used because they have high exponents.
50% means you only get 1 tag bit.
Also, you totally can fit 64-bit pointers inside a NaN: the pointers are only 48 bits and you have 53 bits of NaN payload. (You could also get an extra 3 bits if you only allow storing 8-byte-aligned pointers unboxed.)
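A hedged sketch of that style of NaN-boxing (assumed masks; real engines differ in the details):

    #include <stdint.h>

    /* Quiet-NaN pattern: exponent all ones plus the quiet bit, which
       leaves the low bits free as payload; stash a 48-bit pointer there. */
    #define QNAN_BITS    0x7ff8000000000000ULL
    #define PAYLOAD_MASK 0x0000ffffffffffffULL /* low 48 bits */

    static inline uint64_t box_ptr(void *p) {
        return QNAN_BITS | ((uint64_t)(uintptr_t)p & PAYLOAD_MASK);
    }

    static inline void *unbox_ptr(uint64_t v) {
        return (void *)(uintptr_t)(v & PAYLOAD_MASK);
    }

    static inline int is_boxed_ptr(uint64_t v) {
        /* a real scheme must also keep genuine NaNs canonical so they
           never collide with boxed payloads */
        return (v & QNAN_BITS) == QNAN_BITS && (v & PAYLOAD_MASK) != 0;
    }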
> 50% means you only get 1 tag bit.
That's enough to distinguish between "unboxed float" and "something else", where the latter can have additional tag bits.
> [64-bit] pointers are only 48 bits and you have 53 bits of NaN payload.
The paper specifically talks about support for "high memory addresses that do not fit in 48 bits". If you don't have to handle those high addresses, I don't think this approach has any benefits compared to NaN-boxing.
It’s clever, but not random chance. That would be too much of a coincidence. They rotate the floats to make it happen the way they want.
It’s hardly random that only 8 bits of exponent are needed for many calculations.
Thank you for the clear explanation!
Already linked yesterday, btw, with a comment by one of the authors:
https://news.ycombinator.com/item?id=42249609
> For instance, NaN-tagging prevents (or largely complicates) optimizations relying on stack allocations. The stack uses high memory addresses that do not fit in 48 bits unless encoded relative to the location of the stack segment.
Er, what? The paper says they tested on a Xeon CPU, so x86-64, running Linux. On traditional x86-64, all pointers fit in 48 bits, period. Stack memory is no exception. More recently the architecture was extended with 5-level paging to allow 57-bit addresses, but my impression is that Linux (like other OSes) keeps them disabled by default in userspace. According to the documentation [1]:
> Not all user space is ready to handle wide addresses. [..] To mitigate this, we are not going to allocate virtual address space above 47-bit by default.
So how would the stack end up above 47 bits? Is the documentation out of date?
[1] https://docs.kernel.org/arch/x86/x86_64/5level-paging.html
The idea in the paper is really cool.
People who enjoyed this might also like to read how Apple used tagged pointers for short strings in Objective-C [0]. I think that's when I first learned about tagged pointers. NaN-boxing was mindblowing for me. I love this kind of stuff.
[0] https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...
Another cool thing that seems related: exploiting alignment to free up N bits in a 'pointer' representation, because your values have to be aligned. The JVM does this to expand the set of possible addresses representable in 32 bits: https://shipilev.net/jvm/anatomy-quarks/23-compressed-refere...
So, for example, with 3 bits of alignment required, the first valid address for a pointer to point to after 0x0 is 0x8, and after that is 0x10, but you represent those as 0x1 and 0x2 respectively, and use a shift to get back the actual address (0x1 << 3 = 0x8, actual address). I think this is gestured at in section 1.1 of the paper, sort of, except they envision using the space thus freed for tags, rather than additional bits. (Which only makes sense if your address is 32 bits anyway, rather than 64 as in the paper: no one has 67-bit addresses. So saving 3 bits doesn't buy you anything. I think.)
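A tiny sketch of that compression trick (assuming 8-byte alignment and a hypothetical heap_base address):

    #include <stdint.h>

    /* With 8-byte alignment the low 3 bits of every object address are
       zero, so a 32-bit compressed reference can span 2^35 bytes = 32 GiB
       of heap. heap_base is a hypothetical start-of-heap address. */
    static inline uint32_t compress_ref(uintptr_t heap_base, void *p) {
        return (uint32_t)(((uintptr_t)p - heap_base) >> 3);
    }

    static inline void *decompress_ref(uintptr_t heap_base, uint32_t ref) {
        return (void *)(heap_base + ((uintptr_t)ref << 3));
    }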
> Aligning all heap-allocated values to 64-bit machine words conveniently frees the low bits of pointers to store a 3-bit tag.
It's interesting which runtimes exploit the extra space for what reasons! Definitely makes more sense to have the extra address space on 32 bits compared to 64. I wonder if the extra addresses are specific to JVM / not something that works well in the C family?
Well, in C you have non-aligned pointers, because you can have pointers to things that aren't whole objects and might not be aligned (e.g. individual chars or shorts). In Java everything is at least 8-byte-aligned: you can't store a loose char/short/int on the heap (it has to go in a boxed object that's 8-byte-aligned, though the compiler will do this semi-automatically), and you can't take a pointer to an individual element of an array.
If you applied the C standard strictly, you could use a JVM-style representation for pointers to longs, pointers, and structs that start with longs or pointers, so you could theoretically have an implementation where those pointers were shorter. But you'd have to convert back and forth when casting to and from void* (and char*), and in practice C people expect to be able to cast a long* to int*, cast that to void*, and get the same result as casting the long* to void* directly, even though doing that and using the result is undefined behaviour according to the standard.