The people who came up with this are obviously brilliant, but being French myself, I really wonder why no one is proof-reading the English; this gives an overall bad impression of the work, imho.
Being a native English speaker I absolutely love reading and listening to speakers of English as a second language. Speaking is actually a subspecies of singing, and it's always cool to hear the same old lyrics remixed to a new melody and a new beat.
English has no 'correct' way to be written or spoken, nor does it need one, nor would it benefit from one, therefore, nor should it have one.
Speakers of English as a second language: you are what makes English a great language.
> English has no 'correct' way to be written or spoken, nor does it need one, nor would it benefit from one, therefore, nor should it have one.
There may be no 'correct' way, but there are plenty of 'incomprehensible' ways. I once encountered a research paper that had clearly [0] been translated word-for-word from French into English and made no sense until I translated it word-for-word back to French...
[0]: actually it was only clear after I realised I should attempt the reverse translation ;)
Sure, but frankly, I've heard plenty of people speaking the most flawless King's English who didn't make any sense at all.
re: translated math papers: haha, we've all been there. Once I had to read a bunch of '70s-era papers from Russian mathematicians. The translators, bless their hearts, I'm sure knew everything there was to know about Dickens and Dostoevsky, but it was clear they had no clue what the math was all about :-)
Oh well, Math is the universal language, right? chuckle
That's a beautiful way of seeing things! Unfortunately, as you're well aware I'm sure, most people do not share your idyllic view of polyglots and, for better or worse, they will assume that bad english = bad quality work. And bad doesn't have to mean mistakes. Just an unusual wording is enough to throw the average person off, in my experience.
I'm not as worried about those who have ears, but don't hear, as I am about the effect LLMs will have on English.
Grammarly was bad enough. One of my oldest friends is from Transylvania, and he could tell such great stories in his Eastern European accent and cadence. When he collected those stories into a book, he ran everything through Grammarly, and the book reads like a soulless newscaster ;-(
When people start running their prose through LLMs en masse to "correct" it, English will lose one of its main arteries.
This seems poorly grounded. In fact almost three decades after the release of the Java HotSpot runtime we're still waiting for even one system to produce the promised advantages. I guess consensus is that V8 has come closest?
But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
What makes this very complicated is that 1) language design plays a big part in performance and 2) CPUs change as well and this anecdotally seems to have more impact on interpreter than compiler performance.
With regards to 1), consider optimizing Javascript. It doesn't have machine integers, so you have to do a bunch of analysis to figure when something is being used as an integer and then you can make that code fast. There are many other cases. Python is even worse in this regard. In comparison AOT compiled languages are usually designed to be fast, so they make tradeoffs that favour performance at the cost of some level of abstraction / expressivity. The JVM is somewhere in the middle, and so is its performance.
> you have to do a bunch of analysis to figure when something is being used as an integer and then you can make that code fast
It doesn't get much attention now that WASM exists, but asm.js essentially solves this, so a more head-to-head comparison ought to be possible. (V8 has optimisations specific to asm.js.)
With all respect that sounds like excuse-making. I mean, yeah, Javascript and JVM and .NET are slower runtimes than C or Rust[1]. Nonetheless that's the world we live in, and if you have a performance-sensitive problem to solve you pick up rustc or g++ and not a managed runtime. If that's wrong, someone's got to actually show that it's wrong.
[1] Maybe Go or Swift would be more apples-to-apples. But even then are there clear benchmarks showing Kotlin or C# beating similar AoT code? If anything the general sense of the community is that Go is faster than Java.
Excuses for what? I'm not the elected representative for JIT compiled languages, sworn to defend them. There are technical reasons they tend to be slower. I was sketching some of them.
I think the above comments are because JIT gets so much positive press, someone wandering in from outside could be mistaken for thinking that JIT isn't coming 2nd in a two-man race with AOT.
I've been around long enough to hear that Java and JIT are gonna overtake C++ any day now.
When things are performance-sensitive, you want things to be tunable and predictable. Good luck playing with the JIT if you rely on that for performance...
> But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
It's considerably more complicated than that. After working in this area for 25 years, I have vacillated between extremes over decades-long arcs. The reality is much more nuanced than a four sentence HN comment. Profile and measure and stare at machine code. If you don't do that daily, it's hand waving and having hunches.
I'd also point out that it's an ever-shifting landscape. What was slow yesterday might not be today.
In my experience, while there are some negatives of the runtime selected, the vast majority of performance is won or lost at the algorithm level. It really doesn't matter that rust can be faster than ruby if you chose an O(n^3) algorithm. Rust will run the O(n^3) algorithm faster than ruby, for sure, but ruby will beat the pants off of rust if someone converts it into an O(n) algorithm.
It only starts mattering once you already have an O(n) algorithm. However, in my experience, a LOT of programmers are happy writing an n^3 and moving on to the next task without considering what this will do.
    for (i : foo) {
        for (j : foo) {
            for (k : foo) {
                bar(i, j, k);
            }
        }
    }
If the code runs 100 times faster, it might just offset even a highly inefficient implementation.
> a LOT of programmers are happy writing a n^3
I have the same experience.
Unfortunately, and this is an issue I keep fighting with in some .NET communities, languages like C, C++ and Rust tend to select for engineers who are more likely to care about writing reasonably efficient implementations.
At the same time, higher-level languages sometimes can almost encourage the blindness to the real world model of computation, the execution implications be damned. In such languages you will encounter way more people who will write O(n^3) algorithm and will fight you tooth and nail to keep it that way because they have zero understanding of the fundamentals, wasting the heroic effort by the runtime/compiler to keep it running acceptably well.
The tests on this website run for very little time indeed. They use input values that e.g. the original Benchmarks Game suggests for validation before running for a longer time to get actual performance (another case in point - surely you want to run a web server for longer than a couple hundred milliseconds). In my experience the data there does not always replicate to what you get in real world scenarios. It's an unfortunate tradeoff, because the benchmark runs will take a very long time when you want to support so many languages, but in my opinion it's better to have numbers that are useful for making informed decisions over pure quantity.
If you have something specific in mind, it can be more interesting to build and measure the exact scenario you’d like to know about (standard caveats to benchmarking properly apply), which is quite easier if you have, say, just two languages.
> At the same time, higher-level languages sometimes can almost encourage the blindness to the real world model of computation, the execution implications be damned. In such languages you will encounter way more people who will write O(n^3) algorithm and will fight you tooth and nail to keep it that way because they have zero understanding of the fundamentals, wasting the heroic effort by the runtime/compiler to keep it running acceptably well.
I would say this tracks. I spent some time doing research on JVMs and found that, for example, the Java community largely values building OO abstractions around program logic and structuring things in ways that generally require more runtime logic and safety checks. For example, Java generics are erased and replaced with casts in the bytecode. The JVM has to blindly perform those checks in the interpreter and any lower compiler tiers that don't inline. Only when you get to the opt tiers does the compiler start to inline enough to see enough context to statically eliminate these checks.
Of course Java hides these checks because they should never fail, so it's easy to forget they are there. As API designers and budding library writers, Java programmers learn to use these abstractions, like the nicety of generics, in order to make things more general and usable. That's the higher priority, and when the decision criteria come down to performance versus reuse, programmers choose reuse all the time.
> that generally require more runtime logic and safety checks.
These safety checks and runtime logic are a constant factor in the performance of a given java application.
Further, they are mostly minuscule compared to other things you are paying for by using Java. The class check requires loading the object from main memory/CPU cache, but the actual check is a single-cycle cmp. Considering that that object will then be immediately used by the following code (hence warm in cache), the price really isn't comparable to the already existing overhead of reaching down into RAM to fetch it.
I won't say there aren't algorithms that will suffer, particularly if you are doing really heavy data crunching that extra check can be somewhat murder. However, in the very grand scheme of things, it's nothing compared to all the memory loading that goes on in a typical java application.
That is to say, the extra class cast on an `ArrayList<Point>` is nothing compared to the cost of the memory lookups when you do
    int sum = 0;
    for (var point : points) {
        sum += point.x + point.y + point.z;
    }
> The class check requires loading the object from main memory/cpu cache but the actual check is a single cycle cmp check.
Only a guard or, possibly, a final class type-check (at least it's the case for sealed classes or exact type comparisons in .NET). For anything else this will be more involved due to inheritance.
Obviously for any length above ~3 this won't dominate but JVM type system defaults don't make all this any easier.
> If the code runs 100 times faster, it might just offset even highly inefficient implementation.
That's the danger of algorithmic complexity. 100 is a constant factor. As n grows, the effects of that constant factor are overwhelmed by the algorithmic inefficiency. For something like an n^3, it really doesn't take long before the algorithm dominates the performance over any language considerations.
To put it in perspective, if the rust n^3 algorithm is 100x faster than the ruby O(n) algorithm at n=10, it only takes until around n=100 before ruby ends up faster than rust.
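A quick sanity check on that crossover in Python (the constants are made up to match the "100x faster at n=10" premise; a sketch, not a benchmark):

    # ruby runs the O(n) algorithm, rust runs the O(n^3) one.
    # Premise: at n=10 the rust version is 100x faster, i.e. b*10 == 100 * (a*10**3).
    a = 1.0           # rust's per-step cost (arbitrary units)
    b = 10_000 * a    # ruby's per-step cost implied by the premise
    for n in (10, 50, 100, 200):
        print(n, a * n**3, b * n)   # rust wins below n=100, ruby wins above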
For the most part, the runtime complexity of languages is a relatively fixed factor. That's why algorithmic complexity ends up being extremely important, more so than the language choice.
I used to not think this way, but the more I've dealt with performance tuning the more I've come to realize the wisdom of Big Oh in day to day programming. Too many devs will justify an O(n^2) algorithm as being "simple" even though the O(n) algorithm is often just adding a new hashtable to the mix.
JVM implementations, especially those with a PGO feedback loop across runs, do quite well.
Likewise, modern Android runs reasonably well with its mix of JIT, AOT with JIT PGO metadata, and baseline profiles shared across devices via the Play Store.
The gold standard for anyone who actually cares about ultimate performance is hand-written Assembly, naturally guided with a profiler capable of measuring everything the CPU is doing, like VTune.
> We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Maybe the "allocate as little as possible, use sun.misc.Unsafe a lot, have lots of long-lived global arrays" style of Java programming some high-performance Java programs use would get close to being a good stand-in.
I'm pretty sure the major penalty is the lack of inline objects (thus requiring lots of pointer-chasing), rather than GC. GC will give you unpredictable performance but allocation has a penalty regardless of approach.
For purely array-based code, JIT is the only factor and Java can seriously compete with C/C++. It's impossible to be competitive with idiomatic Java code though.
C# has structs (value classes) if you bother to use them. Java has something allegedly similar with Project Valhalla, but my observation indicates they completely misunderstand the problem and their solution is worthless.
Inline objects is a huge hit that hopefully gets solved soon.
But I'd posit that one programming pattern enabled by a GC is concurrent programming. Java can happily create a bunch of promises/futures, throw them at a thread pool and let that be crunched without worrying about the lifetimes of stuff sent in or returned from these futures.
For single threaded stuff, C probably has java beat on memory and runtime. However, for multithreading it's simply easier to crank out correct threaded code in Java than it is in C.
IMO, this is what has made Go so appealing. Go doesn't produce the fastest binaries on the planet, but it does have nice concurrency primitives and a GC that makes highly parallel processes easy.
I am extremely skeptical of any "concurrency made easy" claims. Rust has probably the best claim in that area but it's still pretty limited, and comes at the cost of making it hard to write normal code.
I wouldn't (and didn't) say "easy" just "easier". The thing that makes rust concurrency so gnarly to work with is the lifetime battles you have to do in order to make it work. That's still better than C/C++ because you aren't dealing with accidental memory corruption when the wrong thread frees memory at the wrong time.
For languages like rust/C/C++, thread safe data structures are VERY hard to pull off. That's because tracking the lifetime of things tracked by the data structures introduces all sorts of heartburn.
What GCed languages buy you is not needing to track those lifetimes. Yes, you can still have data races and shared memory mutation problems, but you can also write thread safe data structures like caches without the herculean efforts needed to communicate with users of the cache who owns what when and when that thing dies.
The best that Rust and C++ can do to solve these problems is ARC and a LOT of copying.
> Java has something allegedly similar with Project Valhalla, but my observation indicates they completely misunderstand the problem and their solution is worthless.
Hahah spicy take, I'd be interested to hear more. It definitely might not bode well that they opened the "Generics Reification" talk at JVMLS 2024 with "we have no answers, only problems."
I'm not going to investigate it again, there was probably more than this. But from what I recall:
* The compiler isn't actually guaranteed to store them by value at all. Basically, they're written to be an "optional extension" rather than a first-class feature in their own right.
* Everything is forced to be immutable, so you can't actually write most of the code that would take advantage of value types in the first place. Hot take: functional programming is mainly a bad workaround for languages that don't support value types in the first place.
The immutable thing is actually being sold as a strength, i.e. "you write your nice clean immutable code, and if you've tagged it as a value type or flattenable, the compiler will figure out it doesn't need a new allocation and will update the existing value inline." I think they see it as in keeping with the Java culture of "you get very good performance for straightforward code" but I definitely agree there's a hazard of introducing an unnecessary impedance mismatch.
It will be a lot of work for the compiler to unspill modifications on any non-trivial data structure and reduce register pressure, especially since it's Java's first foray into structs :)
(I suppose if the list of things you can do with structs is very short, this will be nowhere near as useful but will also reduce the amount of compiler changes)
The whole point is to introduce value types without a .NET Framework vs .NET Core schism.
Random jars taken out of Maven central should be able to continue to execute in a Valhalla-enabled JVM, without changes in their original semantics, while at the same time being able to somewhat take advantage of the Valhalla world.
Naturally there is always the issue of APIs that no longer exist like Thread.stop(), but that is orthogonal to the idea to have binary libraries keep working in a new value aware world.
There are tons of compiler changes, minimal semantic changes and keeping bytecode ABI as much as possible is the engineering challenge.
> Is there a JITing C compiler, or something like that?
Yes, for example, compiling C to JavaScript (or asm.js, etc. [0]) leads to the C code being JITed.
And yes, there are definitely benchmarks where this is actually faster. Any time that a typical C compiler can't see that inlining makes sense is such an opportunity, as the JIT compiler sees the runtime behavior. The speedup can be very large. However, in practice, most codebases get inlined well using clang/gcc/etc., leaving few such opportunities.
[0] This may also happen when compiling C to WebAssembly, but it depends on whether the wasm runtime does JIT optimizations - many do not and instead focus on static optimizations, for simplicity.
I love how folks worship the compiler extensions of GCC and clang, and of C and C++ or UNIX compiler vendors in general, including embedded RTOS toolchains, but when Microsoft does it, for whatever reason it doesn't count.
I certainly don't "worship" any compiler, and am pretty quick to point out non-standard extensions in people's code. But C++/CLI goes far, far beyond extensions, and becomes a completely different language to C++, both syntactically and semantically.
Just like the Linux kernel can only be compiled with GCC, or with compilers that implement the same language extensions - which aren't C at all, not being part of C23 (ISO/IEC 9899:2024) - including compiler switches that change C semantics, such as strict provenance.
If you want to further discuss what is what, let's see how up to date your ISO knowledge is, versus the plethora of extensions across C and C++ compilers.
I think that's completely silly framing; you can AOT compile any code better—or at least, just as well—if you already know how you want it to perform at runtime. Any efficiency gain would necessarily need to be in the context of total productivity.
It's literally the framing of the linked article though, which takes as a prior that JIT compilers are already ahead of AoT toolchains. And... they aren't!
They are comparing Javascript JIT to Javascript AOT, to avoid the issue of language design.
"The fastest contemporary JavaScript implementations use JIT compilers [27]. ... However, JIT compilers may not be desirable or simply not available in some contexts, for instance if programs are to be executed on platforms with too limited resources or if the architecture forbids dynamic code generation. Ahead of time (AoT) compilers offer a response to these situations. Hopc [25] is an AoT JavaScript-to-C compiler. Its performance is often in the same range as that of the fastest JIT compilers but its impossibility to adapt the code executed at runtime seems a handicap for some patterns and benchmarks [27]."
In the context of JS it's reasonable to think that JIT may have an advantage, as the language is difficult to statically analyse.
> This gives them an advantage when compared to Ahead-of-Time (AoT) compilers that must choose the code to generate once for all.
I assumed they were talking about the general case, which is nearly useless to discuss. I just kind of filtered it out as internecine bickering amongst academics. The actual data are still interesting tho.
There were many many statements made that JIT compilers could be faster than AOT compilers because they had more information to use at runtime - originally this was mostly aimed at Java/HotSpot which has not, in practice, significantly displaced languages like C or C++ (or these days Rust) from high-performance work.
Yeah those statements were overly optimistic and I don’t think they’re representative of what most people in the JIT field think. It’s also not what I as a JIT engineer would have promised you.
The actual promise is just: JITs make dynamic languages faster and they are better at doing that than AOTs. I think lots of systems have delivered on that promise.
Yup, agreed, in the case of dynamic languages it's much clearer and the evidence is a lot more favourable.
The linked article doesn't help here because the abstract only mentions Javascript in the context of their work to prove their concept, but the body of the paper is clearer that it is discussing JIT vs AOT in the context of Javascript specifically.
I think their findings are applicable to lots of languages where the fastest known implementation is JIT based.
Not all “JIT dominant” languages rely on ICs as part of the JIT’s performance story, but enough of them do that it’s worth studying.
And JS happens to be the language where ICs have been taken the furthest, in terms of just how many different ways have been investigated and how many person years went into tuning them. So in some sense they’re picking the hardest fight. I think that’s a good thing.
I concur here. 20 years ago I was a JIT cheerleader and in the intervening time I've realized that you're only going to get the super-optimized hot inner loop perfect after the JIT and runtime has chugged through a ton of other slop that tends to make programs bloated and slow. And the Java ecosystem in particular has a tendency to build a ton of ceremony and abstractions that the runtime system has to boil away, but can only really managed to do so with deep inlining and a lot of optimizations, many of which are speculative.
> JITs make dynamic languages faster and they are better at doing that than AOTs
Yeah, I'm curious how well JIT works on languages with less dynamism. Perhaps a combination of AOT + JIT on a strongly statically typed language might provide the best of both worlds. Though I suppose PGO kinda does that.
HotSpot definitely has delivered on that too. It's a super dynamic runtime with reflection and randomly loaded jars even if Java the language is terse.
If you pit virtual-call-heavy code written in C++ against C#, C# will come out on top every single time, especially if you consume dynamically-linked dependencies or if you can't afford to wait until the heat death of the universe when all the LTO plugins finish their job.
Or if you use SIMD-heavy path and your binary is built against, say, X86-64-v2/3 and the target supports AVX512, .NET will happily use the entirety of AVX512 thanks to JIT even when still using 256b-wide operations (i.e. bespoke path that uses Vector256) with AVX512VL. This tends to surpass what you can get out of runtime dispatch under LLVM.
re: Java challenges - those stem from the JVM bytecode being a very difficult optimization target: every call is virtual by default with a complex dispatch strategy, everything is a heap-allocated object by default save for very few primitives, and generics lose type information and are never monomorphized. PGO through tiered compilation, and the resulting guarded devirtualization and object escape analysis, is what reclaims performance in Java and makes it acceptable. C and C++ with templates are a massively easier optimization target for GCC, and GCC does not operate under strict time constraints either. Therefore we have the results that we do.
Also interesting data points here if you'd like to look at AOT capabilities of higher-level languages:
The paper seems to start with the bizarre assumption that AOT compilers need to "catch up" with JIT compilers and in particular that they benefit from inline caches for member lookup.
But the fact is that AOT compilers are usually for well-designed languages that don't need those inline caches because the designers properly specified a type system that would guarantee a field is always stored at the same offset.
They might benefit from a similar mechanism to predict branches and indirect branches (i.e. virtual/dynamic dispatch), but they already have compile-time profile-guided optimization and CPU branch predictors at runtime.
Furthermore, for branches that always go in one direction except for seldom changes, there are also frameworks like the Linux kernel "alternatives" and "static key" mechanisms.
So the opportunity for making things better with self-modifying code is limited to code where all those mechanisms don't work well, and the overhead of the runtime profiling is worth it.
Which is probably very rare, and not worth bringing in a JIT compiler for.
AOTs are behind JITs for dynamic languages. It’s super interesting to study how to make AOTs catch up in that space, so I’m glad that these folks made an effort and reported the results!
Love to see negative results published, so so important.
Please, let's all move towards a research procedure that requires the hypothesis to be submitted before any research is allowed to commence, and that enforces publication regardless of results.
I think the missing piece here is that JavaScriptCore (JSC) and other such systems don't just use inline caching to speed up dynamic accesses; they use them as profiling feedback.
So any time you have an IC in interpreter, baseline, or lightly optimized code, that IC is monitored to see how polymorphic it gets, and that data is fed back into the optimization pipeline.
Just having an IC as a dead end, where you don't use it for profiling, is way less profitable than having an IC that feeds into profiling.
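To make that concrete, here is a minimal Python sketch (a toy, not JSC's actual machinery): the cache site both short-circuits the lookup and keeps a record of the shapes it has seen, which is exactly what a later optimization tier wants to consult.

    from collections import Counter

    # Toy object model: a shape maps property names to slot indices,
    # and an object is just (shape, slot_values).
    def make_obj(shape, *values):
        return (shape, list(values))

    class PropertyIC:
        """One property-access site: caches (shape -> slot) AND records feedback."""
        def __init__(self, name):
            self.name = name
            self.cached_shape = None
            self.cached_slot = None
            self.seen = Counter()                 # profiling feedback for the optimizer

        def get(self, obj):
            shape, slots = obj
            self.seen[id(shape)] += 1             # which shapes flow through this site?
            if shape is self.cached_shape:
                return slots[self.cached_slot]    # fast path: one guard, one indexed load
            self.cached_shape = shape             # miss: do the dictionary lookup once
            self.cached_slot = shape[self.name]
            return slots[self.cached_slot]

        def is_monomorphic(self):
            return len(self.seen) == 1            # what a higher tier asks before specializing

    point_shape = {"x": 0, "y": 1}
    ic = PropertyIC("x")
    print(ic.get(make_obj(point_shape, 3, 4)), ic.is_monomorphic())   # -> 3 True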
Well, on dynamic languages the ICs do give a nice order-of-magnitude speed-up by themselves, since the guard eliminates a whole hashtable (or linear) lookup rather than (as in this case) just a single memory indirection.
But yeah - on SpiderMonkey we found that orienting our ICs towards being stable and easy to work with, as opposed to just being fast, ended up leading to a much better design.
This is a nice result though. Negative, but good that they published it.
What would be a good next step is some QEMU-style transformation: pull out basic blocks; profile them for hotness, incoming arguments at function starts, and dynamic dispatch targets; then use that to re-compile the whole thing using a method JIT, in particular inlining across call paths with GVN and DCE applied.
I kind of expect the results to be very positive, just based on intuition, but it'd be cool to see how it actually turned out.
A minor nitpick: ICs don't give that much benefit in monomorphic languages like scheme.
Apologies if this response seems aggressive - this is just a topic I'm very passionate about :)
I think technically in languages like scheme, the opportunity would be to optimize other sorts of dispatches. The classic dispatch mechanism in scheme is the "assoc" style list-of-pairs lookup.
In this case, the "monomorphization" would be extracting runtime information on the common lookups that are taken. This is doable in a language like scheme, but it requires identifying parts of data structures that are less likely to change over time - where it makes sense to lift them up into hidden types and effectively make them "static".
Imagine if you could designate a particular `(list (cons key value) ...)` value as "optimizable" - maybe even with a macro/function call: `(optimizable ((a 1) (b 2) ...))`
This would build a hidden shape for the association's "backbone" and give you back a shaped assoc list, and then you would be able to optimize all uses of `(assoc ...)` on lists of that kind in the same way you optimize shaped objects.
A plumbing-exposed version of this would just let you do `(define my-shape (make-shape '(prop1 prop2 ...)))` and later `(my-shape '(1 2 ...))` to build the shape-optimized association list.
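A rough Python analogue of that hypothetical plumbing (make_shape and the rest are invented names for illustration): the key "backbone" is lifted into one shared shape, and assoc against a shaped list becomes a guard plus an indexed load instead of a linear scan.

    def make_shape(keys):
        # The shared, assumed-stable backbone: key -> position.
        return {k: i for i, k in enumerate(keys)}

    def make_shaped_assoc(shape, values):
        return (shape, list(values))             # many "assoc lists" share one shape

    _SITE_CACHE = {}                             # stands in for a per-call-site inline cache

    def assoc(shaped, key):
        shape, values = shaped
        hit = _SITE_CACHE.get(key)
        if hit is not None and hit[0] is shape:
            return values[hit[1]]                # guard on the shape, then a direct index
        idx = shape[key]                         # slow path: consult the backbone once
        _SITE_CACHE[key] = (shape, idx)
        return values[idx]

    my_shape = make_shape(['prop1', 'prop2'])
    al = make_shaped_assoc(my_shape, [1, 2])
    print(assoc(al, 'prop2'))                    # -> 2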
It's kind of neat when you realize that almost everything the runtime type-inference regime in a JIT compiler does is enable eliding lookups across data structures where we can assume that some part of that data structure is "more static" than other parts.
In JS that data structure is a linked-list-of-hashtables, where the hashtable keys and the linked list backbone are expected to be stable.
But the general idea applies to literally any structure you'd want to do lookups across. If you can extract a 'conserved shape', you can apply this optimization.
Couldn't PICs and monomorphization be seen as duals? They are both solving the problem of how to make polymorphic code have fewer branches.
PICs are the core mechanism of monomorphization in the VMs that do it
I was thinking of static monomorphization as in Rust.
But then there isn't a duality.
If your language is static enough, then static devirt is profitable enough that you can stop there.
If your language is dynamic enough, then PICs are the main driver of devirt. (Though all PIC-based systems couple that with static analysis and that static analysis is powerful enough that it can sometimes devirt without the PICs' help.)
I meant "dual" in the analogous sense, not as strict mathematical duals where one could replace the other: both are solving devirt from opposite ends. I read @bjoli's comment with the analogous connotation.
Your last sentence: would that be Rust using a PIC to optimize calls to dyn Traits?
Indeed, this was literally the conclusion of the first paper that introduced polymorphic inline caches.
I'll add that the real benefit of ICs isn't just that compiled code is specialized to the seen types, but the fact that deoptimization guards are inserted, which split diamonds in the original general cases so that multiple downstream checks become redundant. So specialization is not just a local simplification but a global simplification to all dominated code in the context of the compilation unit.
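A hand-written Python sketch of that effect (the "deopt" here is just a call back into the generic path, which is not how a real side exit works): one guard splits the diamond, and every access it dominates becomes a raw slot load.

    class Obj:
        def __init__(self, shape, slots):
            self.shape, self.slots = shape, slots
        def lookup(self, name):                      # the fully dynamic path
            return self.slots[self.shape[name]]

    POINT_SHAPE = {"x": 0, "y": 1}

    def length_sq_generic(p):
        return p.lookup("x") ** 2 + p.lookup("y") ** 2   # every access re-checks

    def length_sq_specialized(p):
        if p.shape is not POINT_SHAPE:
            return length_sq_generic(p)              # stand-in for a deopt side exit
        x = p.slots[0]                               # no check: dominated by the guard
        y = p.slots[1]                               # no check here either
        return x * x + y * y

    print(length_sq_specialized(Obj(POINT_SHAPE, [3, 4])))   # -> 25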
https://bibliography.selflanguage.org/pics.html for those following along
SpiderMonkey actually ditched most of the profiling stuff in favor of transpiling the ICs generated at runtime into the IR used by the optimizing compiler, inlining them into the functions when they're used, and then sending the whole thing through the usual optimization pipeline. The technique is surprisingly effective.
I don't know what the best reference to link for this would be, but look up "Warp" or "WarpMonkey" if you're interested.
WarpMonkey doesn't get rid of the profiling stuff - the profiling is inherent in ICs - we keep hitcounts and other information for various paths taken through code (including ICs) and use that to guide compilation later.
Warp's uniqueness is in how it implements the ICs. The design goal when we built the baseline JIT in SpiderMonkey was to split the code and data components of ICs. At the time, we were looking at V8 ICs, which were basically compiled code blocks with the relevant parameter data (e.g. a pointer to the hidden type to compare against) baked into the code.
We wanted to segregate the data from the code - e.g. so that all ShapedGetProp ICs can have a data stub with a pointer to their own shape, but share a pointer to the code. Effectively your ICs end up looking like small linked lists of C++ pure virtual objects (without the vtable indirection and just a single code pointer hanging off of the stub).
Originally the "shared code" was emitted by a bunch of statically defined methods that emitted a fixed bit of assembly (one for each kind of stub). That became unwieldy as we added more stubs, so CacheIR was designed. CacheIR was a simple bytecode language that the stubs could express their logic in, which would get compiled down to machine code. The CacheIR bytecode would be a key to the compiled stubcode.
That let stubs generate arbitrary CacheIR for their logic, but still share code between stubs that emitted the same logic.
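A toy Python rendering of that split (invented names, nothing like the real CacheIR encoding): each stub is a little data record in a linked list, and the handler it points at is shared by every stub whose CacheIR sequence matches.

    STUBCODE = {}                                    # CacheIR ops -> one shared handler

    def shared_code_for(ops):
        if ops not in STUBCODE:
            # Stand-in for "compile this CacheIR once"; real engines emit machine code.
            def handler(stub, obj):
                shape, slots = obj
                if shape is not stub["shape"]:       # the parameter lives in the stub's data
                    return ("miss", stub["next"])    # fall through to the next stub
                return ("hit", slots[stub["slot"]])
            STUBCODE[ops] = handler
        return STUBCODE[ops]

    def attach_stub(ic_chain, shape, slot):
        ops = ("GuardIsObject", "GuardShape", "LoadSlot")
        return {"shape": shape, "slot": slot,        # data unique to this stub
                "code": shared_code_for(ops),        # code shared by all such stubs
                "next": ic_chain}                    # the little linked list

    site = attach_stub(None, {"x": 0}, 0)            # a one-stub chain for `obj.x`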
That led to the idea of Warp, where we noticed that one could build the input for an optimized method-jit compiler just by combining the profiling info that stubs produced, and the CacheIR bytecode for those stubs.
Normally you'd start from bytecode, build an SSA, then do a pass where you apply type information.
With Warp, the design simplifies into stitching together a bunch of CacheIR chunks which already embed the optimization information you care about, and then compiling that.
Ultimately it does the same thing as the other JITs, but it goes about it in a really nice and clean way. It kind of expresses some of the ideas that Maxime Chevalier-Boisvert was exploring in their work with basic block versioning.
Thanks for the more complete explanation!
> Normally you'd start from bytecode, build an SSA, then do a pass where you apply type information.
> With Warp, the design simplifies into stitching together a bunch of CacheIR chunks which already embed the optimization information you care about, and then compiling that.
This is what I meant by ditching most of the profiling stuff; I suppose I should have said "type inference stuff" to be more precise.
> Originally the "shared code" was emitted by a bunch of statically defined methods that emitted a fixed bit of assembly (one for each kind of stub). That became unwieldy as we added more stubs, so CacheIR was designed.
I remember all too well :) I worked on the first pass at implementing megamorphic caches into the original stub generators that spit out (macro)assembly directly, before we had CacheIR. So much code duplication...
Ah, sorry for the misinterpretation!
Also, we may have overlapped on the team :)
My understanding is that branch prediction got better in the ‘10s and a bunch of techniques that didn’t work before do now.
> that branch prediction got better in the ‘10s and a bunch of techniques that didn’t work before do now.
They got better than they had any right to be, but then we found out that Spectre & Meltdown were vulnerabilities rather than optimizations.
For example, a switch-based interpreter was as fast as a CGOTO one for a brief period between 2012 and 2018, but suddenly got slower again as the CPUs could no longer rely on branch prediction to do prefetching.
The modern VM technique looks almost exactly like what the original PIC papers talked about in the 90s. There are some details that are different, but I'm not sure that the details come down to exploiting changes in branch prediction efficiency. I think the things that changed come mostly down to the fact that the original PIC paper was a first stab by a small team whereas modern VMs involve decades of engineering by larger teams (so everything that could get more complex as a consequence of tuning did get more complex).
So, while it's true that microarches changed in a lot of ways, the overall implications for how you build VMs are not so big.
Are you still using a threaded interpreter main loop? That didn't really come around until the mid 90's and I've been hearing for about ten years now that it's not a clear win anymore due to predictors being able to read through two levels of indirection.
The last time I ran the experiment of having a single jump, it was slower than jump-per-opcode-handler.
It's true that predictors are able to see through multiple levels, but a threaded interpreter gives them plus one level, and that ends up mattering as much as it always did.
We talk about this a bit in our CacheIR paper. Search for "IonBuilder".
https://www.mgaudet.ca/s/mplr23main-preprint.pdf
It sounds like you're describing something similar to what the other JS VMs do
The main thing we're doing differently in SM is that all of our ICs are generated using a simple linear IR (CacheIR), instead of generating machine code directly. For example, a simple monomorphic property access (obj.prop) would be GuardIsObject / GuardShape / LoadSlot. We can then lower that IR directly to MIR for the optimizing compiler.
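For a concrete feel, here is a Python toy of what such a linear IR could look like (not SpiderMonkey's real encoding): the same short op list can be interpreted in a baseline tier or translated op-by-op into the optimizer's IR.

    shape = object()                              # stands in for a shape / hidden-class pointer
    ic_ops = [("GuardIsObject",),
              ("GuardShape", shape),
              ("LoadSlot", 2)]

    def run_cacheir(ops, value):
        # Baseline-style interpretation; an optimizing tier would instead map
        # each op 1:1 onto its own IR (the "lower directly to MIR" step).
        for op, *args in ops:
            if op == "GuardIsObject" and not isinstance(value, tuple):
                return None                       # guard failed: try the next stub, or deopt
            elif op == "GuardShape" and value[0] is not args[0]:
                return None
            elif op == "LoadSlot":
                return value[1][args[0]]
        return None

    obj = (shape, ["a", "b", "c"])                # toy object: (shape, slot values)
    print(run_cacheir(ic_ops, obj))               # -> c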
It gives us a lot of flexibility in choosing what to guard, without having to worry as much about getting out of sync between the baseline ICs and the optimizer's frontend. To a first approximation, our CacheIR generators are the single source of truth for speculative optimization in SpiderMonkey, and the rest of the engine just mechanically follows their lead.
There are also some cool tricks you can do when your ICs have associated IR. For example, when calling a method on a superclass, with receivers of a variety of different subclasses, you often end up with a set of ICs that all 1. Guard the different shapes of the receiver objects, 2. Guard the shared shape of the holder object, then 3. Do the call. When we detect that, we can mechanically walk the IR, collect the different receiver shapes, and generate a single stub-folded IC that instead guards against a list of shapes. The cool thing is that stub folding doesn't care whether it's looking at a call IC, or a GetProp IC, or anything else: so long as the only thing that differs is a single GuardShape, you can make the transformation.
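A sketch of that folding step over the same toy encoding as above (again invented, and simplified to one GuardShape per stub): if several stubs' IR differs only in the guarded shape, collect those shapes into one stub that guards a set.

    def try_fold(stubs):
        # stubs: list of CacheIR op lists that all ended up attached to one site.
        first = stubs[0]
        shapes = []
        for ops in stubs:
            if len(ops) != len(first):
                return None
            for this_op, ref_op in zip(ops, first):
                if this_op[0] == ref_op[0] == "GuardShape":
                    continue                       # the one thing allowed to differ
                if this_op != ref_op:
                    return None                    # stubs differ elsewhere: don't fold
            shapes.append([op for op in ops if op[0] == "GuardShape"][0][1])
        # Identical IR modulo the guarded shape: emit a single folded stub.
        return [("GuardMultipleShapes", tuple(shapes)) if op[0] == "GuardShape" else op
                for op in first]

    s1, s2 = object(), object()
    a = [("GuardIsObject",), ("GuardShape", s1), ("LoadSlot", 0)]
    b = [("GuardIsObject",), ("GuardShape", s2), ("LoadSlot", 0)]
    print(try_fold([a, b]))   # -> [('GuardIsObject',), ('GuardMultipleShapes', (...)), ('LoadSlot', 0)]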
> The main thing we're doing differently in SM is that all of our ICs are generated using a simple linear IR (CacheIR)
JSC calls this PolymorphicAccess. It’s a mini IR with a JIT that tries to emit optimal code based on this IR. Register allocation and everything, just for a very restricted IR.
It’s been there since I don’t remember when. I wrote it ages ago and then it has evolved into a beast.
Taking a quick look at the JSC code, the main difference is that CacheIR is more pervasive and load-bearing. Even monomorphic cases go through CacheIR.
The main justification for CacheIR isn't that it enables us to do optimizations that can't be done in other ways. It's just a convenient unifying framework.
This is unique to SpiderMonkey, as far as I'm aware.
One of the last pieces of really good advice I got before I gave up on writing a programming language myself is that if you instrument the paths that are already expected to be slow, you can get most of the value of instrumentation with a fraction of the cost per call. Because people avoid making the slow calls, and if they don’t the app was going to be slower anyway so why not an extra couple percent? Versus the fast path where the instrumentation may be a quarter or more of runtime.
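As a concrete (made-up) illustration of that advice: hang the counters off the miss path, which is already doing the expensive work, and leave the hit path untouched.

    MISS_STATS = {}                         # only ever touched on the slow path

    def cached_lookup(cache, table, key):
        value = cache.get(key)
        if value is not None:
            return value                    # fast path: zero instrumentation
        # Slow path: we're already paying for a full lookup, so a counter is noise here.
        MISS_STATS[key] = MISS_STATS.get(key, 0) + 1
        value = table[key]
        cache[key] = value
        return value

    print(cached_lookup({}, {"k": 42}, "k"), MISS_STATS)   # -> 42 {'k': 1}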
The answer is always more feedback. I am excited about DNN powered static profilers. The training data will come from the JIT saving the results of their experiments.
That's an exciting direction!
Profile Guided Optimization without Profiles: A Machine Learning Approach
https://www.semanticscholar.org/paper/Profile-Guided-Optimiz...
Very cool!
I've been thinking about what it would look like for something like this to be done for the profiling that you get from ICs, not the profiling you get from branch weights or basic block counts.
They're quite different. Two big differences:
- My best estimate is that speculating on type state (i.e. what you get from ICs) is a value bet only if you're right about 99.9% of the time (or even 99.999% - depends on your compiler/runtime architecture). I think you can get profit from branch weights if they are right less than 99.9% of the time.
- Speculating on type state means having semantically rich profiling information. It's not just a bunch of numbers. You need the profiler to describe a type to you, like: "I expect this access to see objects with fields x, y, z (in that order) and it has a prototype that has fields a, b, c, which then has a prototype with fields e, f, g".
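To make that contrast concrete, here is a rough sketch of what each kind of profile has to carry (the type names are invented for illustration):

    // Branch weights are just numbers.
    type BranchProfile = { taken: number; notTaken: number };

    // Type-state feedback has to describe structure: the fields in order, plus
    // the structure of the prototype chain, recursively.
    type ShapeDescription = {
      fields: string[];                     // e.g. ["x", "y", "z"]
      prototype: ShapeDescription | null;   // e.g. -> ["a", "b", "c"] -> ["e", "f", "g"]
    };

    type AccessSiteProfile = {
      seenShapes: ShapeDescription[];       // what the IC observed at this access site
      misses: number;                       // how often none of the cached shapes matched
    };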
For the .NET JIT, at least, speculation on types seems beneficial even if we're only right maybe 30% of the time.
See eg https://github.com/dotnet/runtime/blob/main/docs/design/core...
(where this is presented as a puzzle)....
Guarded devirtualization is different from the speculation that I'm talking about.
To me, speculation is where the fail path exits the optimized code.
To handle JS's dynamism, guarding is usually not worth it (though JSC has the ability to do that, if the profiling says that the fail path is probable). I believe that most of HotSpot's perf comes from speculation rather than guarded devirt.
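A rough TypeScript-flavoured sketch of that distinction, with invented names, just to pin down the terminology:

    const EXPECTED_SHAPE = 1;
    class Deoptimize extends Error {}        // stands in for "bail back to the interpreter/baseline"
    interface Obj { shape: number; fastMethod(): number; }

    // Guarded devirtualization: the fail path is compiled right next to the fast
    // path and execution stays inside the optimized code.
    function guardedCall(obj: Obj, genericCall: (o: Obj) => number): number {
      if (obj.shape === EXPECTED_SHAPE) return obj.fastMethod();   // inlined fast path
      return genericCall(obj);                                     // fallback, still optimized code
    }

    // Speculation: the optimized code assumes the guard holds; if it does not,
    // execution exits the optimized code entirely.
    function speculativeCall(obj: Obj): number {
      if (obj.shape !== EXPECTED_SHAPE) throw new Deoptimize();
      return obj.fastMethod();
    }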
Which JIT would be the easiest to implement to log this information? A time series LLM should be able to analyze it and give predictions.
Looks like PyPy is the most extensible.
https://rpython.readthedocs.io/en/latest/logging.html
And that the JIT is rebuilt from rpython, so it is fairly open to extension.
I know exactly how I would do that to JavaScriptCore, but that’s maybe mostly due to the fact that I designed most of the bits you’d have to instrument.
Not sure if it’s the easiest overall.
I’m easy to look up if you want to pick my brain about JSC
What a generous offer. I'll spend some time reading your papers first. Thank you.
Slightly orthogonal...
In my Sciter, which uses QuickJS (no JIT), I've added a C compiler instead of a JIT. That means we can add not just JS modules but C modules too.
Such a cmodule is compiled on the fly into native code and executed. The idea is simple: each language is good for specific tasks. JS is flexible and C is performant, so just use the tool that is most optimal for the task. C modules play two major roles: FFI and number-crunching code execution.
Sciter uses the TCC compiler and runtime.
In total, the QuickJS + TCC binary bundle is 500k + 220k = 720k.
For comparison: V8 is about 40 MB.
https://sciter.com/c-modules-in-sciter/ https://sciter.com/here-we-go/
Interesting project! After clicking around on the website:
> In almost 10 years, Sciter UI engine has become the secret weapon of success for some of the most prominent antivirus products on the market: Norton Antivirus and Internet Security, Comodo Internet Security, ESET Antivirus, BitDefender Antivirus, and others.
What an intriguingly specific niche of customer! How come all these different anti-virus companies decided to use your platform?
Even though I am not a big C fan, the idea is rather cool; it is a bit like having C++ on .NET via C++/CLI.
Tangentially, fuck yeah, negative results, just as good as positive ones
Amen. This paper is worth more than all of the fraudulent, unreproducible papers we are inundated with, put together and squared.
chasing inline cache micro-optimizations with dynamic binary modification is a dead end. modern CPUs are laughing at our outdated compiler tricks. maybe it's time to accept that clever hacks won’t outrun silicon.
JITs typically are too broken for compiler tricks so I don't think it's time to accept that just yet.
what is the better approach?
You don't; there are equal trade-offs. A JIT might use more memory because of what it does at runtime, but that is also the exact reason it is faster to start. A good trade-off is just using the type of language best suited for the workload.
The people who came up with this are obviously brilliant, but being French myself, I really wonder why no one is proof-reading the English; this gives an overall bad impression of the work imho.
Being a native English speaker I absolutely love reading and listening to speakers of English as a second language. Speaking is actually a subspecies of singing, and it's always cool to hear the same old lyrics remixed to a new melody and a new beat.
English has no 'correct' way to be written or spoken, nor does it need one, nor would it benefit from one, therefore, nor should it have one.
Speakers of English as a second language: you are what makes English a great language.
> English has no 'correct' way to be written or spoken, nor does it need one, nor would it benefit from one, therefore, nor should it have one.
There may be no 'correct' way, but there are plenty of 'incomprehensible' ways. I once encountered a research paper that had clearly [0] been translated word-for-word from French into English and made no sense until I translated it word-for-word back to French...
[0]: actually it was only clear after I realised I should attempt the reverse translation ;)
Sure, but frankly, I've heard plenty of people speaking the most flawless King's English who didn't make any sense at all.
re: translated math papers: haha we've all been there. Once I had to read a bunch of 70's-era papers from Russian Mathematicians. The translators, bless their hearts, I'm sure knew everything there was to know about Dickens and Dostoevsky, but it was clear they had no clue what the math was all about :-)
Oh well, Math is the universal language, right? chuckle
> Speaking is actually a subspecies of singing, and it's always cool to hear the same old lyrics remixed to a new melody and a new beat.
What a lovely take on this topic! :)
(does this imply you're a fellow believer in the hypothesis that singing evolved before language?)
That's a beautiful way of seeing things! Unfortunately, as you're well aware I'm sure, most people do not share your idyllic view of polyglots and, for better or worse, they will assume that bad english = bad quality work. And bad doesn't have to mean mistakes. Just an unusual wording is enough to throw the average person off, in my experience.
I'm not as worried about those who have ears, but don't hear, as I am about the effect LLMs will have on English.
Grammarly was bad enough. One of my oldest friends is from Transylvania, and he could tell such great stories in his Eastern European accent and cadence. When he collected those stories into a book, he ran everything through Grammarly, and the book reads like a soulless newscaster ;-(
When people start en masse to run their prose through LLMs to "correct" it, English will lose one of its main arteries.
It's a preprint.
This seems poorly grounded. In fact almost three decades after the release of the Java HotSpot runtime we're still waiting for even one system to produce the promised advantages. I guess consensus is that V8 has come closest?
But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
The benchmarks I have seen show Hotspot is ahead of V8. E.g. https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-byt...
What makes this very complicated is that 1) language design plays a big part in performance and 2) CPUs change as well and this anecdotally seems to have more impact on interpreter than compiler performance.
With regards to 1), consider optimizing Javascript. It doesn't have machine integers, so you have to do a bunch of analysis to figure out when something is being used as an integer, and only then can you make that code fast. There are many other cases. Python is even worse in this regard. In comparison, AOT-compiled languages are usually designed to be fast, so they make tradeoffs that favour performance at the cost of some level of abstraction/expressivity. The JVM is somewhere in the middle, and so is its performance.
With regards to 2) this paper is an example, as is https://inria.hal.science/hal-01100647/file/InterpIBr-hal.pd...
> you have to do a bunch of analysis to figure when something is being used as an integer and then you can make that code fast
It doesn't get much attention now that WASM exists, but asm.js essentially solves this, so a more head-to-head comparison ought to be possible. (V8 has optimisations specific to asm.js.)
https://en.wikipedia.org/wiki/Asm.js
asm.js solves this in the specific case where somebody has compiled their C/C++ code to target asm.js. It doesn't solve it for arbitrary JS code.
asm.js is more like a weird frontend to wasm than a dialect of JS.
No, if you just use the standard JavaScript cast to integer incantation, |0, v8 will optimize it. asm.js is valid JavaScript.
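For anyone who hasn't seen the idiom, here is roughly what that looks like (just the |0 coercion as an int32 hint; a real asm.js module also needs the "use asm" prologue and module structure):

    // `x | 0` truncates to a 32-bit integer, so every value flowing through this
    // loop is known to be an int32 and the engine can use integer arithmetic
    // instead of generic number paths. Sketch only, not a complete asm.js module.
    function sumUpTo(n: number): number {
      n = n | 0;
      let acc = 0;
      for (let i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
        acc = (acc + i) | 0;
      }
      return acc | 0;
    }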
With all respect that sounds like excuse-making. I mean, yeah, Javascript and JVM and .NET are slower runtimes than C or Rust[1]. Nonetheless that's the world we live in, and if you have a performance-sensitive problem to solve you pick up rustc or g++ and not a managed runtime. If that's wrong, someone's got to actually show that it's wrong.
[1] Maybe Go or Swift would be more apples-to-apples. But even then are there clear benchmarks showing Kotlin or C# beating similar AoT code? If anything the general sense of the community is that Go is faster than Java.
Excuses for what? I'm not the elected representative for JIT compiled languages, sworn to defend them. There are technical reasons they tend to be slower. I was sketching some of them.
I think the above comments are because JIT gets so much positive press that someone wandering in from outside could be forgiven for thinking that JIT isn't coming 2nd in a two-man race with AOT.
I've been around long enough to hear that Java and JIT are gonna overtake C++ any day now.
The title on this article doesn't help.
https://devblogs.microsoft.com/oldnewthing/20060731-15/?p=30...
https://blog.codinghorror.com/on-managed-code-performance-ag...
And that was 2005. Modern .NET is much, much faster.
> If anything the general sense of the community is that Go is faster than Java.
Faster where?
When things are performance-sensitive, you want things to be tunable and predictable. Good luck playing with the JIT if you rely that for performance...
Good luck with AOT as well, unless you hardcode the target hardware, like game consoles.
> But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
It's considerably more complicated than that. After working in this area for 25 years, I have vacillated between extremes over decades-long arcs. The reality is much more nuanced than a four sentence HN comment. Profile and measure and stare at machine code. If you don't do that daily, it's hand waving and having hunches.
I'd also point out that it's an ever-shifting landscape. What was slow yesterday might not be today.
In my experience, while there are some negatives of the runtime selected, the vast majority of performance is won or lost at the algorithm level. It really doesn't matter that rust can be faster than ruby if you chose an O(n^3) algorithm. Rust will run the O(n^3) algorithm faster than ruby, for sure, but ruby will beat the pants off of rust if someone converts it into an O(n) algorithm.
It only starts mattering if you already have an O(n) algorithm. However, in my experience, a LOT of programmers are happy writing an n^3 and moving on to the next task without considering what this will do.
You may be underestimating the degree of difference in performance between Ruby and Rust.
Here's a comparison of Ruby with JS, and Rust is of course faster still: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
If the code runs 100 times faster, it might just offset even highly inefficient implementation.
> a LOT of programmers are happy writing a n^3
I have the same experience.
Unfortunately, and this is an issue I keep fighting with in some .NET communities, languages like C, C++ and Rust tend to select for engineers who are more likely to care about writing a reasonably efficient implementation.
At the same time, higher-level languages sometimes can almost encourage the blindness to the real world model of computation, the execution implications be damned. In such languages you will encounter way more people who will write O(n^3) algorithm and will fight you tooth and nail to keep it that way because they have zero understanding of the fundamentals, wasting the heroic effort by the runtime/compiler to keep it running acceptably well.
I've found this website provides different results: https://programming-language-benchmarks.vercel.app/typescrip...
It also shows different Ruby implementations. I've tried truffleruby myself and it's blazing fast on long-running CPU-intensive tasks.
The tests on this website run for very little time indeed. They use input values that e.g. the original BenchmarksGame suggests for validation before running for a longer time to get actual performance (another case in point: surely you want to run a web server for longer than a couple hundred milliseconds). In my experience the data there does not always replicate what you get in real-world scenarios. It's an unfortunate tradeoff, because the benchmark runs will take a very long time when you want to support so many languages, but in my opinion it's better to have numbers that are useful for making informed decisions over pure quantity.
If you have something specific in mind, it can be more interesting to build and measure the exact scenario you’d like to know about (standard caveats to benchmarking properly apply), which is quite easier if you have, say, just two languages.
> At the same time, higher-level languages sometimes can almost encourage the blindness to the real world model of computation, the execution implications be damned. In such languages you will encounter way more people who will write O(n^3) algorithm and will fight you tooth and nail to keep it that way because they have zero understanding of the fundamentals, wasting the heroic effort by the runtime/compiler to keep it running acceptably well.
I would say this tracks. I spent some time doing research on JVMs and found that, for example, the Java community largely values building OO abstractions around program logic and structuring things in ways that generally require more runtime logic and safety checks. For example, Java generics are erased and replaced with casts in the bytecode. Those checks the JVM has to blindly perform in the interpreter and any lower compiler tiers that don't inline. Only when you get to the optimizing tiers does the compiler start to inline enough to see enough context to statically eliminate these checks.
Of course Java hides these checks because they should never fail, so it's easy to forget they are there. As API designers and budding library writers, Java programmers learn to use these abstractions, like the nicety of generics, in order to make things more general and usable. That's the higher priority, and when the decision comes down to performance versus reuse, programmers choose reuse all the time.
> that generally require more runtime logic and safety checks.
These safety checks and runtime logic are a constant factor in the performance of a given java application.
Further, they are mostly minuscule compared to other things you are paying for by using Java. The class check requires loading the object from main memory/CPU cache, but the actual check is a single-cycle cmp. Considering that the object will then be immediately used by the following code (hence warm in cache), the price really isn't comparable to the already existing overhead of reaching down into RAM to fetch it.
I won't say there aren't algorithms that will suffer, particularly if you are doing really heavy data crunching that extra check can be somewhat murder. However, in the very grand scheme of things, it's nothing compared to all the memory loading that goes on in a typical java application.
That is to say, the extra class cast on an `ArrayList<Point>` is nothing compared to the cost of the memory lookups when you do
> The class check requires loading the object from main memory/cpu cache but the actual check is a single cycle cmp check.
Only a guard or, possibly, a final class type-check (at least it's the case for sealed classes or exact type comparisons in .NET). For anything else this will be more involved due to inheritance.
Obviously for any length above ~3 this won't dominate but JVM type system defaults don't make all this any easier.
I'm not an expert, but I think the compiler requires the exact class on insertion, so at the use site it's just a check.
> If the code runs 100 times faster, it might just offset even highly inefficient implementation.
That's the danger of algorithmic complexity. 100 is a constant factor. As n grows, the effects of that constant factor are overwhelmed by the algorithmic inefficiency. For something like an n^3, it really doesn't take long before the algorithm dominates the performance over any language considerations.
To put it in perspective: if at n=10 the Rust O(n^3) implementation is 100x faster than the Ruby O(n) one, then Rust's per-step constant is roughly 10,000x smaller (1,000 cubic steps vs 10 linear steps, times the 100x gap). The curves cross where n^2 is about 10,000, i.e. around n=100, after which Ruby ends up faster than Rust, and the gap only grows from there.
For the most part, the runtime complexity of languages is a relatively fixed factor. That's why algorithmic complexity ends up being extremely important, more so than the language choice.
I used to not think this way, but the more I've dealt with performance tuning the more I've come to realize the wisdom of Big Oh in day to day programming. Too many devs will justify an O(n^2) algorithm as being "simple" even though the O(n) algorithm is often just adding a new hashtable to the mix.
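A hypothetical example of exactly that kind of fix, written in TypeScript: the "simple" quadratic duplicate check versus the same check with one extra hash set.

    function hasDuplicateQuadratic(items: string[]): boolean {
      for (let i = 0; i < items.length; i++) {
        for (let j = i + 1; j < items.length; j++) {
          if (items[i] === items[j]) return true;   // O(n^2) comparisons
        }
      }
      return false;
    }

    function hasDuplicateLinear(items: string[]): boolean {
      const seen = new Set<string>();
      for (const item of items) {
        if (seen.has(item)) return true;            // O(n) expected, one Set added
        seen.add(item);
      }
      return false;
    }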
JVM implementations, especially those with PGO feedback loop across runs do quite well.
Likewise, modern Android runs reasonably well with its mix of JIT, AOT with JIT PGO metadata, and baseline profiles shared across devices via the Play Store.
The gold standard for anyone who actually cares about ultimate performance is hand-written Assembly, naturally guided by a profiler capable of measuring everything the CPU is doing, like VTune.
I agree, the "JITs can be faster because X Y Z" arguments have never turned into "JITs are actually faster".
Maybe that's because JIT is almost always used in languages that were slow in the first place, e.g. due to GC.
Is there a JITing C compiler, or something like that? Would that even make sense?
Binary Translation could be seen as a generalized JIT for native code.
Dynamo: A Transparent Dynamic Optimization System https://dl.acm.org/doi/pdf/10.1145/358438.349303
> We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
https://www.semanticscholar.org/paper/Dynamo%3A-a-transparen...
Maybe the "allocate as little as possible, use sun.misc.Unsafe a lot, have lots of long-lived global arrays" style of Java programming some high-performance Java programs use would get close to being a good stand-in.
I'm pretty sure the major penalty is the lack of inline objects (thus requiring lots of pointer-chasing), rather than GC. GC will give you unpredictable performance but allocation has a penalty regardless of approach.
For purely array-based code, JIT is the only factor and Java can seriously compete with C/C++. It's impossible to be competitive with idiomatic Java code though.
C# has structs (value classes) if you bother to use them. Java has something allegedly similar with Project Valhalla, but my observation indicates they completely misunderstand the problem and their solution is worthless.
Inline objects is a huge hit that hopefully gets solved soon.
But I'd posit that one programming pattern enabled by a GC is concurrent programming. Java can happily create a bunch of promises/futures, throw them at a thread pool and let that be crunched without worrying about the lifetimes of stuff sent in or returned from these futures.
For single threaded stuff, C probably has java beat on memory and runtime. However, for multithreading it's simply easier to crank out correct threaded code in Java than it is in C.
IMO, this is what has made Go so appealing. Go doesn't produce the fastest binaries on the planet, but it does have nice concurrency primitives and a GC that makes highly parallel processes easy.
I am extremely skeptical of any "concurrency made easy" claims. Rust has probably the best claim in that area but it's still pretty limited, and comes at the cost of making it hard to write normal code.
I wouldn't (and didn't) say "easy" just "easier". The thing that makes rust concurrency so gnarly to work with is the lifetime battles you have to do in order to make it work. That's still better than C/C++ because you aren't dealing with accidental memory corruption when the wrong thread frees memory at the wrong time.
For languages like rust/C/C++, thread safe data structures are VERY hard to pull off. That's because tracking the lifetime of things tracked by the data structures introduces all sorts of heartburn.
What GCed languages buy you is not needing to track those lifetimes. Yes, you can still have data races and shared memory mutation problems, but you can also write thread safe data structures like caches without the herculean efforts needed to communicate with users of the cache who owns what when and when that thing dies.
The best that Rust and C++ can do to solve these problems is ARC and a LOT of copying.
To be fair, .NET has way more than just structs. But yes, they are a starting point.
> Java has something allegedly similar with Project Valhalla, but my observation indicates they completely misunderstand the problem and their solution is worthless.
Hahah spicy take, I'd be interested to hear more. It definitely might not bode well that they opened the "Generics Reification" talk at JVMLS 2024 with "we have no answers, only problems."
I'm not going to investigate it again, there was probably more than this. But from what I recall:
* The compiler isn't actually guaranteed to store them by value at all. Basically, they're written to be an "optional extension" rather than a first-class feature in their own right.
* Everything is forced to be immutable, so you can't actually write most of the code that would take advantage of value types in the first place. Hot take: functional programming is mainly a bad workaround for languages that don't support value types in the first place.
The immutable thing is actually being sold as a strength, i.e. "you write your nice clean immutable code, and if you've tagged it as a value type or flattenable, the compiler will figure out it doesn't need a new allocation and will update the existing value inline." I think they see it as in keeping with the Java culture of "you get very good performance for straightforward code" but I definitely agree there's a hazard of introducing an unnecessary impedance mismatch.
It will be a lot of work for the compiler to unspill modifications on any non-trivial data structure and reduce register pressure, especially since it's Java's first foray into structs :)
(I suppose if the list of things you can do with structs is very short, this will be nowhere near as useful but will also reduce the amount of compiler changes)
The whole point is to introduce value types without a .NET Framework vs .NET Core schism.
Random jars taken out of Maven Central should be able to continue to execute in a Valhalla-enabled JVM, without changes to their original semantics, while at the same time being able to somewhat take advantage of the Valhalla world.
Naturally there is always the issue of APIs that no longer exist, like Thread.stop(), but that is orthogonal to the idea of having binary libraries keep working in a new value-aware world.
There are tons of compiler changes; keeping semantic changes minimal and preserving the bytecode ABI as much as possible is the engineering challenge.
> Is there a JITing C compiler, or something like that?
Yes, for example, compiling C to JavaScript (or asm.js, etc. [0]) leads to the C code being JITed.
And yes, there are definitely benchmarks where this is actually faster. Any time that a typical C compiler can't see that inlining makes sense is such an opportunity, as the JIT compiler sees the runtime behavior. The speedup can be very large. However, in practice, most codebases get inlined well using clang/gcc/etc., leaving few such opportunities.
[0] This may also happen when compiling C to WebAssembly, but it depends on whether the wasm runtime does JIT optimizations - many do not and instead focus on static optimizations, for simplicity.
C++/CLI is one example; it is C++, not C, but the example holds.
Now the money question: can anyone come up with a benchmark where, due to the JIT, C++/CLI runs faster than normal C++ compiled for the same CPU?
Writing a program where a jit version is faster than the aot version is just an exercise in knowing the limitations of AOT.
People have been doing runtime code generation for a very long time for exactly this reason.
A general implementation faster than, say, g++ is a completely different beast.
It is not C++ (or C) but a Microsoft-invented language - which is OK, but don't confuse it with C++ any more than MS have already done.
I love how folks worship GCC and clang compiler extensions as if they were C and C++ (same for UNIX compiler vendors in general, including embedded RTOS toolchains), but when Microsoft does it, for whatever reason it doesn't count.
A double standard, in other words.
I certainly don't "worship" any compiler, and am pretty quick to point out non-standard extensions in people's code. But C++/CLI goes far, far beyond extensions, and becomes a completely different language to C++, both syntactically and semantically.
Just like the Linux kernel can only be compiled with GCC, or with compilers that implement the same language extensions - extensions that aren't C at all, not being part of C23 (ISO/IEC 9899:2024) - including compiler switches that change C semantics, such as strict provenance.
If you want to further discuss what is what, let's see how up to date your ISO knowledge is versus the plethora of extensions across C and C++ compilers.
> I guess consensus is that V8 has come closest?
V8 better than the JVM? Insanity, maybe it can come to within an order of magnitude in terms of performance.
Comes closest to realizing the concept of a JIT that is better than AOT.
I think that's completely silly framing; you can AOT compile any code better—or at least, just as well—if you already know how you want it to perform at runtime. Any efficiency gain would necessarily need to be in the context of total productivity.
> I think that's completely silly framing
It's literally the framing of the linked article though, which takes as a prior that JIT compilers are already ahead of AoT toolchains. And... they aren't!
They are comparing Javascript JIT to Javascript AOT, to avoid the issue of language design.
"The fastest contemporary JavaScript implementations use JIT compilers [27]. ... However, JIT compilers may not be desirable or simply not available in some contexts, for instance if programs are to be executed on platforms with too limited resources or if the architecture forbids dynamic code generation. Ahead of time (AoT) compilers offer a response to these situations.
Hopc [25] is an AoT JavaScript-to-C compiler. Its performance is often in the same range as that of the fastest JIT compilers but its impossibility to adapt the code executed at runtime seems a handicap for some patterns and benchmarks [27]."
In the context of JS it's reasonable to think that JIT may have an advantage, as the language is difficult to statically analyse.
> This gives them an advantage when compared to Ahead-of-Time (AoT) compilers that must choose the code to generate once for all.
I assumed they were talking about the general case, which is nearly useless to discuss. I just kind of filtered it out as internecine bickering amongst academics. The actual data are still interesting tho.
> Java HotSpot runtime we're still waiting for even one system to produce the promised advantages.
What promised advantages are you waiting on?
There are lots of systems that have architectures that are similar to HotSpot, or that surpass it in some way. V8 is just one.
There were many many statements made that JIT compilers could be faster than AOT compilers because they had more information to use at runtime - originally this was mostly aimed at Java/HotSpot which has not, in practice, significantly displaced languages like C or C++ (or these days Rust) from high-performance work.
Yeah those statements were overly optimistic and I don’t think they’re representative of what most people in the JIT field think. It’s also not what I as a JIT engineer would have promised you.
The actual promise is just: JITs make dynamic languages faster and they are better at doing that than AOTs. I think lots of systems have delivered on that promise.
Yup, agreed, in the case of dynamic languages it's much clearer and the evidence is a lot more favourable.
The linked article doesn't help here because the abstract only mentions Javascript in the context of their work to prove their concept, but the body of the paper is clearer that it is discussing JIT vs AOT in the context of Javascript specifically.
I think their findings are applicable to lots of languages where the fastest known implementation is JIT based.
Not all “JIT dominant” languages rely on ICs as part of the JIT’s performance story, but enough of them do that it’s worth studying.
And JS happens to be the language where ICs have been taken the furthest, in terms of just how many different ways have been investigated and how many person years went into tuning them. So in some sense they’re picking the hardest fight. I think that’s a good thing.
I concur here. 20 years ago I was a JIT cheerleader and in the intervening time I've realized that you're only going to get the super-optimized hot inner loop perfect after the JIT and runtime has chugged through a ton of other slop that tends to make programs bloated and slow. And the Java ecosystem in particular has a tendency to build a ton of ceremony and abstractions that the runtime system has to boil away, but can only really managed to do so with deep inlining and a lot of optimizations, many of which are speculative.
> JITs make dynamic languages faster and they are better at doing that than AOTs
Indeed.
Yeah, I'm curious how well JIT works on languages with less dynamism. Perhaps a combination of AOT + JIT on a strongly, statically typed language might provide the best of both worlds. Though I suppose PGO kinda does that.
HotSpot definitely has delivered on that too. It's a super dynamic runtime with reflection and randomly loaded jars even if Java the language is terse.
I guess distributed systems and OS GUI frameworks aren't it then.
If you pit virtual-call-heavy code written in C++ against C#, C# will come out on top every single time, especially if you consume dynamically-linked dependencies or if you can't afford to wait until the heat death of the universe when all the LTO plugins finish their job.
Or if you use SIMD-heavy path and your binary is built against, say, X86-64-v2/3 and the target supports AVX512, .NET will happily use the entirety of AVX512 thanks to JIT even when still using 256b-wide operations (i.e. bespoke path that uses Vector256) with AVX512VL. This tends to surpass what you can get out of runtime dispatch under LLVM.
re: Java challenges - those stem from the JVM bytecode being a very difficult optimization target i.e. every call is virtual by default with complex dispatch strategy, everything is a heap-allocated object by default save for very few primitives, generics lose type information and are never monomorphized - PGO optimization through tiered compilation and resulting guarded devirtualization and object escape analysis is something that reclaims performance in Java and makes it acceptable. C and C++ with templates are a massively easier optimization target for GCC, and GCC does not operate under strict time constraints too. Therefore we have the results that we do.
Also interesting data points here if you'd like to look at AOT capabilities of higher-level languages:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Hand-optimized AoT builds with solid profile-based feedback, right?
> we're still waiting for even one system to produce the promised advantages
To be clear, successful JITs do runtime profiling + optimization, to significant benefit.
But on net, JIT languages are slower.
It is a valid question to ask whether AOT binaries can selectively use runtime optimizations, making them even faster.
The paper seems to start with the bizarre assumption that AOT compilers need to "catch up" with JIT compilers and in particular that they benefit from inline caches for member lookup.
But the fact is that AOT compilers are usually for well-designed languages that don't need those inline caches because the designers properly specified a type system that would guarantee a field is always stored at the same offset.
They might benefit from a similar mechanism to predict branches and indirect branches (i.e. virtual/dynamic dispatch), but they already have compile-time profile-guided optimization and CPU branch predictors at runtime.
Furthermore, for branches that always go in one direction except for seldom changes, there are also frameworks like the Linux kernel "alternatives" and "static key" mechanisms.
So the opportunity for making things better with self-modifying code is limited to code where all those mechanisms don't work well, and the overhead of the runtime profiling is worth it.
Which is probably very rare, and not worth bringing in a JIT compiler for.
AOTs are behind JITs for dynamic languages. It’s super interesting to study how to make AOTs catch up in that space, so I’m glad that these folks made an effort and reported the results!
The trade offs between them are meaningful. Also Rust ain't bad for an AOT.