Performance of Rust Language [pdf]

81 points | by tanelpoder 12 hours ago

57 comments

jandrewrogers 8 hours ago
I would summarize it thusly: Rust is roughly as performant as C. This matches my experience and Rust is more ergonomic than C in many regards. The caveat is that modern C++ is notably more performant than C and by implication Rust. This also matches my experience for both C and Rust.
I think most of this is attributable to the ergonomics of compile-time expressiveness. C++ can effortlessly do things that require mountains of ugly boilerplate and macros in C or Rust. In principle they can express the same things but no one wants to write or deal with that ugly boilerplate so the equivalency is never realized in real code bases.
Zig is interesting because it slots in as a C-like language with a competent and expressive compile-time story. I don’t use Zig but I recognize its game.
- nsajko 25 minutes ago
  Julia is another contender. Julia code can be as performant as C++ code, but Julia code may be even more elegant than C++. Even without accounting for Julia's metaprogramming features, the compile-time expressiveness is top-notch.
  It shares some of the same drawbacks as C++, though. The language is extremely powerful, so while it is easy to write performant code, it is also easy for non-experts to write very suboptimal code.
- elcritch 2 hours ago
  Nim also has top notch meta programming, probably more so than Zig. You can easily do loop unrolling, specialization, etc. One example is Constantine, a constant time crypto library that outperforms C, etc.
  To me programming Rust feels so limiting due to lack of good compile time meta programming with types. That’s the key.
- tialaramex 14 minutes ago
  You say you are "summarizing" something but instead you seem to have just injected your opinion that C++ is "notably more performant than C and by implication Rust".
  It's true that you can express many things in C++ -- the problem is that the language deliberately doesn't distinguish whether the things you've expressed are nonsense, so you might well have written total nonsense and you only find out when, much later, diagnosing a real world event you discover oh, this is nonsense, why did this even compile? Well sorry, it was "more performant" to allow nonsense.
- vvanders 2 hours ago
  I'm surprised to see someone putting forth the argument that templates are easier to use than macros. I've found the opposite, and in many cases I've seen the monomorphization of templates explode code size, which has a fairly material impact on performance in my domains. Debugging macros with cargo expand is infinitely easier than debugging template errors.
  While you can write high performance C++ my experience is that many people will reach for shared_ptr and their like while Rust will force them into proper structure/ownership as Arc and their like have a lot higher friction.
- flohofwoe 5 hours ago
  > The caveat is that modern C++ is notably more performant than C and by implication Rust.
  This really needs more real-world evidence to back up the claim. In the end the important optimizations happen down in the Clang optimizer passes on the LLVM IR, and those optimizations are the same across C, C++, Rust (or Zig for that matter) - assuming of course that the optimizer can see all function bodies, which in C can be achieved via LTO or alternatively via 'unity builds'.
  If the output of one of those languages differs so much (on an LLVM-based compiler) that there are noticeable performance differences I would start investigating whether there's a compile/link setting missing somewhere instead.
  - logicchains 4 hours ago
    OP said "C++ can effortlessly do things that require mountains of ugly boilerplate and macros in C or Rust". In theory Rust can be as performant but some things are much less ergonomic to do in Rust macros than in C++ metaprogramming, so often end up not being done.
    - flohofwoe 3 hours ago
      Often that's also because the programmer doesn't know how the optimizer will help them to remove inactive code also in C code. As a simple example, when I have a 'general' bulk-getter function in C which returns a large struct with tons of values but the caller is only interested in one value, the compiler will 'collapse' the entire function call to a single memory access (if it can see the function body, but this is where LTO comes in), e.g.:
      https://www.godbolt.org/z/n3Y54Yhqr
      This is basically the gist of C++ 'zero cost abstraction', but C-style (the bulk of what enables C++ zero-cost-abstraction doesn't happen up in the language, but down in the optimization passes).
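      A rough Rust analogue of that pattern, as a sketch only: `State`, `Snapshot`, `snapshot` and `width_only` are hypothetical names and are not taken from the linked godbolt example. Whether a given build really collapses the call to a single load depends on inlining and optimization level, which the code itself cannot show.

      ```rust
      /// Hypothetical source of truth; field names are made up for this sketch.
      pub struct State {
          pub width: u32,
          pub height: u32,
          pub depth: u32,
          pub frame: u64,
      }

      /// Hypothetical "bulk" result that copies out every value at once.
      pub struct Snapshot {
          pub width: u32,
          pub height: u32,
          pub depth: u32,
          pub frame: u64,
      }

      /// General bulk getter: fills in all fields.
      #[inline]
      pub fn snapshot(state: &State) -> Snapshot {
          Snapshot {
              width: state.width,
              height: state.height,
              depth: state.depth,
              frame: state.frame,
          }
      }

      /// Caller that only needs one field. Once `snapshot` is inlined, the
      /// optimizer is free to drop the copies that feed the unused fields.
      pub fn width_only(state: &State) -> u32 {
          snapshot(state).width
      }
      ```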
- afdbcreid 8 hours ago
  Is C++ more performant than C? I find this hard to believe. C++ does not have any construct that cannot be replicated, or is not common, in C. The only candidate is using virtualization and void* pointers instead of monomorphized generics which some C code does for the lack of better options, but that's not a problem in Rust as well.
  If anything, Rust has the potential to be more performant than C due to its aliasing rules (C has `restrict` but it's rarely used, standard C++ does not have even that). The current perf stats show it does make Rust code faster but just a little bit, although we don't utilize the full optimization potential currently (LLVM does not do many possible optimizations here, and `noalias` is weaker than Rust's aliasing rules). It can also affect autovectorization, and if it does the effect could be dramatic.
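  A minimal sketch of the aliasing point, with a hypothetical function name `add_twice`. In safe Rust a `&mut i32` and a `&i32` can never refer to the same location, so the compiler is allowed to read `*src` once; the C equivalent with two plain `int *` parameters has to assume they might overlap unless `restrict` is used. Whether a particular rustc/LLVM version exploits this is not demonstrated here.

  ```rust
  /// `dst` is an exclusive reference and `src` a shared one, so they cannot
  /// alias. The compiler may keep `*src` in a register across both writes.
  pub fn add_twice(dst: &mut i32, src: &i32) {
      *dst += *src;
      *dst += *src;
  }
  ```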
  - jandrewrogers 7 hours ago
    Modern C++ metaprogramming materially impacts performance in practice. I’ve done performance engineering for decades in both C and modern C++ and I would assert that the difference isn’t arguable.
    The poor applicability of auto-vectorization is another area where C++ is strong. You can transparently codegen e.g. AVX512 from intrinsics directly in C++ in contexts that would be opaque to auto-vectorization and difficult to generalize in C. This allows you to get some degree of “auto-vectorization” where the compiler can’t see it because it works at the wrong level of abstraction.
    With sufficiently heroic efforts you can write C that matches the performance of C++. I’m not arguing that. Virtually no one writes C to that standard, including myself when I was writing high-performance C because the effort was too high, so it is a bit of a strawman.
    It is the difference between theory and practice. All code bases have a finite budget. C++ can do a lot more optimization in the same budget as C.
    - globalnode 6 hours ago
      So you're saying the metaprogramming facilities of C++ allow the compiler to better optimise high level human readable code more effectively than C. That's a fair point and one I'd never even thought of before, I always thought C was faster because of things like v-tables and all that stuff.
      - adrian_b 5 hours ago
        In C++, nobody would want or need to use virtual functions in high-performance computational applications, while in the C language structures with virtual function tables that are accessed explicitly by the programmer are in widespread use wherever suitable, for instance in many popular open-source C programs, like the Linux kernel or the debugger gdb.
        So the existence of virtual function tables is not a differentiator between C++ and C.
        The data types with virtual function tables are just the implementation method for sum types that is dual to tagged unions. Both virtual function tables and tagged unions can be implemented in C and in most other programming languages that do not have intrinsic support for them, but they require explicit boilerplate code for invoking the virtual functions or for testing the union tags.
        Which is the better of these 2 variants depends on the application. In high-performance computations, one does not use ambiguous data types, so normally none of these 2 is used. There are a few object-oriented programming languages where "everything is an object", i.e. any kind of data includes a virtual table pointer, but those are just incomplete programming languages, which do not have all the data types needed in practice, like also many early programming languages that had a unique data type, e.g. the original LISP I, which had only linked lists and no arrays, etc. C++ at least is a complete language, in which any kind of data type can be implemented, without overheads.
        As you said previously, C has few restrictions in what it can do, so in theory it is almost always possible to write a C program almost exactly equivalent with any program written in another language, matching its speed, even if that may require a significant reorganization of the code, not a line to line translation.
        Nevertheless, as the other poster said, the effort needed to write that equivalent program may be so high that it is not a realistic solution.
        So in practice it is not unusual that at similar programming efforts a higher-level language like C++ frequently allows writing a faster program than C.
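        The duality between tagged unions and virtual function tables described above can be sketched in Rust, where both have language support (`enum` with `match`, and `dyn Trait`). All type and function names below are hypothetical and chosen for illustration.

        ```rust
        use std::f64::consts::PI;

        /// Closed set of cases: a tagged union, dispatched by testing the tag.
        pub enum ShapeEnum {
            Circle { r: f64 },
            Rect { w: f64, h: f64 },
        }

        impl ShapeEnum {
            pub fn area(&self) -> f64 {
                match self {
                    ShapeEnum::Circle { r } => PI * r * r,
                    ShapeEnum::Rect { w, h } => w * h,
                }
            }
        }

        /// Open set of cases: a trait object carries a pointer to a vtable.
        pub trait Shape {
            fn area(&self) -> f64;
        }

        pub struct Circle {
            pub r: f64,
        }

        pub struct Rect {
            pub w: f64,
            pub h: f64,
        }

        impl Shape for Circle {
            fn area(&self) -> f64 {
                PI * self.r * self.r
            }
        }

        impl Shape for Rect {
            fn area(&self) -> f64 {
                self.w * self.h
            }
        }

        /// Sums areas by matching on the enum tag.
        pub fn total_area_enum(shapes: &[ShapeEnum]) -> f64 {
            shapes.iter().map(|s| s.area()).sum()
        }

        /// Sums areas through an indirect call per element.
        pub fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
            shapes.iter().map(|s| s.area()).sum()
        }
        ```

        Which variant is faster depends on the application, as the comment says; the sketch only shows that both express the same sum type.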
        flohofwoe 4 hours ago
        > while in the C language structures with virtual function tables that are accessed explicitly by the programmer are in widespread use wherever suitable, for instance in many popular open-source C programs, like the Linux kernel or the debugger gdb
        For dynamic dispatch there is absolutely no difference between using a jump table in C and virtual method tables in C++. If the compiler can infer the target address at compile time, it will not go through an indirect call, e.g.:
        https://www.godbolt.org/z/as8ehGhv3
        And for 'static dispatch' there's no difference between a C++ method call and a direct C function call (since for static dispatch the caller needs to 'know' the target function either way).
      - leonidasrup 5 hours ago
        For example, you can do loop unrolling using C++ template meta-programming.
        https://cpplove.blogspot.com/2012/07/a-generic-loop-unroller...
        Of course, nothing beats hand written ffmpeg-style assembly which takes into account optimal register allocation, instruction scheduling, cache alignment, etc. for specific processor architectures.
        jeffreygoesto 4 hours ago
        Careful. That article is from 2012 and compile time unrolling was more useful back then. Today it can actually be harmful as it hides strong hints about the loop from the optimizer. Our code that did this fared worse than a loop, because no optimizer-writer expected unrolled loops.
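        For comparison, a sketch of the "leave it as a loop" approach in Rust using const generics. The function name `dot` is hypothetical. The trip count `N` is a compile-time constant in every instantiation, so the optimizer keeps the loop structure and can decide for itself whether unrolling or vectorizing pays off; nothing here forces either.

        ```rust
        /// Dot product over fixed-size arrays. `N` is known at compile time,
        /// but the body stays an ordinary loop rather than a hand-unrolled one.
        pub fn dot<const N: usize>(a: &[f32; N], b: &[f32; N]) -> f32 {
            let mut acc = 0.0;
            for i in 0..N {
                acc += a[i] * b[i];
            }
            acc
        }
        ```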
      - nnevatie an hour ago
        Yes, to a degree. For example, if you look into Eigen, the math library, you'll notice that it's mostly header-only and heavily templated. Writing all that by hand in C would be possible, but incredibly time-consuming unless you'd rely on some pretty incredible macro-magistry.
      - swiftcoder 2 hours ago
        > So you're saying the metaprogramming facilities of C++ allow the compiler to better optimise high level human readable code more effectively than C.
        The metaprogramming facilities of C++ allow the programmer to more effectively optimise than they would have the patience to do in C.
        The compiler's own optimisations don't directly benefit from the metaprogramming facilities in this sense. What they do is let the programmer break high level constructs down to codegen that the compiler can optimise
        And you could do the same things by hand in C or Rust, but it would be tedious in the extreme, and you'd probably find yourself adopting external codegen tools
  - amelius 3 hours ago
    > I find this hard to believe. C++ does not have any construct that cannot be replicated, or is not common, in C.
    But this is not a valid argument, as all languages are Turing complete, and most modern languages can do low level stuff at optimum speeds. As an extreme example, in Java, you could just allocate a large chunk of memory and run an allocator inside of it and sidestep the GC entirely.
    With a programming language the question is thus not what can you do with it and how fast can it run with infinite effort, but what are the ergonomics, and what performance will you get in practice.
  - dwaite 3 hours ago
    > Is C++ more performant than C? I find this hard to believe.
    At the compiler level, no. But as you write projects, you will for instance run into things you can do with templates which are infeasible to attempt with macros.
    One example might be qsort() - a C compiler _could_ catch cases where it could create an intrinsic qsort based on the data type and function pointer being passed. However, in C++ you have the facilities to create a type safe, genericized sort that will be inlined based on the data structure.
  - loeg 8 hours ago
    In C++ you get templated generic algorithms that in practice no one really does with C because macros suck too much. So in C typically you'd have a runtime generic routine that doesn't inline. A classic example here is qsort() vs std::sort().
    - nicoburns 29 minutes ago
      Rust also has these advantages of course
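      A sketch of the qsort() vs std::sort() contrast in Rust terms. The names `insertion_sort_ptr`, `insertion_sort_generic` and `ascending` are hypothetical, and insertion sort is used only to keep the example short. The first routine takes the comparator as a function pointer (closer to qsort, though still generic over the element type); the second takes it as a type parameter, so each comparator gets its own monomorphized copy that can be inlined.

      ```rust
      use std::cmp::Ordering;

      /// qsort-style: the comparator is a function pointer, so the call is
      /// indirect unless the optimizer can see through it.
      pub fn insertion_sort_ptr<T>(v: &mut [T], cmp: fn(&T, &T) -> Ordering) {
          for i in 1..v.len() {
              let mut j = i;
              while j > 0 && cmp(&v[j - 1], &v[j]) == Ordering::Greater {
                  v.swap(j - 1, j);
                  j -= 1;
              }
          }
      }

      /// std::sort-style: `F` is a type parameter, so the comparator is part
      /// of the instantiated function and is a candidate for inlining.
      pub fn insertion_sort_generic<T, F>(v: &mut [T], mut cmp: F)
      where
          F: FnMut(&T, &T) -> Ordering,
      {
          for i in 1..v.len() {
              let mut j = i;
              while j > 0 && cmp(&v[j - 1], &v[j]) == Ordering::Greater {
                  v.swap(j - 1, j);
                  j -= 1;
              }
          }
      }

      /// Plain comparator usable with either routine.
      pub fn ascending(a: &i32, b: &i32) -> Ordering {
          a.cmp(b)
      }
      ```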
    - flohofwoe 5 hours ago
      > So in C typically you'd have a runtime generic routine that doesn't inline.
      With LTO you get many of the same advantages as C++ template code, there's nothing magic about C++ template optimizations, it's all about whether the compiler can see all function bodies in a call hierarchy.
      - simonask 3 hours ago
        LTO cannot change the layout of structs. For something like a hash map implementation, it matters whether inner nodes store the key and value inline, or whether they store a pointer to each. To achieve this in C, you have no other options than emulating templates using macros.
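        The layout difference can be sketched as follows. `InlineNode`, `ErasedNode` and `node_sizes` are hypothetical names; `ErasedNode` only approximates what a void*-based C container would store, and real hash map nodes carry more fields than this.

        ```rust
        use std::mem::size_of;

        /// Generic node: key and value live inside the node, so the layout
        /// is specialized for every `K`/`V` pair.
        pub struct InlineNode<K, V> {
            pub key: K,
            pub value: V,
        }

        /// Type-erased node: one pointer to the key and one to the value,
        /// whatever their types are. Every access costs an extra indirection.
        pub struct ErasedNode {
            pub key: *const u8,
            pub value: *const u8,
        }

        /// Returns (size of an inline u32/u32 node, size of an erased node).
        pub fn node_sizes() -> (usize, usize) {
            (size_of::<InlineNode<u32, u32>>(), size_of::<ErasedNode>())
        }
        ```

        On a typical 64-bit target the erased node is two pointers wide before counting the separately stored key and value, while the inline `u32`/`u32` node is just the two integers.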
        flohofwoe 3 hours ago
        The question is whether a hash-map implementation that works on a general `[key, index]` item and where index references a separate array of values isn't actually better for some access patterns ;)
        And of course the other alternative to macros is code-generation (but macros are actually often fine).
        But this also only matters for actually reusable generic code. If I'd implement a super-hot-path hashmap in C, I would stamp out a specialized version by hand instead of relying on a generic implementation. But for 90% of cases, a solution like in stb_ds.h is probably good enough.
    - afdbcreid 7 hours ago
      I explicitly acknowledged that:
      > The only candidate is using virtualization and void* pointers instead of monomorphized generics which some C code does for the lack of better options, but that's not a problem in Rust as well.
      But in fact, if speed is a concern to you, even in C you will use "templated" sorting (via macros or code generation).
      - 20k 7 hours ago
        The problem is that the implementation burden with C is so high, that people tend not to do it even in relatively performance constrained situations
      - loeg 5 hours ago
        > in practice no one really does with C because macros [and codegen] suck too much
      - fluffybucktsnek 6 hours ago
        Neither codegen nor macros (they are a part of the preprocessor) are really a part of C.
        For the latter, the lack of integration becomes more noticeable if you try writing a macro in which the compare param can accept a function identifier. As the preprocessor doesn't have the knowledge of the contents of the referred function, it can't inline it. In C++ and Rust, their compilers do, so they can.
        A codegen tool could overcome this, but you could also make a codegen tool to write Zig/D/C#/Swift in C, or any other language for that matter :). By this point, one could say you are programming in a superset of C, not strict C.
  - smallstepforman 7 hours ago
    C++ uses its rich type system to avoid aliasing when it can, as well as template meta programming.
    Eg: delete_scene(void *arg) vs delete_scene<T>(T *arg)
  - fithisux 7 hours ago
    You can write C style C++ and enjoy the same benefits.
    On Twitter a user explained to me that it is common in the embedded space.
    You do not need the OOP, RTTI, exceptions.
    Like C with most use cases of preprocessor replaced by generic programming.
    - afdbcreid 7 hours ago
      So? How is that an argument that C++ is more performant than C? It's only an argument that it's not less performant.
gobdovan 8 hours ago
Rust is in an awkward position of being already complicated enough that adding proofs for skipping bounds checks probably will not happen for a long time, even though this kind of low-level operation is where a lot of optimisation is lost.
Compounding on this, Rust is also unstable underneath, since there is no public, stable contract for carrying high-level semantics from HIR into MIR. Because these high-level invariants are lost during compilation, the compiler cannot easily use them to prove and eliminate low-level safety checks. But even if the frontend was perfect, Rust relies on LLVM's language-neutral SCEV, which operates purely on low-level math and cannot reason about high-level language semantics.
Ultimately, a lot of things would need to change for Rust to pay no performance for safety features.
- aw1621107 7 hours ago
  > Compounding on this, Rust is also unstable underneath, since there is no public, stable contract for carrying high-level semantics from HIR into MIR. Because these high-level invariants are lost during compilation
  Not sure if I'm just out of the loop, but I'm having a hard time following this line of reasoning. Why is a public and/or stable contract needed to carry high-level semantics from HIR to MIR? Neither seems necessary to me; from what I understand HIR and MIR are rustc-internal so public contracts shouldn't matter, and the lack of stability means the Rust devs aren't precluded by backwards compatibility from modifying the IRs to add the ability to carry such invariants.
  - gobdovan 6 hours ago
    Whoops! Although there is no public contract between HIR and MIR, the public part was not relevant here. What I wanted to highlight is that if they'd want to add proper proof machinery to eliminate low-level safety checks, they'd have to do it at: surface language, which is already complex enough; then HIR->MIR boundary with clean provenance (which I think current MIR would collapse too aggressively) and which may require a much clearer contract; then, even if they do the full front- and mid- ends properly, if you leave it up to LLVM, it ends up in SCEV, which is language neutral and would not be a good fit to support the proof obligations that would be specific to Rust.
    I dug up a proposal from 2021 around bounds check hoisting in MIR, and from the discussion, details are pretty thorny [0]. It's narrower than general proofs but the frictions are very similar. The easiest example that shows HIR -> MIR difficulties is the part around `for i in 0..32 { a[i] = 1; }`. At the source level the range fact is super obvious, but after the for-loop/iterator lowering the MIR optimiser has to recover that `i` comes from exactly that range before it can turn 32 checks into the one hot-path check. Then it also would have to check for panic strategy to maintain the correct behaviour after optimisation.
    [0] https://github.com/rust-lang/rust/issues/92327
    - nicoburns 25 minutes ago
      Of course you can write the above as:
      a[0..32].iter_mut().for_each(|el| *el = 1)
      and have per-iteration bounds checks elided in Rust today.
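      Spelled out as two functions for side-by-side comparison (a sketch; `fill_indexed` and `fill_sliced` are hypothetical names, and whether the checks in the indexed version are actually removed depends on the optimizer):

      ```rust
      /// Indexed loop: every `a[i]` is a checked access, and the optimizer
      /// has to prove `i < a.len()` on its own to drop the checks.
      pub fn fill_indexed(a: &mut [u8], n: usize) {
          for i in 0..n {
              a[i] = 1;
          }
      }

      /// Slice once, then iterate: the only range check is at `a[..n]`,
      /// and the iterator walks the sub-slice without indexing.
      pub fn fill_sliced(a: &mut [u8], n: usize) {
          a[..n].iter_mut().for_each(|el| *el = 1);
      }
      ```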
    - aw1621107 2 hours ago
      OK, I think that makes more sense. Thanks for taking the time to explain!
- afdbcreid 8 hours ago
  The overhead of bounds checking varies a lot. In the common case it's negligible (a few percent), but in some cases, depending on what you build, it can go up to even 20%. And if it prevents autovectorization it can cost even more.
  There are techniques to minimize the perf loss, though (safely), and of course you can use unsafe code. If you do it smartly, in the vast majority of cases bound checks do not matter (in fact, even in C++ there is a push for a hardened standard library that does bound checks, and e.g. Google uses that).
  Rust will never include full proofs, but it might include ranged integers which can minimize bound checks even more.
jarym 2 hours ago
I've been doing more and more Rust. Even with sccache the compile times are not great, so for any moderately sized codebase that requires frequent rebuilds I don't know how everyone else is doing it.
- wongarsu an hour ago
  I'd assume mostly by avoiding the need for frequent rebuilds. Incremental builds are pretty fast (at least fast enough for my needs on a moderate codebase), full rebuilds can be brutal
  There are also some optimization tricks related to how you split your code among crates, since a unit of compilation is mostly one crate. Putting your FFI code in a separate crate (-sys crates are the norm) and splitting some of your code in libraries that can be compiled in parallel are the common examples
- unsolved73 an hour ago
  the linking of the project can take more time than actual compilation.
  Use the lld linker instead of the default one, see https://kerkour.com/rust-production-checklist#use-the-lld-li...
Animats 8 hours ago
There's a discussion of "delayed bounds checking", but not "hoisted bounds checking", where bounds checking is done early. Consider
```
    let mut tab: [usize;100] = [0;100];
    ...
    for i in 0..101 {
        tab[i] = i;
    }
```
This must panic at i=100. Panic becomes inevitable at entry to the loop. Is the compiler entitled to generate a check that will panic at loop entry? The slides suggest that Rust does not hoist such checks, and, so, with nested loops, it has trouble getting checks out of the loop, which prevents vectorization.
- simonask 3 hours ago
  Panics in Rust do not currently time-travel like that (including panics from failed bounds checks), and that's a good thing. The reason is that panicking does not imply terminating the process - they can be caught and handled, just like exceptions in C++. In fact, they use the same stack unwinding mechanism by default.
  What the compiler is allowed to do is to shorten the loop by one and unconditionally panic after the loop, but this falls under the purview of the LLVM optimizer.
  - jojomodding 3 hours ago
    Once it shortens the loop, the compiler can also observe that `tab` is a local variable and therefore move the writes up "to the initializer." It can then see that the variable is unused and delete it, and also delete the loop.
  - edevrk an hour ago
    > Panics in Rust do not currently time-travel like that (including panics from failed bounds checks), and that's a good thing.
    Is it a "good thing" in all ways? Does it make Rust slower than C++ in some cases? And if Rust wants to catch up to C++ in regards to performance, do you believe that Rust people like yourself will have to convince C++ language people to remove time travel optimizations? Not very nice or honorable of you. Why not focus on making Rust faster instead?
    Edit:
    Why does kibwen sound like a cheap whore putting on airs? As if, kibwen couldn't be honest or competent even if its (pronoun: thing) life depended on it.
    - nicoburns 18 minutes ago
      A little slower but safe is a pretty good default I think. Most of the time you're not in a hot loop and even a 5% slowdown would be negligible.
      And in the cases where you are in a hot loop you just have to put in a little extra effort to optimise it and gain the performance back, either by writing the code in a way that allows the compiler to prove correctness (e.g. using an iterator or assert), or by using the unsafe keyword to "pinky-promise" to the compiler that your usage is correct.
      IME that extra effort in performance-critical places almost always ends up being a lot less than the effort needed to avoid correctness/safety issues in mundane boilerplate/glue/plumbing code in C++.
      Especially as Rust's package management system means that often you don't even have to do that optimisation work yourself: you can just pull in a crate that's done it for you (and Rust's safety guarantees make that a much less scary thing to do than it is in C++)
    - kibwen an hour ago
      C++'s experience has caused Rust to rightfully learn the lesson that you don't allow optimizations to change the semantics of the program like that. Rust's goal is to be fast enough that any performance difference between C or C++ is too negligible to bother considering, and it's achieved that. It's not going to sacrifice reliability on the altar just to make up a measly 3% gap. There are plenty of ways that Rust's stricter semantics allow it to produce faster code than C++ (no move constructors or implicit copy constructors, thorough reference aliasing information, automatic generic struct layout optimizations, safe non-atomic refcounting, safe concurrent stack references, less defensive copying, etc.), it does not need to "convince C++ language people" of anything.
      - edevrk 25 minutes ago
        Why do you sound like a cheap whore putting on airs? As if, you couldn't be honest or competent even if your life depended on it.
- afdbcreid 8 hours ago
  Currently LLVM cannot do that because the panic message includes the erroneous index. You can do it manually though if you add `_ = tab[100]`.
  Even if the panic message did not include the index, LLVM would be unable to do that if the previous iterations had side effects (for example if `tab` is not a local variable).
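  A sketch of hoisting the check by hand, and of why the compiler cannot do it silently. It uses an `assert!` on the length rather than the `_ = tab[100]` form, and writes `i + 1` instead of `i` so that every write is visible; the function names are hypothetical. Because a panic can be caught, the number of writes that happened before it is observable, and the two versions differ.

  ```rust
  use std::panic::{catch_unwind, AssertUnwindSafe};

  /// Checks stay inside the loop: writes happen until the first bad index.
  pub fn fill_checked_late(tab: &mut [usize], n: usize) {
      for i in 0..n {
          tab[i] = i + 1;
      }
  }

  /// Check hoisted by hand: panics before any write if `n` is out of range.
  pub fn fill_checked_early(tab: &mut [usize], n: usize) {
      assert!(n <= tab.len(), "n exceeds table length");
      for i in 0..n {
          tab[i] = i + 1;
      }
  }

  /// Runs `f` on a zeroed 100-element table, catches the panic (if any),
  /// and reports how many entries were written.
  pub fn writes_before_panic(f: fn(&mut [usize], usize), n: usize) -> usize {
      let mut tab = [0usize; 100];
      let _ = catch_unwind(AssertUnwindSafe(|| f(&mut tab, n)));
      tab.iter().filter(|&&x| x != 0).count()
  }
  ```

  With `n = 101` the late version fills the whole table before panicking and the early version writes nothing, which is exactly the semantic difference an automatic hoist would introduce.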
encodedrose 10 hours ago
If I followed, Rust's memory safety guarantee means sacrificing roughly 3% performance, with some worst case paths being ~15% (compared to C++ performance)?
- marcosdumay 9 hours ago
  That's in line with the typical cost of bounds checking in C too.
  But no, "memory safety" includes most of the things discussed in the slides, and those numbers are for bounds checking only.
  - encodedrose 7 hours ago
    Ah, I was using GH's webui instead of downloading to view the PDF and it stopped loading at slide#47...rereading it now paints a much better picture. Thanks!
Panzerschrek 5 hours ago
For a couple of years I have written an advanced software rasterizer (like in old PC games) using Rust. With a little bit of unsafe code it was doable and the resulting performance was great. I only used unsafe in places mentioned in the article above, like in tight loops where the compiler's optimizer struggles to remove bounds checks and in a couple of places where CPU intrinsics were used.
suis_siva 5 hours ago
I worked professionally with C, C++, Zig and Rust (in that order). My experience is that writing performant code is by far the easiest in C++, and by far the most difficult in C. Most of this, in practical experience, is due to ergonomics, in my opinion.
Templates in C++ benefit from being part of the core language, -- stick a `template` above your `class`, and you're in metaprogramming land. Stick a template specialization, and you've done a niche optimization. You didn't need a separate crate or a whole macro DSL. Variadic templates are also really really nice for monomorphizing N-ary generic functions. The duck typing of templates makes
This is precisely where I struggle with Rust the most -- monomorphization is limited within generics, so you end up going to the `proc_macro` hell, which involves a separate crate, a separate Cargo.toml, etc.
Zig seems like it would fit the bill -- and doing micro-optimizations within zig is surprisingly easy. The language's comptime facilities allow for really good niche optimizations -- however, the language also has some strange decisions. The allocator interface is notoriously a vtable, so a lot of the DOD optimizations that andrewrk has spoken of numerous times (and to be clear -- I did learn a lot about DOD from his talks back when I was a wee engineer), raise one of my eyebrows.
C seems like it should be fast, but implementing any data structure, any generic algorithm in C is impossible. Either you're copy-pasting, or you're making macro DSLs. None of which is great.
---
To further talk about the C++ situation -- the monomorphic allocator interface was always awesome. Compared to Zig's vtables and Rust's nothing (up until a couple days ago), having a way to pass custom allocators with types was awesome. The new std::pmr::* interfaces and containers are also really exciting -- monomorphization, as beautiful as it is, does cost a lot -- refactoring it is not easy, compilation times are a mess. Sometimes the right tool is a vtable interface, and, C++ gives you those facilities.
And this is C++'s no1 problem when it comes to performance too -- it's a leviathan -- it'll give you the tools to write REALLY fast code, but it will also give you inheritance -- forget about your caches then.
When I was working at Tesla, there were some pretty gnarly vtable jumps in firmware (of all places), and I suspect part of that could've been alleviated if people knew more about CRTP.
So, here's where I land -- C++ really will give you the tools it can to let you write the fastest code possible. But it will also give you the tools to make your code really slow. Committee language means everyone in the committee needs to be happy.
Rust, on the other hand, is really designed to promote safe-but-very-fast practices -- had the firmware that I discussed used Rust, my guess is that we would've gravitated towards generics and monomorphization, rather than the heavy dynamic inheritance. C++, when it comes to performance, as it does to all other things, is a barreled shotgun. Rust's design almost always promotes the best available pattern and that's why I rarely reach out for C++.
smasher164 5 hours ago
You end up needing something like refinement types to control the way you statically enforce bounds. That being said, there's stuff like https://flux-rs.github.io/flux/ which uses macros to layer a refinement type system on top of rust's. You can use it to statically eliminate bounds checks.
_alphageek 8 hours ago
I would have liked to have the checks-off delta plotted across rustc versions - the deck notes this stuff moves non-monotonically, so a trend line would say more than a single-version snapshot.
ozgrakkurt 3 hours ago
There is no performance of language, it is very dependent on the compiler in any language. I don’t think even clang/gcc can “fully” optimize C
DeathArrow 5 hours ago
I was looking at Zig. It's performant, and it's easier to reason about Zig code than Rust code, but its API is unstable and there are a lot of breaking changes. Coding agents have a difficult time writing proper Zig because of the breaking changes and the small amount of new Zig code in the wild.