RISC-V is the better sweet spot, and has no strong IP locks like ARM or x86_64. Not to mention that today's silicon changes everything: you can avoid design complexity as much as possible, since the result will still be more than performant enough for the bulk of the programs out there.
That discussion is five years old and a lot has happened in the RISC-V world since.
64-bit RISC-V finally achieved feature-parity with 64-bit ARM, with the RVA22 and RVA23 profiles. (There are no RVA23-compliant chips taped out yet, but several are expected early next year)
RISC-V's core was intentionally made small, to be useful for teaching and for allowing very tiny cores to be made for embedded systems. The extensibility has resulted in a few trade-offs that are different from ARM and can definitely be discussed, but the extensibility is also one of the RISC-V ecosystem's strengths: Embedded chips can contain just the extensions that are needed. Proprietary extensions are also allowed, which have been used to prototype and evaluate them when developing official extensions.
For any fair comparison between ARM and RISC-V, you should compare the right ARM ISA against the right RISC-V ISA.
ARM Cortex-M0 against RV32IC, ARMv9 AArch64 against RVA23, etc.
How do I do an "add with carry" (or subtract with carry/borrow) on RISC-V? For this, of course, the addition has to set a carry flag, and the subtraction has to set either a carry or a borrow flag.
This feature is very important for a high-performance implementation of arbitrary-precision arithmetic.
Yes yes that's the other widely quoted criticism of RISC-V, from a GNU MP maintainer.
At the time there was no widely-available RISC-V hardware. There is now, and earlier this year I tested it.
It turns out that the GNU MP project's own benchmark runs better on SiFive RISC-V cores than on comparable µarch Arm cores, specifically better on U74 than on A53, and better on P550 than on A72, despite (obviously) the Arm cores having carry flags and ADC instructions and the RISC-V cores not having those.
The ensuing discussion also came to a consensus that once you get to very wide cores e.g. 8+ like Apple's M1 and several upcoming RISC-V cores, carry flags are ACTIVELY BAD because they serialise the computation (even with OoO, renaming the flags register etc) while only one 64 bit A + 64 bit B + 1 bit carry-in limb computation in 18,446,744,073,709,551,616 has an actual carry-out dependency on the carry-in, so you can almost always simply add all the limbs in parallel. The carry-in only affects carry-out when A+B is exactly 0xFFFFFFFFFFFFFFFF.
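To make that concrete, here is a minimal C sketch (not GMP's actual code; the function name is invented) of the flag-free limb addition a compiler emits for a carry-less ISA such as RISC-V, recovering each carry with an unsigned comparison (the `sltu` idiom):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical illustration: r = a + b over n 64-bit limbs, with no
       carry flag.  Each carry is recovered by an unsigned compare, which
       on RV64 typically compiles to a single sltu. */
    static void add_limbs(uint64_t *r, const uint64_t *a,
                          const uint64_t *b, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s  = a[i] + carry;  /* add incoming carry (0 or 1)      */
            uint64_t c1 = s < a[i];      /* wrapped only if a[i] was the max */
            uint64_t t  = s + b[i];      /* add the other limb               */
            uint64_t c2 = t < b[i];      /* wrapped iff this add overflowed  */
            r[i] = t;
            carry = c1 | c2;             /* c1 and c2 are never both 1       */
        }
    }

The loop-carried `carry` is exactly the serial dependency described above; the observation is that a wide out-of-order core can compute the per-limb sums and their carry-outs essentially independently, because the incoming carry changes the carry-out only in the all-ones case mentioned in the previous paragraph.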
A quick Google search for the number of processors shipped in 2024 shows that nVidia shipping one billion riscv _cores_ last year was pretty big news. Assuming they were one fourth of the market, and that there was just one core per chip (certainly not true), that's four billion cores.
Meanwhile arm licensees shipped twenty-nine billion chips. Most with four cores, and some with many more.
Riscv is on the way up, but any death sentence for arm is decades away.
Qualcomm is rumored to be paying 2-2.5% per chip instead of the 5-5.5% ARM sued them for (some sources claim it was 6-10%).
No matter how you slice up their handset and mobile divisions profits, the current cost is billions just to license an ISA.
Qualcomm could likely pay for most of their chip design costs with the amount of money they’re giving ARM. When contract renewal comes up, Qualcomm will be even more incentivized to move to RISC-V.
Apple has apparently been moving on-chip soft cores to RISC-V. At some point (probably around renewal time), they’re likely to want to save money by switching to RISC-V too.
There are even rumors of ARM working on RISC-V cores (I’d guess smaller cores to compete with SiFive).
There’s an economic incentive to change and the only blocker is ARM inertia, but making ARM emulate quickly on RISC-V is almost certainly easier than making x86 and all its weirdness/edge cases run on ARM.
Once the software is in place (getting close) and a competitive RISC-V phone chip exists, I suspect the conversion will be very fast.
Most architectures never actually die. New architectures just snap up the balance of the new designs. That said, a competitive license-free ISA is quite attractive to a lot of people, and once the snowball gets moving it will grow in size and pick up speed.
> Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
This is taken straight from the ISA specification.
For example, Intel can brand their own chips as RISC-V-ZIntelx8664 if they want to, because a RISC-V Zxxx implementation can be as incompatible with any other RISC-V implementation (and even with the specification) as one wants.
>RISC-V-Zxxx implementation can be as incompatible with any other RISC-V implementations (and even specification) as one wants.
It can't. Extensions have a lot of freedom, but still have to follow the core spec. You can't just submit amd64 as a riscv extension for a myriad of reasons, it clearly conflicts with the base ISA.
This is also why there is a trademark. If you or Intel wanted to deliberately create nonsense to be disruptive, you are allowed to create it, but then you can't call it RISC-V without receiving some fan-mail from lawyers.
...although you could certainly implement a Jazelle-style "branch to x86" instruction if you really wanted. It's fine as long as you come up in RVI mode.
Unfortunately you would not be able to repurpose the LSB of jalr targets as an "x86 bit" à la Thumb because the standard requires you to ignore that bit, though of course you could enable that behaviour with a custom CSR.
Needs a [1991] tag.
Needless to say, in the 34 years since that article was published, a lot has changed. Thanks to massively increased transistor budgets, a more complex decoder with accompanying microcode ROM that might have been a big detriment in 1991 would today be a small speck of dust on the processor floor plan. At the same time, memory access performance hasn't increased to the same extent as compute performance, thus putting a relatively bigger emphasis on code density.
All this being said, RISC "won" in the sense that many RISC principles have become the "standard" principles of designing an ISA. Still, choosing "RISC purity" over code density is arguably the wrong choice. Contemporary high performance RISC architectures (ARMv9, say) are very un-RISC in the sense of having a zillion different instructions, somewhat complex addressing modes, and so forth.
Exactly. If you’re going to design a new ISA, you’d be foolish to make it a classical CISC design and would definitely choose RISC (e.g., RISC-V). But if you have a CISC ISA and you want to keep it running fast, then a virtually unlimited transistor budget allows you to create a sophisticated decoder that dispatches micro-ops to a RISC-like core and bridge the gap. That paper really takes me back to working on PA-RISC designs at HP during that timeframe.
This seems to have its own issues and the proof is in the final cores.
ARM’s entire gross profits are about half of AMD’s R&D budget, but ARM cores have soundly beaten x86 in IPC for years now (since around the A78), and the most recent generations seem to be beating them in total performance, perf/watt, and core size.
We now have all three of the big ARM-based core designers (Apple, ARM, and Qualcomm) beating x86. Apple you could maybe write off as unlimited money, but all three isn’t just coincidence.
If that weren’t enough, ARM designers are releasing new cores every year instead of every other year, meaning they are doing around twice as many layouts and validations despite the massively lower budget.
Before I get the “ARM only makes cores” excuses, I’d note that ARM announced that they’ve been working on their own server chips and that work is obviously having to fit within their same (comparatively tiny) budget.
It seems fairly obvious. Supporting legacy garbage drives up complexity and cost (e.g., ARM reduced A-series decoder size by 75% when they dropped 32-bit mode, and even that decoder was way less complex than x86's). This complexity drives up development time and cost. It also drives up validation cost and time.
There’s also a physical cost. A large, high-frequency uop cache and its controllers are better than relying on the x86 decoders alone, but not needing them at all is better still. Likewise, you hear crazy stuff like the x86 overly-strict memory model not mattering because you can speculate it away. That speculation means more complexity, more power, and more area.
Once you’re done with enough of these work-arounds, you get a chip that is technically as fast, but it costs more to design, costs more to validate, costs more to fab, costs more to buy, costs more to operate, and carries an opportunity cost from taking so much longer to get to market.
> ARM reduced A-series decoder size by 75% when they dropped 32-bit mode
Interesting. Got a source for that?
https://fuse.wikichip.org/news/6853/arm-introduces-the-corte...
It was one of the biggest features of A715 going by ARM’s slides.
I think you’re just making my point.
> you’d be foolish to make it a classical CISC design and would definitely choose RISC
I think that's arguable, honestly. Or if not it hinges on quibbling over "classic".
There is a lot of code density advantage to special-case CISCy instructions: multiply+add and multiplex are obvious ones in the compute world, which need to take three inputs and don't fit within classic ALU pipeline models. (You either need to wire 50% more register wires into every ALU and add an extra read port to the register file, or have a decode stage that recognizes the "special" instructions to route them to a special unit -- very "Complex" for a RISC instruction.)
But also just x86 CALL/RET, which combine arithmetic on the stack pointer, computation of a return address and store/load of the results to memory, are a big win (well, where not disallowed due to spectre/meltdown mitigations). ARM32 has its ldm/stm instructions which are big code size advantages too. Hardware-managed stack management a-la SPARC and ia64 was also a win for similar reasons, and still exists in a few areas (Xtensa has a similar register window design and is a dominant player in the DSP space).
The idea of making all access to registers and memory be cleanly symmetric is obviously good for a very constrained chip (and its designers). But actual code in the real world makes very asymmetric use of those resources to conform to oddball but very common use cases like "C function call" or "Matrix Inversion" and aiming your hardware at that isn't necessarily "bad" design either.
I’m talking about a VAX-like system with large instructions, microcode-based, etc. In the same way that CISC adapted since 1990, RISC has also adapted to add “complex” instructions where they are justified (e.g. SIMD/vector, crypto, hashing, more fancy addressing modes, acceleration for tensor processing, etc.). Nothing is a pure play anymore, but I’d still argue that new designs are better off starting on the RISC side of the (now very muddled) line, rather than the CISC side.
Right, that's "quibbling over 'classic'". You said "you'd be foolish to design CISC" meaning the hardware design paradigm of the late 1970's. I (and probably others) took it to mean the instruction set. Your definition would make a Zen5 or Arrow Lake box "RISC", which seems more confusing than helpful.
Well, you would be foolish to design the x86 now if you had a choice in the matter. Zen5 is a RISC at heart, with a really sophisticated decoder wrapped around it. Nobody uses x86 or keeps it moving forward because it’s the best instruction set architecture. You do it because it runs all the old code and it’s still fast, if a bit power hungry. BTW, ditto with IBM Z-series.
It’s obvious that if you’re designing a clean-sheet ISA, you wouldn’t choose to copy an existing ISA that has accumulated several decades of backwards-compatible instructions (i.e. x86); you’d rather start fresh.
That says nothing about whether you should opt for something that is more similar in nature to classic RISC or classic CISC.
OK, fair enough that one doesn’t necessarily imply the other, but if you were designing a CPU/ISA today, would you start with a CISC design? If so, why?
I do wonder if CISC might make a comeback. One reason is cache efficiency.
If you are doing many-core (NOT multi-core), then RISC makes obvious sense.
If you are doing in-order pipelined execution, then RISC makes sense.
But if you are doing superscalar out-of-order, where you have multiple execution units and you crack instructions anyway, why not have CISC so that you have more micro-ops to reorder and optimise? It seems like it would give the schedulers more flexibility to keep the pipelines fed.
With most infrastructure now open-source, I think the penalty for introducing a new ISA is a lot less burdensome. If you port LLVM, GCC, and the JVM, most businesses could use it in production immediately without needing the sort of emulation that helped doom the Itanium.
> x86 CALL/RET
I wonder if anyone has actually measured what the code size savings from this look like for typical programs, that would be an interesting read. RISC trope is to expose a "link register" and expect the programmer to manage storage for a return address, but if call/ret manage this for you auto-magically you're at least saving some space whenever dealing with non-leaf functions.
A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.
On arm64, all instructions are four bytes. The BL and RET to effect the branching are 8 bytes of instruction already. Plus non-leaf functions need to push and pop the return address via some means (which generally depends on what the surrounding code is doing, so isn't a fixed cost).
Obviously making that work requires not just the parallel dispatch for all the individual bits, but a stack engine in front of the cache that can remember what it was doing. Not free. But it's 100% a big win in cache footprint.
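As a tiny illustration (the function names here are invented), take a non-leaf function: on amd64 the return address is pushed and popped implicitly by CALL/RET, while on arm64 the callee has to spill and reload the link register itself, typically with an stp/ldp of x29/x30 in the prologue and epilogue:

    extern long helper(long);

    /* Non-leaf: it calls something else, so its own return address must
       survive that call.  amd64: CALL/RET keep it on the stack implicitly.
       arm64: the compiler emits explicit saves/restores of x30 (the link
       register), e.g. stp x29, x30, [sp, #-16]! on entry and ldp ... / ret
       on exit. */
    long outer(long x) {
        return helper(x) + 1;
    }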
Yeah totally. It's really easy to forget about the fact that x86 is abstracting a lot of stack operations away from you (and obviously that's part of why it's a useful abstraction!).
> A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.
True for `ret`, I'm not convinced it's true for `call` on typical amd64 code. The vast majority I see are 5 bytes for a regular call, with a significant number of 6 bytes e.g. `call 0xa4b4b(%rip)` or 7 bytes if relative to a hi register. And a few 2 bytes if indirect via a lo register e.g. `call %rax` or 3 for e.g. `call *%r8`.
But mostly 5 bytes, while virtually all calls on arm64 and riscv64 are 4 bytes with an occasional call needing an extra `adrp` or `lui/auipc` to give ±2 GB range.
But in any case, it is indisputable that on average, for real-world programs, fixed-length 4-byte arm64 matches 1-15 byte variable-length amd64 in code density, and both are significantly beaten by two-length (2- and 4-byte) riscv64.
All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.
> All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.
(I cheated a bit and used the total size of the binary, as binutils isn't available out of the box in the ubuntu container. But it shouldn't be too different from text+bss+data.)
$ podman run --platform=linux/amd64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
22629493
$ podman run --platform=linux/arm64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
29173962
$ podman run --platform=linux/riscv64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
22677127
One can see that amd64 and riscv64 are actually very close, with in fact a slight edge to amd64. Both are far ahead of arm64 though.
ARMv9 also has read-modify-write memory instructions, so does any usable RISC-V implementation. It turns out that LL-SC (which would avoid those) does not permit efficient implementations. (LL-SC does look like a rather desperate attempt to preserve a pure RISC register-register architecture.)
I get the impression people believe that instruction density does not matter much in practice (at least for large cores). For example, x86-64 compilers generally prefer the longer VEX encoding (even in contexts where it does not help to avoid moves or transition penalties), or do not implement passes to avoid redundant REX prefixes.
LL/SC is performant, it just doesn't scale to high core counts.
The VEX encoding is actually only rarely longer than the legacy one, and frequently it is shorter.
RISC is more about the time each instruction takes than about how many instructions there are, because consistent timing reduces bubbles and complexity. In this sense, RISC has completely won. The complexity of new instructions in the main pipeline is very restricted by this limitation, and ISAs like x86 break down complex instructions into multiple small instructions before pushing them through.
ARMv9 still has very few instruction modes with far less complexity when you compare it with x86 or some other classic CISC ISA.
> memory access performance hasn't increased to the same extent as compute performance, thus putting a relatively bigger emphasis on code density.
The problem isn't RAM. The problem is that (generally speaking) cache is either big or fast. x86 was stuck at 32 KB for a decade or so. Only recently have we seen larger 64 KB caches included. Higher code density means more cache hits. This is the big reason to care about code density in modern CPUs.
RISC-V shows that you can remain RISC and still have great code density. Despite arguably making some bad/terrible decisions about the C instructions, RISC-V still generally beats x86 in code density by a significant margin (and growing as they add new instructions for some common cases).
Not really. A RISC design can have very complex timing and pipelines, with instructions converted to uOPs (and fused to uOPs!) just like X86: https://dougallj.github.io/applecpu/firestorm.html
Caches can be fast and very expensive (in area & power)! I have an HP PA-RISC 8900 with 768KB I&D caches. They are relatively fast and low latency, given the time-frame of their design. They also take up over half the die area.
I don't know how this has anything to do with what I said.
The original intent of uops in x86 was to break more complex instructions down into more simple instructions so the main pipeline wasn't super-variable.
If you look at new designs like M-series (or even x86 designs), they try very hard to ensure each instruction/uop retires in a uniform number of cycles (I've read[0] that even some division is done in just two cycles) to keep the pipeline busy and reduce the amount of timings that have to be tracked through the system. There are certainly instructions that take multiple cycles, but those are going to take the longer secondary pipelines and if there is a hard dependency, they will cause bubbles and stalls.
> thus putting a relatively bigger emphasis on code density.
> choosing "RISC purity" over code density is arguably the wrong choice
You appear to be under the incorrect impression that CISC code is more dense than RISC code.
This seems to be a common belief, apparently based on the idea that a highly variable-length ISA can be Huffman encoded, with more common operations being given shorter opcodes. This turns out not to be the case with any common CISC ISA. Rather, the simpler less flexible operations are given shorter opcodes, and that is a very different thing. A lot of the 8 bit instructions in x86 are wasted on operations that are seldom or never used and that could, even in 1976, have safely been hidden in some secondary code page.
The densest common 32 bit ISAs are Arm Thumb2 and RISC-V with the C extension. Both of them have two instruction lengths, 2 bytes and 4 bytes, as did many historical RISC or RISC-like machines including CDC6600 (15 bits and 30 bits), Cray 1, the first version of IBM 801, Berkeley RISC-II.
The idea that RISC means only a single instruction length is historically true only for ISAs introduced in the brief period between 1985 (Arm, SPARC, MIPS) and 1992 (Alpha) out of the 60 year span of RISC-like design (CDC6600 1964, the fastest supercomputer of its time). And, as an outlier, Arm64 (2011), which I think will come to be recognised as a mistake -- they thought Amd64 was the competition they had to match for code density (and they did) but failed to anticipate RISC-V.
In 64 bit, RISC-V is by far the densest ISA.
> Contemporary high performance RISC architectures (ARMv9, say) are very un-RISC in the sense of having a zillion different instructions, somewhat complex addressing modes, and so forth.
Yes, ARMv8/9-A is complex. However there is no evidence that it is higher performance than RISC-V in comparable µarches and process nodes. On the contrary, other than their lack of SIMD, SiFive's U74 and P550 are faster than Arm's A53/A55 and A72, respectively. This appears to continue for more recent cores, but we don't yet have purchasable hardware to prove it. That should change in 2026, with at least Tenstorrent shipping RISC-V equivalent to Apple's M1.
ARM64 has a trick up its sleeve: many instructions that would be longer on other architectures are instead split into easily recognisable pairs on ARM64. This allows simple implementations to pretend it's fixed length, while more complex ones can pretend it's variable length. SVE takes this one step further with MOVPRFX, which can be placed before almost all SVE instructions to supply masking and a third operand.
> All this being said, RISC "won" in the sense that many RISC principles have become the "standard" principles of designing an ISA.
I disagree. Many RISC chip design ideas may have taken over, but only because there are massive transistor budgets. I'd like to see a RISC chip that actually has a basic instruction set. As in, not having media instructions, SIMD instructions, crypto primitives, etc. If anything, Moore's Law won, the RISC v CISC battle became meaningless, and designers can just spend transistors to make every instruction faster if they care to.
RISC vs CISC is nonsense.
Pre-RISC CPU designs were pragmatic responses to the design constraints of their time. (expensive memory, poor compilers)
RISC was a pragmatic response to the design constraints of its time. (memory becomes less expensive, transistor budgets are tight, and compilers are a little better)
Post-RISC designs of today are, also, only pragmatic responses to the design constraints of today. Those constraints are different than they were in the 80s and 90s.
The supposed dichotomy is just utter horseshit. It was invented as a marketing campaign to sell CPUs. It was canonized by the most popular text books being written by /Team RISC/.
I wish, as an industry, we’d just get over it, move on, and stop talking about it so much.
Yep. And then we learned that transistors were almost free, could be purchased in lot quantities of 1 billion, and could be used to create a translation layer between a CISC instruction stream and a RISC core.
> and a RISC core.
not actually a RISC core. µops are simpler than the externally visible instruction set but they are not RISC.
(There were early x86 implementations that really were an x86 decoder bolted onto a preexisting RISC design. They didn't really perform well.)
Turns out it's useful to have memory operands -- even read-modify-write operands -- in your µops. Turns out it's useful to have instructions that are wider than 32 bits. Turns out it's useful to have big literals, even if the immediate field has to be shared by a whole group of 3-4 µops (so only one of them can have a big literal). Flags are also not necessarily the problem RISC people said they were (and several RISCs actually do have flags despite the Computer Architecture 101 dogmas). Having just a reg+ofs address mode turns out to be a bad idea.
Indirect addressing à la PDP-11, VAX, and 68K (and many long forgotten architectures) turned out not to be a good idea, of course.
Sure, it’s not coded the same way as your external ISA does it (like you said, lots of wide information), but the core is optimized for a load/store architecture with ops that generally execute in a single pipeline stage, and the big, complex instructions in the ISA map to multiple smaller micro-ops. But yes, it is NOT literally an x86 decoder wrapped around a MIPS or ARM core.
No. NOT a load/store architecture. People tried that in the 90's for x86. Doesn't work nearly as well as keeping µops generally the same as the simpler move/ALU instructions of the external ISA, just encoded differently. That's what modern x86 uses, that's what modern z/Arch ("S/360-31-64++") uses.
x86 and z/Arch instructions for move/ALU/jmp/conditional branch are fine. They are not hard to decode and they are not hard to execute. That's the core of the µop instruction set (just encoded differently). Then they add some specialty stuff necessary to implement more complex instructions using sequences of µops -- and of course the SIMD stuff. That's it. It's a simpler version of the external ISA, NOT a RISC.
Part of the path the µops take is wide, possibly with the option of a shared field for a large immediate.
Instructions that map to short, fixed-length µop sequences are handled directly by the decoders, which spit out a wide chunk of µops. Longer or variable-length µop sequences are handled by having the decoders emit an index into a ROM of wide µop chunks. The decoders can often spit out both the first wide chunk AND the index (so there's no need to wait a cycle for the ROM to emit µops).
Multi-cycle µops are not much of a problem, as long as the cycle count is predictable, preferably statically predictable. It is common to have µops that are "multi-issue", for example if they involve memory operands.
Hm. Maybe I’m using terms incorrectly or my model for what is happening is incorrect. Please educate me.
“NOT a load/store architecture.” What do you mean by this, exactly?
“It's a simpler version of the external ISA, NOT a RISC.” How are you defining “RISC,” exactly? What makes it not a RISC, given that you’re also saying it’s simpler?
Transistors aren’t free. x86 cores are all bigger than their ARM competitors even when on the same node while also getting worse performance per watt.
The translation layers cost time and money to build, which could be spent making the rest of the chip faster. They suck up extra die area and use power.
ARM’s total income in 2024 was half of AMD’s R&D budget, but the core they finished that year can get x86 desktop levels of performance in a phone.
The idea that there’s no cost to x86 just doesn’t seem to hold up under even mild scrutiny.
Yes, that’s why I said almost free. But that said, as I understand it, the x86 decoders aren’t that much of the die area of a modern design. Most of it is L1 cache, GPUs, neural engines, etc., which makes simple die area comparisons of modern processors a bit useless for this particular question. You’re really comparing all that other stuff. To be clear, I’m squarely on the RISC side of any “debate,” but it was interesting to watch how the CISC designs evolved in the early 2000s to maintain their relevance.
The “no cost” is an evidence free zone.
ARM shed 75% of their decoder size when removing support for the 32-bit ISA in their A-series cores, and even that 32-bit decoder was nowhere near as bad as x86's.
A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder. That’s 22% of total power. Even if you say that’s the high-water mark and it’s usually half of that, it’s still a massive consideration.
If decoder power and size weren’t an issue, they’d have moved past 4-wide decode years ago. Instead, Golden Cove only recently went 6-wide and power consumption went through the roof. Meanwhile, AMD went with a ludicrously complex double 4-wide decoder setup that limits throughput under some circumstances and creates all kinds of headaches and hazards that must be dealt with.
Nobody would do this unless the cost of scaling x86 decode was immense. Instead, they’d just slap on another decoder like ARM or RISC-V does.
> A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder.
You are mixing studies; the Haskell paper only looked at total core power (and checked how it was impacted by algorithms). It was this [1] study that provided the 4.8w out of 22.1w number.
And while it's an interesting number, it says nothing about the overhead of decoding x86. They chose to label that component "instruction decoder" but really they were measuring the whole process of fetching instructions from anywhere other than the μop cache.
That 4.8w number includes the branch predictor, the L1 instruction cache, potentially TLB misses and instruction fetch from L2/L3. And depending on the details of their regression analysis, it might also include things after instruction decoding, like register renaming and move elimination. Given that they show the L1 data cache as using 3.8w, it's entirely possible that the L1 instruction cache actually makes up most of the 4.8w number.
So we really can't use this paper to say anything about the overhead of decoding x86, because that's not what it's measuring.
[1] https://www.usenix.org/system/files/conference/cooldc16/cool...
I’m not sure what point you’re trying to get me to concede. I’ve already stated that I’m on the RISC side of the debate. If your point is that it’s difficult to keep doing this with x86, I won’t argue with you. But that said, the x86 teams have kept up remarkably well so far. How far can they keep going? I don’t know. They’re already a lot further than everyone predicted they’d be a couple decades ago. Nearly free transistors, even if not fully free, are quite useful, it turns out.
We'll see. Free transistors are over. Cost per transistor has been stagnant or slightly increasing since 28nm.
https://www.semiconductor-digest.com/moores-law-indeed-stopp...
Sure, they’re getting less free and they were never completely free in any case. But that’s a straw man that nobody was trying to argue. So (again), what’s your point?
This is not really true. For example, the Apple M1 has Firestorm cores that are around 3.75mm^2, while a Zen 3 core from around the same era is just 3mm^2 (roughly).
I can make a silly 20-wide ARM CPU and a silly 1-wide x86 CPU and the x86 will be smaller by a lot.
Where did you get your numbers?
The M1 core is 2.28mm2 and the Zen 3 core is 4.05mm2 if you count all the core-specific stuff (and 3.09mm2 even if you exclude the power regulation and the L2 cache that only this core can use). That makes the Zen 3 core 36-78% larger for generally worse performance (and all-around worse performance if per-core power is constrained to something realistic). I'd also note that Oryon and C1-Ultra seem to be much more area efficient than more recent Apple cores.
We're at a point where Apple's E-cores are getting close to Zen or Cove cores in IPC while using just 0.6w at 2.6GHz.
If you count EVERYTHING outside of the cores themselves (power delivery, matrix co-processor, last-level cache, shared logic, etc.) and average it out, we get 15.555mm2 / 4 = 3.89mm2 per core for M1.
If we do the same for Zen 3 (excluding test logic and Infinity Fabric), we get 67.85mm2 / 8 = 8.48mm2 per core.
M1 has 12MB of last-level cache (coherent), or 3MB per core, while Zen 3 has 4MB of coherent L2 and 32MB of victim cache (used to buffer IO/RAM reads/writes and to hold evicted lines in the hope that they get reused). You can analyze this however you like, but M1's cache design is more efficient and gives a higher hit rate despite being smaller. Chips and Cheese has an interesting writeup of this as applied to Golden Cove.
https://semianalysis.com/2022/06/10/apple-m2-die-shot-and-ar...
https://x.com/Locuza_/status/1538696558920736769
https://semianalysis.com/2022/06/17/amd-to-infinity-and-beyo...
https://chipsandcheese.com/p/going-armchair-quarterback-on-g...
What point are you trying to argue with everyone? You seem to be quibbling over everything without stating an actual POV.
They called me out as being wrong then cited incorrect data to support their claim. I responded with the real numbers and sources to back them up. What else should be done?
Not for VAX, see the recent thread on it [1].
[1] https://news.ycombinator.com/item?id=45378413
I’m not sure I get your point? Are you saying it is impossible to accelerate the VAX instruction set via the same technique used on x86? If so, you’ll have to explain why. Now, whether you’d want to or not is another question.
Yes, it's not possible to do it the same way, because what made the technique work for x86 is that x86 is remarkably RISC-y in actual behaviour compared to 68k or VAX.
The main reason for that is that x86 code, aside from being register-poor and therefore leaning heavily on the stack, decomposes most "CISC-y" operations into an LEA-style address calculation (inlined in the pipeline) + memory access + the actual operation.
VAX (and to a lesser extent m68k and others) required multiple memory indirections just to fetch the operands, way more than what is essentially a single LEA. The most complex VAX instructions could have been written off as "super slow and rarely used", but the burden of handling the operand indirections remains, including potentially huge memory latency costs.
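To make the decomposition concrete, here's a rough C sketch (my own illustration; the instruction breakdowns in the comments are the typical forms I'd expect, not output from any particular compiler):

    #include <stdint.h>

    /* total += table[i]
     *
     * An x86-64 compiler can emit something like
     *     add rax, [rbx + rcx*8]
     * and the front end splits that into ordinary micro-ops: an address
     * calculation (base + index*scale, i.e. what LEA computes), a load,
     * and the add itself, each of which fits a RISC-like pipeline.
     *
     * A VAX-style deferred operand (e.g. @disp(Rn)) instead needs an extra
     * memory read just to obtain the operand's address before the data
     * access can even start, and that indirection can itself miss in cache.
     */
    uint64_t accumulate(uint64_t total, const uint64_t *table, uint64_t i)
    {
        return total + table[i];
    }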
Also, the VAX instruction encoding is a class of horror above that of x86.
A few classes above.
The only ISA where I've seen a single instruction span two memory pages despite being page-aligned.
RISC-V is the better sweet spot, and it has no strong IP locks like ARM or x86_64. Not to mention that today's silicon changes everything: you avoid silicon design complexity as much as possible, since the result will be more than performant enough for the bulk of the programs out there.
> RISC-V is the better sweet-spot
See https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99... for an ex-ARM engineer's critique of RISC-V.
HN discussion: https://news.ycombinator.com/item?id=24958423
That discussion is five years old and a lot has happened in the RISC-V world since.
64-bit RISC-V finally achieved feature-parity with 64-bit ARM, with the RVA22 and RVA23 profiles. (There are no RVA23-compliant chips taped out yet, but several are expected early next year)
RISC-V's core was intentionally made small, to be useful for teaching and for allowing very tiny cores to be made for embedded systems. The extensibility has resulted in a few trade-offs that differ from ARM and can definitely be debated, but the extensibility is also one of the RISC-V ecosystem's strengths: embedded chips can contain just the extensions that are needed. Proprietary extensions are also allowed, and have been used to prototype and evaluate ideas when developing official extensions.
For any fair comparison between ARM and RISC-V, you should compare the right ARM ISA against the right RISC-V ISA. ARM Cortex-M0 against RV32IC, ARMv9 AArch64 against RVA23, etc.
Yeah that is pretty out of date. I'd say around half of those points have actually been addressed since then.
And the other half are true but irrelevant.
OK, if it has been addressed or it is irrelevant:
How do I do an "add with carry" (or subtract with carry/borrow) on RISC-V? (For this, of course, the addition has to set a carry flag, and the subtraction has to set either a carry or a borrow flag.)
This feature is very important for a high-performant implementation of arbitrary-precision arithmetic.
Yes yes that's the other widely quoted criticism of RISC-V, from a GNU MP maintainer.
At the time there was no widely-available RISC-V hardware. There is now, and earlier this year I tested it.
It turns out that the GNU MP project's own benchmark runs better on SiFive RISC-V cores than on comparable µarch Arm cores, specifically better on U74 than on A53, and better on P550 than on A72, despite (obviously) the Arm cores having carry flags and ADC instructions and the RISC-V cores not having those.
The ensuing discussion also came to a consensus that once you get to very wide cores, e.g. 8+ wide like Apple's M1 and several upcoming RISC-V cores, carry flags are ACTIVELY BAD because they serialise the computation (even with OoO, renaming the flags register, etc.), while only one 64-bit A + 64-bit B + 1-bit carry-in limb computation in 18,446,744,073,709,551,616 has an actual carry-out dependency on the carry-in, so you can almost always simply add all the limbs in parallel. The carry-in only affects the carry-out when A+B is exactly 0xFFFFFFFFFFFFFFFF.
Full thread here:
https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu...
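For concreteness, here's a minimal C sketch of the flag-free approach (my own illustration, not code from that thread); on RISC-V each "sum < operand" comparison below maps to a single sltu instruction:

    #include <stdint.h>

    /* One limb of a multi-precision add without a carry flag:
     * stores a + b + carry_in and returns the carry out. */
    uint64_t add_limb(uint64_t a, uint64_t b, uint64_t carry_in,
                      uint64_t *sum_out)
    {
        uint64_t s = a + b;
        uint64_t c = (s < a);      /* carry out of a + b (one sltu)           */
        s += carry_in;
        c |= (s < carry_in);       /* carry out of adding carry_in (one sltu) */
        *sum_out = s;
        return c;
    }

    /* The parallelism point in one line: the carry-in only changes the
     * carry-out when a + b is exactly 0xFFFFFFFFFFFFFFFF, i.e.
     *     carry_out = (a + b < a) | ((a + b == UINT64_MAX) & carry_in)
     * so for random limbs that happens once in 2^64 additions and the
     * limb sums can almost always be computed independently. */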
Well, maybe. I feel like he still has a point with array indexing and JAL.
Of course ARM people will go after RISC-V: it is a death sentence for them...
Come on...
But a real-life ISA doing a good enough job, without any strong global IP locks like x86-64 or ARM... yummy.
A quick Google search for the number of processors shipped in 2024 turns up Nvidia shipping one billion RISC-V _cores_ last year as pretty big news. Assuming they were one fourth of the market, and that there was just one core per chip (certainly not true), that's four billion cores.
Meanwhile, ARM licensees shipped twenty-nine billion chips, most with four cores and some with many more.
RISC-V is on the way up, but any death sentence for ARM is decades away.
Qualcomm is rumored to be paying 2-2.5% per chip instead of the 5-5.5% ARM sued them for (some sources claim it was 6-10%).
No matter how you slice up their handset and mobile divisions' profits, the current cost is billions just to license an ISA.
Qualcomm could likely pay for most of their chip design costs with the amount of money they’re giving ARM. When contract renewal comes up, Qualcomm will be even more incentivized to move to RISC-V.
Apple has apparently been moving their on-chip soft cores to RISC-V. At some point (probably around renewal time), they're likely to want to save money by switching to RISC-V too.
There are even rumors of ARM working on RISC-V cores (I’d guess smaller cores to compete with SiFive).
There’s an economic incentive to change and the only blocker is ARM inertia, but making ARM emulate quickly on RISC-V is almost certainly easier than making x86 and all its weirdness/edge cases run on ARM.
Once the software is in place (getting close) and a competitive RISC-V phone chip exists, I suspect the conversion will be very fast.
Apple probably pays a lot less than Qualcomm? They have a unique (details unknown) license. They have a license to 2040 now too: https://www.macrumors.com/2023/09/06/apple-inks-new-deal-arm...
I write RISC-V and x86_64 assembly, and I've had to write a bit of ARM64 assembly (because of an old RPi).
Well, I couldn't really tell the difference between RISC-V and ARM64.
> the current cost is billions just to license an ISA.
It used to be that instruction sets were not copyrightable, and you had to resort to implementation patents to extract money (e.g. the MIPS vs Lexra case https://www.eetimes.com/mips-technologies-sues-lexra-for-pat... )
Most architectures never actually die. New architectures just snap up the balance of the new designs. That said, a competitive license-free ISA is quite attractive to a lot of people, and once the snowball gets moving it will grow in size and pick up speed.
????
Do I really need to add "in the long run"? Because this is super obvious. That stuff does not happen overnight :)
You can already feel the pressure of ARM fanboys aggressively attacking RISC-V stuff. This is a good sign.
Don't forget: RISC-V has no strong IP ISA locks like ARM and x86_64...
In the long run we are all dead. So a “death sentence” far enough out is meaningless. Riscv killing arm is pretty distant.
They will all be replaced sooner or later.
I have no particular affinity for arm or riscv. Playing fan-boi for any technology is silly.
At least, I did not choose the side of an ISA with global super strong IP locks like some...
There are fanboys far more toxic than some other fanboys...
For example, Intel could brand their own chips as RISC-V-ZIntelx8664 if they wanted to, because a RISC-V Zxxx implementation can be as incompatible with any other RISC-V implementation (and even the specification) as one wants.
Honestly, I think the RISC-V guys would LOVE a dual x86/RISC-V hybrid processor as it would make the transition much more smooth.
> a RISC-V Zxxx implementation can be as incompatible with any other RISC-V implementation (and even the specification) as one wants.
It can't. Extensions have a lot of freedom, but they still have to follow the core spec. You can't just submit amd64 as a RISC-V extension, for a myriad of reasons; it clearly conflicts with the base ISA.
This is also why there is a trademark. If you or Intel wanted to deliberately create nonsense to be disruptive, you are allowed to create it, but then you can't call it RISC-V without receiving some fan-mail from lawyers.
What you could do is add a custom vendor extension, maybe XIntel64, which adds a CSR that can toggle between RISC-V and x86 mode.
...although you could certainly implement a Jazelle-style "branch to x86" instruction if you really wanted. It's fine as long as you come up in RVI mode.
Unfortunately you would not be able to repurpose the LSB of jalr targets as an "x86 bit" à la Thumb because the standard requires you to ignore that bit, though of course you could enable that behaviour with a custom CSR.