Triple decoder is one unique effect. The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO. Very well done.
Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies.
I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use fewer resources (cache, ROBs, etc.).
Big cores (P-cores or AMD Zen5) can obviously split into 2 hyperthreads, but what if that division is still too big? E-cores support 4 threads in roughly the same space as 1 P-core.
This is because L2 cache is shared/consolidated, and other resources (reorder buffers, register files, etc.) are all so much smaller on the E-core.
It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources though. But obviously I don't work at Intel and can't make these kinds of decisions.
> The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO
Not just small loops. It can reach 9x instruction decode on almost any control flow pattern. It just looks at the next 3 branch targets and starts decoding at each of them. As long as there is a branch every 32ish instructions (presumably a taken branch?), Skymont can keep all three uop queues full and Rename/dispatch can consume uops at a sustained rate of 8 uops per cycle.
And in typical code, blocks with more than 32 instructions between branches are somewhat rare.
But Skymont has a brilliant trick for dealing with long runs of branchless code too: It simply inserts dummy branches into the branch predictor, breaking them into shorter blocks that fit into the 32 entry uop queues. The 3 decoders will start decoding the long block at three different positions, leap-frogging over each other until the entire block is decoded and stuffed into the queues.
This design is absolutely brilliant. It seems to entirely solve the issue of decoding x86, with far fewer resources than a uop cache. I suspect the approach will scale to almost unlimited numbers of decoders running in parallel, shifting the bottlenecks to other parts of the design (branch prediction and everything post decode).
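To make the leap-frogging concrete, here is a toy model in Python. It is only a sketch under loud assumptions: one queue entry per instruction, blocks handed to the three clusters round-robin, and no modelling of fetch bandwidth, branch prediction, or the in-order handoff to rename. The function names are mine, not anything from Intel.

```python
from math import ceil

QUEUE_DEPTH = 32  # entries per uop queue, as described above
CLUSTERS = 3      # decode clusters
WIDTH = 3         # instructions decoded per cluster per cycle


def split_blocks(block_lengths, limit=QUEUE_DEPTH):
    """Split blocks longer than `limit`, standing in for the dummy branches."""
    out = []
    for n in block_lengths:
        while n > limit:
            out.append(limit)
            n -= limit
        if n:
            out.append(n)
    return out


def decode_cycles(block_lengths, clusters=CLUSTERS, width=WIDTH):
    """Cycles to decode every block when blocks go to clusters round-robin.

    Assumption: clusters never wait on each other, so the answer is simply
    the busiest cluster's total decode time.
    """
    busy = [0] * clusters
    for i, n in enumerate(split_blocks(block_lengths)):
        busy[i % clusters] += ceil(n / width)
    return max(busy)
```

In this model a single 100-instruction branchless run is split into blocks of 32, 32, 32 and 4, so `decode_cycles([100])` comes out at 13 cycles versus 35 for `decode_cycles([100], clusters=1)`.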
Thanks for the explanation. I was wondering how the heck Intel managed to make a 9-way x86 decoder, in a low-power core of all things. Seems like an elegant approach.
The important bit: Intel E-cores now have 3 decoders, each capable of 3-wide decode. When they work as a team, they can perform 9 decodes per clock tick (which then bottlenecks to 8 renamed uops in the best case scenario, and more typically ~3 or ~4 uops).
3-4 uops per cycle is more of an average throughput than a typical throughput.
The average is dragged down by many cycles that don't decode/rename any uops. Either the core is waiting for bytes to decode (icache miss, etc.) or rename is blocked because the ROB is full (probably stalled on a dcache miss).
So you want quite a wide frontend so that whenever you are unblocked, you can drag the average up again.
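A back-of-the-envelope version of that argument, as a Python sketch. The stall fractions are made-up illustration values, not measurements, and the model assumes the frontend runs at full width whenever it is not stalled.

```python
def average_throughput(peak_uops_per_cycle, stalled_fraction):
    """Average uops/cycle if the frontend delivers `peak` when unblocked and 0 when stalled."""
    if not 0.0 <= stalled_fraction <= 1.0:
        raise ValueError("stalled_fraction must be between 0 and 1")
    return peak_uops_per_cycle * (1.0 - stalled_fraction)


def required_peak(target_average, stalled_fraction):
    """Peak width needed to hit `target_average` given the stalled fraction."""
    if not 0.0 <= stalled_fraction < 1.0:
        raise ValueError("stalled_fraction must be in [0, 1)")
    return target_average / (1.0 - stalled_fraction)
```

With a hypothetical 50% of cycles stalled, an 8-wide rename averages `average_throughput(8, 0.5)` = 4 uops per cycle, and hitting a 3.5 average needs `required_peak(3.5, 0.5)` = 7 uops per cycle while unblocked.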
While the frontend of Intel Skymont, which includes instruction fetching and decoding, is very original and unlike that of any other CPU core, the backend of Skymont, which includes the execution units, is extremely similar to that of Arm Cortex-X4 (a.k.a. Neoverse V3 in its server variant and Neoverse V3AE in its automotive variant).
This similarity consists in the fact that both Intel Skymont and Arm Cortex-X4 have the same number of execution units of each kind (and there are many kinds of execution units).
Therefore it can be expected that for any application whose performance is limited by the CPU core backend, Intel Skymont and Arm Cortex-X4 (or Neoverse V3) should have very similar performance.
Moreover, Intel Skymont and Arm Cortex-X4 have the same die area, i.e. around 1.7 square mm (counting 1 MB of L2 cache in this area for both cores). Therefore the 2 cores should not only have about the same performance for backend-limited applications, but also the same cost.
Before Skymont, all the older Intel Atom cores had been designed to compete with the medium-size Arm Cortex-A7xx cores, even if the Intel Atom cores have always lagged Cortex-A7xx in performance by a year or two. For instance Intel Tremont had very similar performance to Arm Cortex-A76, while Intel Gracemont and Crestmont have a core backend extremely similar to that of the series from Cortex-A78 to Cortex-A725 (like Gracemont and Crestmont, the 5 cores in the series Cortex-A78, Cortex-A710, Cortex-A715, Cortex-A720 and Cortex-A725 have only insignificant differences in the execution units).
With Skymont, Intel has made a jump in E-core size, positioning it as a match for Cortex-X, not for Cortex-A7xx, like its predecessors.
Well the recent Cortex X5 or 925 is already at around 3.4 mm² so that comparison isn't exactly accurate. But I would love to test and see results on Skymont compared to X4. But I don't think they are available yet (as an individual core).
I am really looking forward to Clearwater Forest which is Skymont on 18A for Server.
And I know I am going to sound crazy but I wouldn't mind a small SoC based on Skymont and Xe2 Graphics for smartphones and tablets.
> Clearwater Forest which is Skymont on 18A for Server.
Clearwater Forest will be using a further generation improved E-core, Darkmont, which will also sit on top of large local caches using Foveros Direct 3D (like AMD's X3D design). [1]
Likely Darkmont will be a server tweaked version of Skymont, but there is no public info yet available.
This is possibly the critical product which will determine if Intel will be viable from a manufacturing and design perspective...If it gets released in the next 6-9 months with good thermals, IPC, and clock speeds, Intel will have a major winner on its hands. If not....
Like I have said, Intel Skymont is a very close match for Cortex-X4, not for Cortex-X925.
With Cortex-X925 Arm has made a big jump in core size, departing from the previous Cortex-X series. This has allowed a good increase in IPC, greatly improving the results of single-threaded benchmarks, but it has been paid for with much worse performance per area, making Cortex-X925 completely unsuitable for multi-threaded applications. Therefore Cortex-X925, like Intel Lion Cove, is useful only when it is accompanied by smaller cores that handle the multi-threaded workloads.
So unlike with previous Arm cores, Cortex-X925 has not made Cortex-X4 obsolete, as demonstrated e.g. in MediaTek Dimensity 9400, which includes 1 Cortex-X925 to get good single-threaded benchmark scores, together with 3 Cortex-X4 to get good multi-threaded benchmark scores.
It is not clear what Arm's intentions are for the evolution of the Cortex-X series. The rumors are that the next core configuration for smartphones is intended to be like that already deployed by Qualcomm with its custom cores, i.e. to have a big core that is 3 times bigger than the medium-size core and to use 2 big Cortex-X930 cores + 6 medium-size Cortex-A730 cores, for an even split in die area between the big cores and the medium-size cores.
For this to work well, Cortex-X930 must provide a good improvement in performance per area over Cortex-X925, because otherwise it would be hard to justify a 2+6 arrangement, when in the same die area one could have implemented a 1+9 configuration, with the same single-threaded performance, but with better multi-threaded performance.
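The area arithmetic behind the 2+6 versus 1+9 comparison can be sketched in a few lines of Python. The only inputs taken from the rumor are the 3:1 area ratio and the two configurations; the throughput numbers per core are placeholders, and summing per-core throughput assumes a perfectly scaling multi-threaded workload.

```python
BIG_AREA = 3     # area of one big core, in units of one medium core
MEDIUM_AREA = 1


def area(big, medium):
    """Total core area of a big+medium configuration, in medium-core units."""
    return big * BIG_AREA + medium * MEDIUM_AREA


def mt_throughput(big, medium, big_perf, medium_perf=1.0):
    """Multi-threaded throughput, assuming it is just the sum over all cores."""
    return big * big_perf + medium * medium_perf
```

Both `area(2, 6)` and `area(1, 9)` come to 12 units. Under this model 2+6 only wins on multi-threaded throughput when the big core is more than 3 times faster than the medium core, i.e. when its performance per area is at least as good: `mt_throughput(2, 6, 2.5)` is 11.0 while `mt_throughput(1, 9, 2.5)` is 11.5.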
I believe that a small SoC with only 4 Skymont cores and Xe2 graphics would provide performance, battery lifetime and cost for a smartphone that would be completely competitive with any existing Qualcomm, MediaTek or Samsung SoC.
This would be less obvious in a benchmark like GeekBench 6, where Cortex-X925 or Qualcomm Oryon L would show a greater single-threaded score, but the difference would not be great enough to actually matter in real usage. Also for multi-threaded performance measured by GB6, only 4 Skymont cores would seem to be a little slower than the current flagships, but that would be misleading, because 4 Skymont cores could run at full speed for long durations within the smartphone power constraints, while the current 8-core flagships can never run all 8 cores at the 100% performance recorded by GB6, without overheating after a short time.
An 8-core Skymont SoC would be excellent for a cheap tablet with long battery lifetime and great performance, even if again, such a configuration would be penalized by GB6, which favors having 1 huge core, like Cortex-X925, for the ST score, together with an over-provisioned set of medium-size cores, which can run all together only for the short time required to complete the GB6 sub-benchmarks, but in real prolonged usage must never be all completely busy at the same time, in order to avoid overheating.
Skymont area efficiency should be compared to Zen 5C on 3nm. It has higher IPC, SMT with dual decoders - one for each thread, and full rate AVX-512 execution.
AMD didn't have major difficulties in scaling down their SMT cores to achieve similar performance per area. But Intel went with a different approach, at the cost of having different ISA support on each core in consumer devices and having to produce an SMT version of their P-cores for servers anyway.
It should be noted that Intel Skymont has the same area as Arm Cortex-X4 (a.k.a. Neoverse V3), and it should also have the same performance for any backend-limited application (both use 1.7 square mm in the "3 nm" TSMC fabrication process, while a Zen 5 compact might have almost double the area in the less dense "4 nm" process with full vector pipelines, and a 3 square mm area with reduced vector pipelines in the same less dense process).
Arm Cortex-X4 has the best performance per area among the cores designed by Arm. Cortex-X925 has double the area of Cortex-X4, which results in a much lower performance per area. Cortex-A725 is smaller in area, but the area ratio is likely to be smaller than the performance ratio (for most kinds of execution units Cortex-X4 has twice as many, while for some it has only a 50% or a 33% advantage), so it is likely that the performance per area of Cortex-A725 is worse than for Cortex-X4 and for Skymont.
For any programs that benefit from vector instructions, Zen 5 compact will have a much better performance per area than Intel Skymont and Arm Cortex-X4.
For programs that execute mostly irregular integer and pointer operations, there are chances for Intel Skymont and Arm Cortex-X4 to achieve better performance per area, but this is uncertain.
Intel Skymont and Arm Cortex-X4 have a greater number of integer/pointer execution units per area than Zen 5 compact, even if Zen 5 compact were made with a TSMC process equally dense, which is not the case today.
Despite that, the execution units of Zen 5 compact will be busy a much higher percentage of the time, for several reasons. Zen 5 is better balanced, it has more resources for ensuring out-of-order and multithreaded execution, it has better cache memories. All these factors result in a higher IPC for Zen 5.
It is not clear whether the better IPC of Zen 5 is enough to compensate for its greater area when performing only irregular integer and pointer operations. Most likely, in such cases Intel Skymont and Arm Cortex-X4 retain a small advantage in performance per area, i.e. in performance per dollar, because the advantage in IPC of Zen 5 (when using SMT) may be in the range of 10% to 50%, while the advantage in area of Intel Skymont and Arm Cortex-X4 might be somewhere between 50% and 70%, had they been made with the same TSMC process.
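Plugging those ranges into a quick Python sketch shows where the "small advantage" comes from. I am assuming here that "50% to 70% advantage in area" means Zen 5 compact would be 1.5x to 1.7x the size on an equal process, and that throughput scales directly with the IPC advantage; both are my reading of the estimate, not measured data.

```python
def perf_per_area_ratio(ipc_gain, area_gain):
    """Zen 5 compact perf/area relative to Skymont or Cortex-X4 (1.0 means parity).

    ipc_gain:  fractional IPC advantage of Zen 5 compact (0.10 means +10%)
    area_gain: fractional extra area of Zen 5 compact (0.70 means +70%)
    """
    return (1.0 + ipc_gain) / (1.0 + area_gain)


worst_case = perf_per_area_ratio(0.10, 0.70)  # low IPC gain, large area penalty
best_case = perf_per_area_ratio(0.50, 0.50)   # high IPC gain, small area penalty
```

Under these assumptions Zen 5 compact lands between roughly 0.65x and 1.0x of the performance per area of the smaller cores on scalar integer code, so at best it reaches parity.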
On the other hand, for any program that can be accelerated with vector instructions, Zen 5 compact will crush in performance per area (i.e. in performance per dollar) any core designed by Intel or Arm.
> It seems like the overall discovery from Intel is that highly threaded tasks use fewer resources (cache, ROBs, etc.).
Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async
Not... quite. I think you've got the cause-and-effect backwards.
Programmers who happen to write multiple-threaded programs don't need powerful cores, they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Programmers who happen to write powerful single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously have 96MB of L3 cache, and video games that run on these very powerful cores have much better performance.
It's not "Programmers should change their code to fit the machine". From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers... largely represented by the Video Game programmers... want P-cores. But not very many of them.
Multithreaded programmers... represented by Graphics and a few others... want E-cores. Splitting a P-core into "only" 2 threads is not sufficient, they want 4x or even 8x more cores. Because there's multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.
--------
> Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async
Power-efficiency is going to be incredibly difficult moving forward.
It should be noted that E-cores are not very power-efficient though. They're area-efficient, i.e. cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.
E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power-efficiency is their particular design goal.
If your code benefits from cache (ie: big cores), it's probable that the lowest power-cost would be to run on large caches (like P-cores or Zen5 or Zen5 X3D). Communicating with RAM always costs more power than just communicating with caches after all.
If your code does NOT benefit from cache (ie: Blender regularly has 100GB+ scenes for complex movies), then all of those spare resources on P-cores are useless, as nothing fits anyway and the core will be spending almost all of its time waiting on RAM to do anything. So the E-core will be more power efficient in this case.
> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Is this true? In most of my work I'd usually rather have a single serializable thread of execution. Any parallelism usually comes with added overhead of synchronization, and added mental overhead of having to think about parallel execution. If I could freely pick between 4 IPC worth of single core or 1 IPC per core with 4 cores I'd pretty much always pick a single core. The catch is that we're usually not trading like for like. Meaning I can get 3 IPC worth of single core or 4 IPC spread over 4 cores. Now I suddenly have to weigh the overhead and analyze my options.
Would you ever rather have multiple cores or an equivalent single core? Intuitively it feels like there's some mathematics here.
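The mathematics being hinted at is essentially Amdahl's law. Here is a minimal Python sketch using the numbers from the comment above (a single core worth 3 IPC versus 4 cores worth 1 IPC each). It assumes the parallel part scales perfectly and ignores synchronization overhead, so it flatters the multi-core option.

```python
def multi_core_time(parallel_fraction, cores, per_core_speed=1.0):
    """Time for one unit of work when `parallel_fraction` of it scales across `cores`."""
    serial = (1.0 - parallel_fraction) / per_core_speed
    parallel = parallel_fraction / (cores * per_core_speed)
    return serial + parallel


def single_core_time(speed):
    """Time for the same unit of work on one core of the given speed."""
    return 1.0 / speed


def break_even_parallel_fraction(single_speed, cores, per_core_speed=1.0):
    """Parallel fraction at which both options take the same time.

    Solves (1 - p)/s + p/(c*s) = 1/S for p.
    """
    if cores < 2:
        raise ValueError("need at least 2 cores")
    return (1.0 - per_core_speed / single_speed) / (1.0 - 1.0 / cores)
```

For the 3 IPC versus 4x1 IPC case the break-even is 8/9, so the four small cores only win if about 89% or more of the work parallelises cleanly. For the like-for-like case (4 IPC versus 4x1 IPC) the break-even is 1.0, which matches the intuition that the single core is never worse.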
Indeed a single thread is simplest to reason about, but if you have a single task that can already use 2 cores uniformly, going to 8 cores (assuming enough workload) should be a pretty clean 4x speedup (as long as you don't run into memory bandwidth limits, but that'd cap the single-threaded code too).
But the performance difference between E-core and P-core perf is way less than 4x; the OP article shows a 1.6x/1.7x difference in SPEC for Skymont vs Lion Cove, and 1.3x/1.7x for Crestmont vs Redwood Cove; and some searching around for past generations gives numbers around 1.4x.
Increasing core counts being a much more area- and energy-efficient way for hardware to provide more total performance than making the individual cores faster is a pretty fundamental thing.
E-cores can typically execute ~4-instructions per clock tick in highly optimized code.
P-cores go up to like... 6-instructions. Better yes, but not dramatically better. The real issue is that P-cores have far more resources than E-cores: deeper reorder buffers to perform more out-of-order execution. Deeper branch prediction, more register files, larger caches, etc. etc.
So P-cores should be hitting the max of 6-instructions per clock tick on more kinds of code. E-cores have much smaller caches (and other resources) so they'll run out and start stalling out to memory-limitations, which is like 0.1 instructions per clock tick or slower.
----------
But guess what? If a fancy P-core is memory-bound (like a lot of Blender code, due to the large-scale dozens+ GBs nature of modern 3d scenes), then those fancy P-cores run out of resources and are 0.1 IPC as well.
If both P-cores and E-cores are stalled out waiting on memory, you'd rather have 32x E-Cores all executing at 0.1 IPC, rather than only 8x P-cores executing at 0.1 IPC.
It's going to be a complex world moving forward: modern CPUs are growing far more complex and it's not clear what the tradeoffs will be. But this reality of E-cores and P-cores stalling out and waiting on memory is just how modern code works in too many cases.
And remember, 4x E-cores are equivalent in area/cost to 1x P-core. So there's no contest in terms of overall instructions-per-second for E-cores vs P-cores. The E-cores simply are better, even if the individual threads run slower.
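Spelling out the arithmetic as a tiny Python sketch, using the rough per-core IPC figures quoted in this comment (about 4 for an E-core and 6 for a P-core at best, 0.1 for either when stalled on memory) and the 4:1 area ratio. These are ballpark numbers, not benchmarks, and the model ignores shared memory bandwidth limits.

```python
def aggregate_ipc(cores, ipc_per_core):
    """Total instructions per clock across all cores, assuming independent threads."""
    return cores * ipc_per_core


# Same silicon area: 8 P-cores or 32 E-cores (4 E-cores per P-core of area).
memory_bound = {
    "8 P-cores": aggregate_ipc(8, 0.1),
    "32 E-cores": aggregate_ipc(32, 0.1),
}
compute_bound = {
    "8 P-cores": aggregate_ipc(8, 6),
    "32 E-cores": aggregate_ipc(32, 4),
}
```

In the memory-bound case that is about 0.8 versus 3.2 instructions per clock in total, and in the best case 48 versus 128, which is why the E-cores win on aggregate throughput per area as long as the workload has enough threads.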
Obviously it is easier to write any program as a single sequential thread, because you do not need to think about the dependencies between program statements. When you append a statement, you assume that all previous statements have been already executed, so the new statement can access without worries any data it needs.
The problem is that the speed of a single thread is limited and there exists no chance to increase it by significant amounts.
As long as we continue to use silicon, there will be negligible increases in clock frequency. Switching to other semiconductors might double the clock frequency 10 years from now, but there will never again be a decade like that from 1993 to 2003, when clock frequencies increased 50 times.
The slow yearly increase in instructions per clock cycle is obtained by making the hardware do more and more of the work that has not been done by the programmer or the compiler, i.e. by extracting from the sequential program the separate chains of dependent instructions that should have been written as distinct threads, in order to execute them concurrently.
This division of a single instruction sequence into separate threads is extremely inefficient when done at runtime by hardware. Because of this, CPU cores with very high IPC have lower and lower performance per area and per watt as the IPC increases. Low performance per area and per watt means low multithreaded performance.
So the CPU cores with very good single-threaded performance, like Intel Lion Cove or Arm Cortex-X925 have very poor multi-threaded performance and using many of them in a CPU would be futile, because in the same limited area one could put many more small CPU cores, achieving a much higher total performance.
This is why such big CPU cores that are good for single-threaded applications must be paired with smaller CPU cores, like Intel Skymont or Arm Cortex-X4, in order to obtain a good multi-threaded performance.
Writing the program as a single thread is easy and of course it should always be done so when the achieved performance is good enough on the current big superscalar CPU cores.
On the other hand, whenever the performance is not sufficient, there is no way to increase it a lot other than by decomposing the work that must be done into multiple concurrent activities.
The easy case is that of iterations, which frequently provide large amounts of work that can be done concurrently. Moreover, with iterations there are many tools that can create concurrent threads automatically, like OpenMP or NVIDIA CUDA.
Where there are no iterations, one may need to do much more work to identify the dependencies between activities, in order to determine which may be executed concurrently, because they do not have functional dependencies between them.
However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
Nevertheless even in such cases one can question whether the specification of the program is truly what the end user needs. For instance, when computing a hash over a file, the actual goal is normally not the computation of the hash, but to verify whether the file is the same as another (where the other file may be a past version of the same file, to detect modification, or an apparently distinct file coming from another source, when deduplication is desired). In such cases, it does not really matter which hash function is used, so it may be acceptable to replace the hash algorithm with another that allows concurrent computation, solving the performance problem.
Similar reformulations of the problem that must be solved may help in other cases where initially it appears that it is not possible to decompose the workload into concurrent tasks.
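As an illustration of that kind of reformulation for the file-hash case, here is a minimal Python sketch of a chunked (two-level tree) digest. This is not any standard hash construction: the function names, the chunk size and the way the leaf digests are combined are all assumptions made for illustration, and the result is not comparable with a plain SHA-256 of the same data.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for this sketch


def _leaf_digest(chunk):
    """Hash one chunk independently of all the others."""
    return hashlib.sha256(chunk).digest()


def chunked_digest(data, chunk_size=CHUNK_SIZE, workers=4):
    """Hash the chunks concurrently, then hash the ordered list of chunk digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the outcome is deterministic.
        leaves = list(pool.map(_leaf_digest, chunks))
    root = hashlib.sha256()
    root.update(len(data).to_bytes(8, "big"))  # bind the total length into the root
    for leaf in leaves:
        root.update(leaf)
    return root.hexdigest()
```

Two files can then be compared with `chunked_digest(a) == chunked_digest(b)`, provided both sides use the same chunk size. The chunks have no dependencies on each other, so the leaf hashing can spread over as many cores as are available, whereas a single SHA-256 pass is one long dependent chain.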
> However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
Even in that case, you would probably benefit from having many cores because the user is probably running other things on the same machine, or the program is running on a runtime with eg garbage collector threads etc. I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
> I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
But that happens all the time in video game code.
Video games may have many threads running, but there's usually a single-thread bottleneck. To the point that P-cores and massively huge Zen5 cores are so much better for video games.
Javascript (ie: rendering webpages) is single-threaded bound, which is probably why the Phone makers have focused so much on making bigger cores as well. Yes, there's plenty of opportunities for parallelism in web browsers and webpages. But most of the work is in the main Javascript thread called at the root.
That's just more parallelism, they'll take their parallelism wherever they can get it.
It remains to be seen if the future is more SIMD or more smaller general processors. Arguably the latter are more flexible but maybe not as efficient as the former.
I like Zen5 as much as the next guy, but it should be noted that even today's most recent version of Blender is AVX (256-bit) only. That means E-cores remain the optimal core to work with for a lot of Blender stuff.
Hopefully AMD Zen5 AVX512 becomes more popular. Maybe it'd become more popular as Intel rolls out AVX10 (somewhat compatible instruction set)
AVX512 is one of the best instruction sets I've seen. No joke.
There's all kinds of things AVX512 would help out with in Blender. But those ways are incompatible with older AVX2 or SSE code. The question is if Blender will be willing to support SSE, AVX, and AVX512 code paths. Each new codepath is more maintenance and more effort.
AVX512 has more registers: not just 32x 512-bit registers (AVX normally has 16x 256-bit registers), but also the kmask registers (64-bits that take the place of old boolean logic that used to be done on the 256-bit registers). This alone should give far more optimizations for the compiler to automatically find.
There's also VPCOMPRESSB and VPEXPANDB, Conflict-detection, and other instructions that make new SIMD data-structures far more efficient to implement. But this requires deep understanding that very few programmers have yet.
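For anyone who hasn't met compress/expand before, here is what they do semantically, modelled on plain Python lists. This is a scalar model of the zero-masking behaviour, written only to show the data movement; the real instructions do this across a whole vector register at once, with the mask coming from a kmask register.

```python
def compress(values, mask, fill=0):
    """Pack the elements whose mask bit is set into the low lanes, zero-fill the rest."""
    kept = [v for v, m in zip(values, mask) if m]
    return kept + [fill] * (len(values) - len(kept))


def expand(packed, mask, fill=0):
    """Inverse of compress: spread the low lanes of `packed` out to the set mask positions."""
    source = iter(packed)
    return [next(source) if m else fill for m in mask]
```

For example, `compress([10, 20, 30, 40], [1, 0, 1, 0])` gives `[10, 30, 0, 0]`, and `expand([10, 30, 0, 0], [1, 0, 1, 0])` gives `[10, 0, 30, 0]`. That "filter into a dense array" step is the building block for things like SIMD stream compaction and work queues.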
Cloth physics in Blender are stored in RAM (as scenes and models can grow very large, too large for a GPU).
Figuring out which vertices for a physics simulation need to be sent to the GPU would be time, effort, and PCIe traffic _NOT_ spent running the cloth physics.
Furthermore, once all the data is figured out and blocked out in cache, it's... cached. Cloth physics only interacts with a small number of close, nearby objects. Yeah, you _could_ calculate this small portion and send it to the GPU, but the CPU is really good at just traversing trees and keeping the most recently used stuff in L1, L2, and L3 caches automatically (without any need of special code).
All in all, I expect something like Cloth physics (which is a calculation Blender currently does on CPU-only), is best done CPU only. Not because GPUs are bad at this algorithm... but instead because PCIe transfers are too slow and cloth physics is just too easily cached / benefited by various CPU features.
It'd be a lot of effort to translate all that code to GPU and you likely won't get large gains (like Raytracing/Cycles/Rendering gets for GPU Compute).
I wonder what the difference is between the cloth physics you are talking about and the one NVIDIA has been doing for I think more than a decade now? Is it scale? It sounds like, at least, there are alternatives that do it on the GPU and there are questions if Blender will do it on the GPU:
Cloth / Hair physics in those games were graphics-only physics.
They could collide with any mesh that was inside of the GPU's memory. But those calculations cannot work on any information stored on CPU RAM. Well... not efficiently anyway.
---------
When the Cloth simulator in Blender runs, it generates all kinds of information the CPU needs for other steps. In effect, Blender's cloth physics serves as an input to animation frames, which is all CPU-side information.
Again: I know cloth physics executes on GPUs very well in isolation. But I'd be surprised if BLENDER's specific cloth physics would ever be efficient on a GPU. Because as it turns out, the calculations kind of don't matter in the big picture. There's a lot of other things you need to do after those calculations (animations, key frames, and other such interactions). And if all that information is stored randomly in 100GB of CPU RAM, it'd be very hard to untangle that data and get it to a GPU (and back).
In a Video Game PHYSX setting, you just display the cloth physics to the screen. In Blender, a 3d animation program, you have to do a lot more with all that information and touch many other data-structures.
> I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use fewer resources (cache, ROBs, etc.).
AMD did something similar before. Does anyone remember the Bulldozer cores sharing resources between pairs of cores?
In Bulldozer, two cores shared a single Floating Point Unit. A bet was made that the majority of calculations would be done in integers. Unfortunately, the real world requires decimal places, and languages like Javascript only have floating point numbers. Bulldozer flopped hard.
AMD's modern SMT implementation is just better than Bulldozer's decoder sharing.
Modern Zen5 has a very good SMT implementation. 10 years ago, people mostly talked about Bulldozer's design to make fun of it. It was always intriguing to me but it just never seemed to be practical in any workflow.
"Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies."
@dragontamer solid point. Consider an in-memory ring shared between two threads. There's a huge difference in throughput and latency depending on whether the threads share an L2 or sit on cores that don't, all down to the relative slowness of L3.
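For readers who haven't built one, this is roughly the data structure in question, sketched in Python purely to show which fields the two threads share. It is a single-producer/single-consumer ring with hypothetical names; a real low-latency version would be written in C, C++ or Rust with atomic head and tail counters on separate cache lines, and Python cannot demonstrate the latency difference itself.

```python
class SpscRing:
    """Fixed-size ring for exactly one producer thread and one consumer thread."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._buf = [None] * capacity
        self._cap = capacity
        self._head = 0  # count of items consumed; written only by the consumer
        self._tail = 0  # count of items produced; written only by the producer

    def try_push(self, item):
        """Producer side: returns False if the ring is full."""
        if self._tail - self._head == self._cap:
            return False
        self._buf[self._tail % self._cap] = item
        self._tail += 1  # publish only after the slot has been written
        return True

    def try_pop(self):
        """Consumer side: returns (True, item), or (False, None) if the ring is empty."""
        if self._head == self._tail:
            return (False, None)
        item = self._buf[self._head % self._cap]
        self._head += 1
        return (True, item)
```

Every push and pop makes the cache lines holding the counters and the slot bounce between the producer's core and the consumer's core. When both cores sit behind the same L2 that bounce stays inside the cluster, otherwise it goes out to L3 or further, which is where the latency gap comes from.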
Are there other cpus (arm, graviton?) that have similarly shared L2 caches?
Hyperthreading actually shares L1 caches between two threads (after all, two threads are running in the same L1 cache and core).
I believe SMT4 and SMT8 cores from IBM Power10 also have L1 caches shared (8 threads on one core), and thus benefit from communication speeds.
But you're right in that this is a very obscure performance quirk. I'm honestly not aware of any practical code that takes advantage of this. E-cores are perhaps the most "natural" implementation of high-speed core-to-core communications though.
What we desperately need before we get too deep into this is stronger support in languages for heterogeneous cores in an architecture agnostic way. Some way to annotate that certain threads should run on certain types of cores (and close together in memory hierarchy) without getting too deep into implementation details.
I don't think so. I don't trust software authors to make the right choice, and even the most lopsided cases, where a thread will usually need a bigger core, can afford to wait for the scheduler to figure it out.
And if you want to be close together in the memory hierarchy, does that mean close to the RAM that you can easily get to? And you want allocations from there? If you really want that, you can use numa(3).
> without getting too deep into implementation details.
Every microarchitecture is a special case about what you win by being close to things, and how it plays with contention and distances to other things. You either don't care and trust the infrastructure, or you want to micromanage it all, IMO.
I'm talking about close together in the cache. If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be allocated on the same l2 cache. And no matter what, you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.
NUMA aware threading is somewhat rare but it does exist.
It's just reaching into the high arts of high-performance computing that fewer and fewer programmers know about. I myself am not an HPC expert, I just like to study this stuff on the side as a hobby.
So NUMA-awareness is when your code knows that &variable1 is located in one physical location, while &variable2 is somewhere else.
This is possible because NUMA-aware allocators (numa_alloc_onnode from libnuma in Linux, VirtualAllocExNuma in Windows) can take parameters that place an allocation within a particular NUMA zone.
Now that you know certain variables are tied together in physical locations, you can also tie threads together with affinity to those same NUMA locations. And with a bit of effort, you can ensure that threads that are in one workpool share the same NUMA zones.
---------
Now code-awareness of shared caches is less common. But following the same models of "abstracted work pools of thread-affinity + NUMA awareness of data", programmers have been able to ensure that Zen1 cores work together on the same L3 cache.
L2 cache with E-cores is new, but not a new concept in general. (IE: the same mechanisms and abstractions we used for thread-affinity checks on Zen cores sharing L3 cache, or multi-socket CPUs being NUMA Aware... all would still work for L2 cache).
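On Linux the "which cores share an L2" question can already be answered from sysfs, so the same affinity tricks apply. Below is a minimal Python sketch, assuming a Linux system that exposes the usual cache topology files under /sys/devices/system/cpu; the function names are mine, and on other systems `l2_sharing_groups` just returns an empty list.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def _read(path):
    with open(path) as f:
        return f.read().strip()


def l2_sharing_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sets of CPUs that share one L2 cache, as sorted tuples."""
    groups = set()
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index*")
    for index_dir in glob.glob(pattern):
        try:
            if _read(os.path.join(index_dir, "level")) != "2":
                continue
            shared = parse_cpulist(_read(os.path.join(index_dir, "shared_cpu_list")))
        except (OSError, ValueError):
            continue  # missing or unreadable entry: skip it
        groups.add(tuple(shared))
    return sorted(groups)


def pin_current_thread(cpus):
    """Restrict the calling thread to `cpus`; returns False where unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
        return True
    return False
```

A work pool that wants four cooperating threads on one E-core cluster could pick a group of size 4 from `l2_sharing_groups()` and have each worker call `pin_current_thread(group)` at startup. On big cores each group would normally contain just the SMT siblings of one core.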
I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
> I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
Intel can declare in ACPI a set of nodes, the distances between nodes, and then Linux/libnuma/etc pick it up.
So, e.g. in AMD's SLIT tables, the local node is 10; within the same partition are 11; within the same socket are 12; distant sockets are >=20.
There's fancier, more detailed tables (e.g. HMAT) and some code out there that uses them, but it's kind of beyond the scope of libnuma.
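To make the distance table concrete, here is a small sketch that orders nodes from nearest to farthest. The 4-node matrix is invented for illustration, using the 10/11/12/20 values mentioned above; on a real Linux box each row can be read from /sys/devices/system/node/nodeN/distance.

```python
def read_distance_row(node):
    """Read one row of the node distance table from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/distance") as f:
        return [int(x) for x in f.read().split()]

def nodes_by_distance(distances, local):
    """Return node ids sorted from nearest to farthest as seen from `local`."""
    row = distances[local]
    return sorted(range(len(row)), key=lambda n: (row[n], n))

# Hypothetical machine: nodes 0 and 1 share a partition, node 2 is in the
# same socket, node 3 is on a distant socket.
EXAMPLE_SLIT = [
    [10, 11, 12, 20],
    [11, 10, 12, 20],
    [12, 12, 10, 20],
    [20, 20, 20, 10],
]
```

A scheduler or allocator that runs out of room on the local node would walk this ordering to pick the next-best node.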
> you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.
I trust systems to do better based on observed behavior rather than a software engineer's guess of how it will be scheduled. Who knows if, in a given use case, the program is a "small" part of the system or a "large" part that should get preferential placement and scheduling.
> If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be allocated on the same l2 cache.
And so this is kind of a weird thing: we know we're going to be performance critical and we need things to be forced to be adjacent... but we don't know the exact details of the hardware we're running on. (Else, just numa_bind and be done...)
The beauty is that you don't care what hardware you run on, all you're annotating are very useful but generic properties such as which threads are sharing a lot of memory, or perhaps that a thread should have highest performance priority so that internally it stays on p cores instead of the more scalable e cores. Very simple optional hints.
> should have highest performance priority so that internally it stays on p cores
Everything will decide that it wants P cores; it's not punished for battery or energy impact, and wants to win over other applications for users to have a better experience with it.
And even if not made in bad faith, it doesn't know what else is running on the system.
Also these decisions tend to be unduly influenced by microbenchmarks and then don't apply to the real system.
> which threads are sharing a lot of memory
But if they're not super active, should the scheduler really change what it's doing? And doesn't the size of that L2 matter? It doesn't matter if e.g. the stuff is going to get churned out before there's a benefit from that sharing.
In the end, if you don't know pretty specific details of the environment you'll run on: what the hardware is like, what loading is like, what data set size is like, and what else will be running on the machine -- it is probably better to leave this decision to the scheduler.
If you do know all those things, and it's worth tuning this stuff in depth-- odds are you're HPC and you know what the machine is like.
To clarify, what gets scheduled is up to the OS or runtime, all you're doing is setting relative priority. If everything is all the same priority, then it's just as likely to all run on e cores.
If you lie about the nature of your application, you'll only hurt performance in this configuration. You're not telling the OS what cores to run on, you're simply giving hints as to how the program behaves. It's no different than telling the threadpool manager how many threads to create or if a thread is long lived. It's a platform agnostic hint to help performance. And remember, this is all optional, just like the threadpool example that already exists in most major languages. Are you going to argue that programs shouldn't have access to core count information on the CPU too? They'll just shoot themselves in the foot, as you said.
Again, there's already explicit ways for programs to show fine control; this stuff is already declared in ACPI and libnuma and higher level shims exist over it. But generally you want to know both how the entire machine is being used and pretty detailed information about working set sizes before attempting this.
Most things that have tried to set affinities have ended up screwing it up.
There's no need to put an easier user interface on the footgun or to make the footgun cross-platform. These interfaces provide opportunities for small wins (generally <5%) and big losses. If you're in a supercomputing center or a hyperscaler running your own app, this is worth it; if you're writing a DBMS that will run on tens of thousands of dedicated machines, it may be worth it. But usually you don't understand the way you'll be employed well enough to know if this is a win.
In the context of the future of heterogeneous computing, where your standard pc will have thousands of cores of various capabilities and locality, I very much disagree.
> where your standard pc will have thousands of cores
Thousands of non-GPU cores, intended to run normal tasks? I doubt it.
Thousands of special purpose cores running different programs like managing power, managing networks, managing RGB lighting around? Maybe, but that doesn't really benefit from this.
Thousands of cores including GPU cores? What you're talking about in labelling locality isn't sufficient to address this problem, and isn't really even a significant step towards its solution.
CPUs are trending towards heterogeneous many-core implementations. 16 cores was considered server exclusive a few decades ago, now we're at a heterogeneous 24 cores on an Intel 14900K CPU. The biggest limit right now is on the software side, hence my original comment. I wouldn't be surprised if someday the CPU and GPU become combined to overcome the memory wall, with many different types of specialized cores depending on the use case.
The software side is limited, somewhat intrinsically (there tend to be a lot of things we want to do in order--- Amdahl's law wins).
And even when you aren't intrinsically limited by that, optimal placement doesn't reduce contention that much (assuming you're not ping-ponging a single cache line every operation or something dumb like that).
But the hardware side, too: we're not getting transistors that quickly anymore, and we don't want anything too much smaller than an Intel E-core. Even if we stack 3D, all that net wafer area is not cheap and isn't cheapening quickly.
OpenMP, Intel's TBB and other libraries/tools are clearly moving in this direction.
The main issue is that Intel is... well, Intel. Even if they write a good library, there's probably 0% chance it'd work well on ARM systems from their competitors. (And only a small chance that it'd be optimized for AMD.)
------
Microsoft did put a lot of work into ConcRT, but it doesn't look very successful. It's a very clean model of task-based scheduling, but I'm not seeing too much buzz about it or too many blog posts marketing the benefits.
The other problem Intel has is that they are apparently a horrible factional mess of a company. The fact that the P and E cores are completely separate architectures that sometimes don't even agree on what instruction set they are supporting (e.g. avx-512) is kind of crazy.
Intel advertising the fact that their schedulers can keep MS Teams confined to the efficiency cores... what a sad reflection of how bloated Teams is.
We make a single Electron-like app grow, cancer-like, to do everything from messaging and videoconferencing to shared drive browsing and editing, and as a result we have to contain it.
It can run in your browser too. The electron part isn't the bloat but the web part. Web devs keep using frameworks on top of frameworks and the bloat is endless. Lack of good native UX kits forces devs to use web-based kits. Qt has a nice idea with qml but aside from some limitations, it is mostly C++ (yes, pyqt, etc. exist).
Native UI kits should be able to do better than web-based kits. But I suspect just as with the web, the problem is consistency. The one thing the web does right is deliver consistent UI experience across various hardware with less dev time. It all comes down to which method has least amounts of friction for devs? Large tech companies spent a lot of time and money in dev tooling for their web services, so web based approaches to solve problems inherently have to be taken for even trivial apps (not that teams is one).
Open source native UX kits that work consistently across platforms and languages would solve much of this. Unfortunately, the open source community is stuck on polishing gtk and qt.
Not just consistency. Microsoft themselves don't have a modern, stable UI toolkit anymore. Linux is a mess. Only MacOS have something decent.
Then there is the fact that with native you need separate native app devs, familiar with the tooling and environment, because they're totally different, so costs balloon. A lot. Not to mention the difficulty of hiring, compared to Electron.
There are practical reasons why Electron won, people are just ignorant of those reasons. If they poured their hate into solving the problem instead, we might have something decent already. But it's easier to complain. So here we are.
Personally, I'm annoyed, but understand why it is like it is.
Same here. Write once, run everywhere eventually did win, but that was the web. And comparatively speaking, the JVM is so much better today than it was 20 years ago.
I sometimes wonder if Chromium actually does any specific optimisation for Electron-related usage.
I don't think it was Java itself, but many operating systems simply had a much stronger set of UX that was cohesively being followed.
Yet here we are, in an era where you can encounter multiple choices and you don't know whether it's a single select versus a multi-select tick box. And then there's something with a boolean state, and it's not clear which color means it's currently active. Then you hit alt-F for the File menu in order to quit your browser in frustration, but the web page blocks you because it has decided that means that you're going to "Favorite" whatever you're looking at.
> The electron part isn't the bloat but the web part.
The bloat part is the bloat. Web apps can be made to be perfectly performant if you are diligent, and native apps can be made to be bloated and slow if you're not.
>But I suspect just as with the web, the problem is consistency.
That is indeed the "problem" at its core. People are lazy, operating systems aren't consistent enough.
Operating systems came about to abstract all the differences of countless hardware away, but that is no longer good enough. Now people want to abstract away that abstraction: Chrome.
Chrome is the abstraction layer to Windows, MacOS, iOS, Linux, Android, BSD, C++, HTML, PHP, Ruby, Rust, Python, desktops, laptops, smartphones, tablets, and all the other things. You develop for Chrome and everyone uses Chrome and everyone on both sides gets the same thing for a singular effort.
If I were to step away from all I know and care about computers and see as an uncaring man, I have to admit: It makes perfect sense. Fuck all that noise. Code for Chrome and use by Chrome. The world can't be simpler.
These days you need a CPU, a GPU, a NPU and a TPU (not Tensor, but Teams Processing Unit).
In my case, the TPU is a separate Mac that also does Outlook, and the real work gets done on the Linux laptop. I refer to this as the airgap to protect my sanity.
Intel and Windows have been birds of a feather for 40 years, and it frustrates me that to this day Intel is still designing its CPUs around Windows. Yes, Windows is “#1” in market share. It’s just sad that the world accepts the ads in Windows 11, Teams being an electron app, x86 when ARM is the clear winner (at least for laptops), all the other Windows nonsense.
Slightly off topic, but if I'm aiming to get the fastest 'make -jN' for some random C project (such as the kernel) should I set N = #P [threads] + #E, or just the #P, or something else? Basically, is there a case where using the E cores slows a compile down? Or is power management a factor?
I timed it on the single Intel machine I have access to with E-cores and setting N = #P + #E was in fact the fastest, but I wonder if that's a general rule.
On my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.
On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.
For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.
For software project compilation, enabling SMT is always a win, which is not always the case for other applications, i.e. for those where the execution time is dominated by optimized loops.
Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.
EDIT:
I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.
Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.
So your results are the same as what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, where the SMT cores support multiple threads.
Only with SMT disabled does an extra thread over the number of cores become beneficial.
The reason for this difference in behavior is that with SMT disabled, any thread that is stalled by waiting for I/O leaves a CPU core idle, slowing the progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs a thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate the additional overhead caused by an extra thread.
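The rule of thumb above can be written down as a tiny helper. This is only a sketch of the measurements reported in this thread, not a general law; the function name and parameters are made up, and extending the "+1 without SMT" rule to hybrid P/E chips is my own extrapolation.

```python
def suggested_make_jobs(p_cores, e_cores=0, smt_enabled=True, threads_per_p_core=2):
    """Suggest N for 'make -jN' following the rule of thumb from the thread."""
    if smt_enabled:
        # One job per hardware thread: SMT threads on P cores plus one per E core.
        return p_cores * threads_per_p_core + e_cores
    # Without SMT, one spare job keeps a core busy while another job waits on I/O.
    return p_cores + e_cores + 1
```

For the 12C/24T 5900X this gives 24 with SMT and 13 without, matching the numbers above. It is still worth timing a couple of values around the suggestion on your own machine.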
Power management is a factor because the cores "steal" power from each other. However the E-cores are more efficient so slowing down P-cores and giving some of their power to the E-cores increases the efficiency and performance of the chip overall. In general you're better off using all the cores.
I suggest this depends on the exact model you are using. On Alder/Raptor Lake, the E-cores run at 4.5GHz which is completely futile, and in doing so they heat their adjacent P-cores, because 2x E-core clusters can easily draw 135W or more. This can significantly cut into the headroom of the nearby P. Arrow Lake-S has rearranged the E-cores.
Did you test at least +1 if not *1.5 or something? I would expect you to occasionally get blocked on disk I/O and would want some spare work sitting hot to switch in.
Your processor has two P cores, and ten cores total, not twelve. The HyperThreading (SMT) does not make the two P cores into four cores. Your experiment with 4 threads will most likely result in using both P cores and two E cores, as no sane OS would double up threads on the P cores before the E cores were full with one thread each.
The hyperthreading should cover up memory latency, since the workload (compiling qemu) might not fit into L3 cache. Although I take your point that it doesn't magically create two core-equivalents.
It's much more than that. It also allows one thread to make progress while the other is waiting for memory loads, or filling in instruction slots while the other thread is recovering from a branch mispredict.
Compilers tend to do a lot of pointer chasing and branching, so it's expected that they would benefit decently from hyperthreading.
What do you consider broken about Lunar Lake? It looks good to me. The E-cores are on a separate island to allow the ring to power off and that does lead to lower performance but it's still good enough IMO.
When the P/E model was introduced by Intel, there was a fairly long transition period where both Windows and Linux performed unpredictably poorly for most compute-intensive work loads, to the point where the advice was to disable the E cores entirely if you were gaming or doing anything remotely CPU-intensive or if your OS was never going to be updated (Win 7/8, many LTS Linux).
It's not entirely clear to me why it took a while to add support on Linux because the kernel already supported big.LITTLE and from the scheduler's point of view it's the same thing as Intel's P/E cores. I guess the patch must've been simple but it just took very long to trickle down to common distributions?
I thought this was actually a good question, so I have tried to look it up. If I understand correctly the CPU tells the scheduler! I could not find exactly how though, maybe an MSR?
Triple decoder is one unique effect. The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO. Very well done.
Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies.
I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).
Big cores (P cores or AMD Zen5) obviously can split into 2 hyperthread, but what if that division is still too big? E cores are 4 threads of support in roughly the same space as 1 Pcore.
This is because L2 cache is shared/consolidated, and other resources (ROP buffers, register files, etc. etc.) are just all so much smaller on the Ecore.
It's an interesting design. I'd still think that growing the cores to 4way SMT (like Xeon Phi) or 8way SMT (POWER10) would be a more conventional way to split up resources though. But obviously I don't work at Intel or can make these kinds of decisions.
> The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO
Not just small loops. It can reach 9x instruction decode on almost any control flow pattern. It just looks at the next 3 branch targets and starts decoding at each of them. As long as there is a branch every 32ish instructions (presumably a taken branch?), Skymont can keep all three uop queues full and Rename/dispatch can consume uops at a sustained rate of 8 uops per cycle.
And in typical code, blocks with more than 32 instructions between branches are somewhat rare.
But Skymont has a brilliant trick for dealing with long runs of branchless code too: It simply inserts dummy branches into the branch predictor, breaking them into shorter blocks that fit into the 32 entry uop queues. The 3 decoders will start decoding the long block at three different positions, leap-frogging over each-other until the entire block is decoded and stuffed into the queues.
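A toy model of the leap-frogging, just to picture it. The fixed 32-instruction chunk and the round-robin hand-off between decoders are simplifying assumptions of mine; real hardware deals with variable-length instructions, uops and predictor behaviour that this ignores.

```python
def leapfrog_schedule(block_len, chunk=32, decoders=3):
    """Split a branchless run into chunks and hand them to decoders round-robin.

    Returns a list of (decoder_id, first_instruction, one_past_last_instruction).
    """
    schedule = []
    for i, start in enumerate(range(0, block_len, chunk)):
        end = min(start + chunk, block_len)
        schedule.append((i % decoders, start, end))
    return schedule
```

For a 100-instruction run, decoder 0 takes instructions 0-31, decoder 1 takes 32-63, decoder 2 takes 64-95, and decoder 0 comes back around for the last four.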
This design is absolutely brilliant. It seems to entirely solve the issue decoding X86, with far less resources than a uop cache. I suspect the approach will scale to almost unlimited numbers of decoders running in parallel, shifting the bottlenecks to other parts of the design (branch prediction and everything post decode)
Thanks for the explanation. I was wondering how the heck Intel managed to make a 9-way decode x86 core, and a low-power one of all things. Seems like an elegant approach.
The important bit: Intel E-cores now have 3x decoders each with the ability for 3-wide decode. When they work as a team, they can perform 9 decodes per clock tick (which then bottlenecks to 8 renamed uops in the best case scenario, and more than likely ~4 or ~3 more typical uops).
3-4 uops per cycle is more of an average throughput than a typical throughput.
The average is dragged down by many cycles that don't decode/rename any uops, either waiting for bytes to decode (icache miss, etc.) or because rename is blocked when the ROB is full (probably stalled on a dcache miss).
So you want a quite wide frontend so that whenever you are unblocked, you can drag the average up again.
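A back-of-the-envelope version of that argument. It assumes, very optimistically, that every unblocked cycle runs at the full peak rate, and the stall fraction below is an illustrative number rather than a measurement.

```python
def peak_uops_per_cycle(clusters=3, width_per_cluster=3, rename_width=8):
    """Peak sustained rate: decode width capped by the rename/dispatch width."""
    return min(clusters * width_per_cluster, rename_width)

def average_uops_per_cycle(peak, stalled_fraction):
    """Average rate if stalled cycles deliver nothing and the rest run at peak."""
    return peak * (1.0 - stalled_fraction)
```

With three 3-wide clusters feeding an 8-wide rename, the peak is 8. If half of all cycles are stalled, the average is 4, which is about the 3-4 uops per cycle quoted above. A narrower frontend would average even lower under the same stalls.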
While the frontend of Intel Skymont, which includes instruction fetching and decoding, is very original and unlike that of any other CPU core, the backend of Skymont, which includes the execution units, is extremely similar to that of Arm Cortex-X4 (a.k.a. Neoverse V3 in its server variant and Neoverse V3AE in its automotive variant).
This similarity consists in the fact that both Intel Skymont and Arm Cortex-X4 have the same number of execution units of each kind (and there are many kinds of execution units).
Therefore it can be expected that for any application whose performance is limited by the CPU core backend, the CPU cores Intel Skymont and Arm Cortex-X4 (or Neoverse V3) should have very similar performances.
Moreover, Intel Skymont and Arm Cortex-X4 have the same die area, i.e. around 1.7 square mm (including with both cores 1 MB of L2 cache in this area). Therefore the 2 cores not only should have about the same performance for backend-limited applications, but they also have the same cost.
Before Skymont, all the older Intel Atom cores had been designed to compete with the medium-size Arm Cortex-A7xx cores, even if the Intel Atom cores have always lagged Cortex-A7xx in performance by a year or two. For instance, Intel Tremont had very similar performance to Arm Cortex-A76, while Intel Gracemont and Crestmont have a core backend extremely similar to that of the series from Cortex-A78 to Cortex-A725 (like Gracemont and Crestmont, the 5 cores in the series Cortex-A78, Cortex-A710, Cortex-A715, Cortex-A720 and Cortex-A725 have only insignificant differences in the execution units).
With Skymont, Intel has made a jump in E-core size, positioning it as a match for Cortex-X, not for Cortex-A7xx, like its predecessors.
>positioning it as a match for Cortex-X
Well, the recent Cortex X5 or 925 is already at around 3.4mm2, so that comparison isn't exactly accurate. But I would love to test and see results on Skymont compared to X4. But I don't think they are available yet (as an individual core).
I am really looking forward to Clearwater Forest which is Skymont on 18A for Server.
And I know I am going to sound crazy but I wouldn't mind a small SoC based on Skymont and Xe2 Graphics for Smartphone to Tablets.
> Clearwater Forest which is Skymont on 18A for Server.
Clearwater Forest will be using a further generation improved E-core, Darkmont, which will also sit on top of large local caches using Foveros Direct 3D (like AMD's X3D design). [1]
Likely Darkmont will be a server tweaked version of Skymont, but there is no public info yet available.
This is possibly the critical product which will determine if Intel will be viable from a manufacturing and design perspective...If it gets released in the next 6-9 months with good thermals, IPC, and clock speeds, Intel will have a major winner on its hands. If not....
[1] https://www.intel.com/content/dam/www/central-libraries/us/e...
My guess is that Darkmont is simply Skymont with a different name due to being 18A with different cache design.
But I hope Intel prove me wrong and bring in something even more exciting.
Like I have said, Intel Skymont is a very close match for Cortex-X4, not for Cortex-X925.
With Cortex-X925 Arm has made a big jump in core size, departing from the previous Cortex-X series. This has allowed a good increase in IPC, greatly improving the results of single-threaded benchmarks, but it has been paid for with a much worse performance per area, making Cortex-X925 completely unsuitable for multi-threaded applications. Therefore Cortex-X925, like Intel Lion Cove, is useful only when it is accompanied by smaller cores that handle the multi-threaded workloads.
So unlike with previous Arm cores, Cortex-X925 has not made Cortex-X4 obsolete, as demonstrated e.g. in MediaTek Dimensity 9400, which includes 1 Cortex-X925 to get good single-threaded benchmark scores, together with 3 Cortex-X4 to get good multi-threaded benchmark scores.
It is not clear what Arm's intentions are for the evolution of the Cortex-X series. The rumors are that the next core configuration for smartphones is intended to be like that already deployed by Qualcomm with its custom cores, i.e. to have a big core that is 3 times bigger than the medium-size core and to use 2 big Cortex-X930 cores + 6 medium-size Cortex-A730 cores, for an even split in die area between the big cores and the medium-size cores.
For this to work well, Cortex-X930 must provide a good improvement in performance per area over Cortex-X925, because otherwise it would be hard to justify a 2+6 arrangement, when in the same die area one could have implemented a 1+9 configuration, with the same single-threaded performance, but with better multi-threaded performance.
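The area arithmetic behind the 2+6 versus 1+9 comparison, as a quick check. It takes the rumored 3:1 big-to-medium area ratio at face value; the function name is mine.

```python
def cluster_area(big, medium, big_to_medium_ratio=3):
    """Total core area, measured in units of one medium-size core."""
    return big * big_to_medium_ratio + medium

def big_core_share(big, medium, big_to_medium_ratio=3):
    """Fraction of the total core area spent on the big cores."""
    return big * big_to_medium_ratio / cluster_area(big, medium, big_to_medium_ratio)
```

Both 2+6 and 1+9 come out at 12 medium-core units of area. In 2+6 the big cores take exactly half of it, while in 1+9 they take a quarter and the rest goes to three more medium cores.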
I believe that a small SoC with only 4 Skymont cores and Xe2 graphics would provide performance, battery lifetime and cost for a smartphone that would be completely competitive with any existing Qualcomm, MediaTek or Samsung SoC.
This would be less obvious in a benchmark like GeekBench 6, where Cortex-X925 or Qualcomm Oryon L would show a greater single-threaded score, but the difference would not be great enough to actually matter in real usage. Also for multi-threaded performance measured by GB6, only 4 Skymont cores would seem to be a little slower than the current flagships, but that would be misleading, because 4 Skymont cores could run at full speed for long durations within the smartphone power constraints, while the current 8-core flagships can never run all 8 cores at the 100% performance recorded by GB6, without overheating after a short time.
An 8-core Skymont SoC would be excellent for a cheap tablet with long battery lifetime and great performance, even if again, such a configuration would be penalized by GB6, which favors having 1 huge core, like Cortex-X925, for the ST score, together with an over-provisioned set of medium-size cores, which can run all together only for the short time required to complete the GB6 sub-benchmarks, but in real prolonged usage must never be all completely busy at the same time, in order to avoid overheating.
Skymont is an improvement but...
Skymont area efficiency should be compared to Zen 5C on 3nm. It has higher IPC, SMT with dual decoders - one for each thread, and full rate AVX-512 execution.
AMD didn't have major difficulties in scaling down their SMT cores to achieve similar performance per area. But Intel went with different approach. At the cost of having different ISA support on each core in consumer devices and having to produce an SMT version of their P cores for servers anyway.
It should be noted that Intel Skymont has the same area as Arm Cortex-X4 (a.k.a. Neoverse V3), and it should also have the same performance for any backend-limited application. Both use 1.7 square mm in the "3 nm" TSMC fabrication process, while a Zen 5 compact might have almost double that area in the less dense "4 nm" process with full vector pipelines, and a 3 square mm area with reduced vector pipelines in the same less dense process.
Arm Cortex-X4 has the best performance per area among the cores designed by Arm. Cortex-X925 has double the area of Cortex-X4, which results in a much lower performance per area. Cortex-A725 is smaller in area, but the area ratio is likely to be smaller than the performance ratio (for most kinds of execution units Cortex-X4 has double the number, while for some it has only a 50% or a 33% advantage), so it is likely that the performance per area of Cortex-A725 is worse than that of Cortex-X4 and of Skymont.
For any programs that benefit from vector instructions, Zen 5 compact will have a much better performance per area than Intel Skymont and Arm Cortex-X4.
For programs that execute mostly irregular integer and pointer operations, there are chances for Intel Skymont and Arm Cortex-X4 to achieve better performance per area, but this is uncertain.
Intel Skymont and Arm Cortex-X4 have a greater number of integer/pointer execution units per area than Zen 5 compact, even if Zen 5 compact were made with a TSMC process equally dense, which is not the case today.
Despite that, the execution units of Zen 5 compact will be busy a much higher percentage of the time, for several reasons. Zen 5 is better balanced, it has more resources for ensuring out-of-order and multithreaded execution, it has better cache memories. All these factors result in a higher IPC for Zen 5.
It is not clear whether the better IPC of Zen 5 is enough to compensate its greater area, when performing only irregular integer and pointer operations. Most likely is that in such cases Intel Skymont and Arm Cortex-X4 remain with a small advantage in performance per area, i.e. in performance per dollar, because the advantage in IPC of Zen 5 (when using SMT) may be in the range of 10% to 50%, while the advantage in area of Intel Skymont and Arm Cortex-X4 might be somewhere between 50% and 70%, had they been made with the same TSMC process.
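Putting rough numbers on that estimate. This sketch reads "area advantage of 50% to 70%" as Zen 5 compact being 1.5x to 1.7x the size of the smaller core in the same process, which is my interpretation of the figures above rather than a measured value.

```python
def relative_perf_per_area(ipc_advantage, area_advantage):
    """Performance per area of the bigger core relative to the smaller one.

    ipc_advantage:  0.10 means the bigger core has 10% higher IPC.
    area_advantage: 0.50 means the bigger core occupies 50% more area.
    A result below 1.0 means the smaller core wins on performance per area.
    """
    return (1.0 + ipc_advantage) / (1.0 + area_advantage)
```

The best case for Zen 5 compact (50% IPC advantage, 50% more area) comes out at exactly 1.0, a tie. The worst case (10% IPC advantage, 70% more area) is about 0.65. So across that range the smaller cores keep anywhere from no advantage to a substantial one for irregular integer and pointer code.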
On the other hand, for any program that can be accelerated with vector instructions, Zen 5 compact will crush in performance per area (i.e. in performance per dollar) any core designed by Intel or Arm.
> It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).
Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async
Not... quite. I think you've got the cause-and-effect backwards.
Programmers who happen to write multiple-threaded programs don't need powerful cores, they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Programmers who happen to write powerful single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously have 96MB of L3 cache, and video games that are on these very-powerful cores have much better performance.
It's not "Programmers should change their code to fit the machine". From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers... largely represented by the Video Game programmers... want P-cores. But not very many of them.
Multithreaded programmers... represented by Graphics and a few others... want E-cores. Splitting a P-core into "only" 2 threads is not sufficient, they want 4x or even 8x more cores. Because there's multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.
--------
> Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async
Power-efficiency is going to be incredibly difficult moving forward.
It should be noted that E-cores are not very power-efficient though. They're area-efficient, i.e. cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.
E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power-efficiency is their particular design goal.
If your code benefits from cache (i.e. big cores), it's probable that the lowest power cost would be to run on cores with large caches (like P-cores or Zen5 or Zen5 X3D). Communicating with RAM always takes more power than just communicating with caches, after all.
If your code does NOT benefit from cache (ie: Blender regularly has 100GB+ scenes for complex movies), then all of those spare resources on P-cores are useless, as nothing fits anyway and the core will be spending almost all of its time waiting on RAM to do anything. So the E-core will be more power efficient in this case.
> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Is this true? In most of my work I'd usually rather have a single serializable thread of execution. Any parallelism usually comes with added overhead of synchronization, and added mental overhead of having to think about parallel execution. If I could freely pick between 4 IPC worth of single core or 1 IPC per core with 4 cores I'd pretty much always pick a single core. The catch is that we're usually not trading like for like. Meaning I can get 3 IPC worth of single core or 4 IPC spread over 4 cores. Now I suddenly have to weigh the overhead and analyze my options.
Would you ever rather have multiple cores or an equivalent single core? Intuitively it feels like there's some mathematics here.
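A rough way to put numbers on that question, as a sketch rather than a real model: assume an Amdahl-style workload where a fraction p of the work parallelizes perfectly and synchronization costs nothing (both assumptions are generous to the multi-core side). The function names are made up for illustration.

```python
def single_core_time(speed):
    """Run time of one unit of work on a single core of the given relative speed."""
    return 1.0 / speed


def multicore_time(parallel_fraction, cores, per_core_speed=1.0):
    """Amdahl-style run time: the serial part runs on one core, the rest is split evenly."""
    serial = 1.0 - parallel_fraction
    return (serial + parallel_fraction / cores) / per_core_speed


def break_even_parallel_fraction(single_speed, cores, per_core_speed=1.0):
    """Parallel fraction at which the many slow cores tie the one fast core.

    Solves (1 - p + p / cores) / per_core_speed == 1 / single_speed for p.
    A result of 1.0 or more means the slow cores can never win under this model.
    """
    return (1.0 - per_core_speed / single_speed) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    # The trade from the comment above: 3 IPC on one core vs 1 IPC on each of 4 cores.
    p = break_even_parallel_fraction(single_speed=3.0, cores=4)
    print(f"4 slow cores win once more than {p:.3f} of the work is parallel")
```

Under these assumptions the 4 slow cores only pull ahead when roughly 89% or more of the work is parallel, and with like-for-like totals (4 IPC vs 4x 1 IPC) the break-even is 100%, which matches the intuition that the single core is never worse in that case.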
Indeed a single thread is most simple to reason about, but if you have a single task that can already use 2 cores uniformly, going to 8 cores (assuming enough workload) should be a pretty clean 4x speedup (as long as you don't run into memory bandwidth limits, but that'd cap the single-threaded code too).
But the difference between E-core and P-core perf is way less than 4x; the OP article shows a 1.6x/1.7x difference in SPEC for Skymont vs Lion Cove, and 1.3x/1.7x for Crestmont vs Redwood Cove; and some searching around for past generations gives numbers around 1.4x.
Increasing core counts being a much more area- and energy-efficient way for hardware to provide more total performance than making the individual cores faster is a pretty fundamental thing.
For stuff like path tracing you have to work very hard not to trash the caches, so you're often just waiting for memory.
That's why such workloads get near-linear scaling when using hyper-threads, unlike workloads like LLMs which are memory bandwidth bound.
E-cores can typically execute ~4-instructions per clock tick in highly optimized code.
P-cores go up to like... 6-instructions. Better yes, but not dramatically better. The real issue is that P-cores have far more resources than E-cores: deeper reorder buffers to perform more out-of-order execution. Deeper branch prediction, more register files, larger caches, etc. etc.
So P-cores should be hitting the max of 6-instructions per clock tick on more kinds of code. E-cores have much smaller caches (and other resources) so they'll run out and start stalling out to memory-limitations, which is like 0.1 instructions per clock tick or slower.
----------
But guess what? If a fancy P-core is memory-bound (like a lot of Blender code, due to the large-scale dozens+ GBs nature of modern 3d scenes), then those fancy P-cores run out of resources and are 0.1 IPC as well.
If both P-cores and E-cores are stalled out waiting on memory, you'd rather have 32x E-Cores all executing at 0.1 IPC, rather than only 8x P-cores executing at 0.1 IPC.
It's going to be a complex world moving forward: modern CPUs are growing far more complex and it's not clear what the tradeoffs will be. But this reality of E-core and P-cores stalling out and waiting on memory is just how modern code works in too many cases.
And remember, 4x E-cores are equivalent in area/cost to 1x P-core. So there's no contest in terms of overall instructions-per-second for E-cores vs P-cores. The E-cores simply are better, even if the individual threads run slower.
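The arithmetic behind that claim can be sketched in a few lines. The IPC figures are the rough ones quoted above (about 4 for an E-core, about 6 for a P-core, 0.1 when stalled on memory), and the 4:1 area ratio is also taken from this thread rather than from measurements; perfect scaling across cores is assumed.

```python
E_CORES_PER_P_CORE_AREA = 4  # area ratio claimed in the thread, not measured


def aggregate_ipc(cores, ipc_per_core):
    """Total instructions per clock across identical cores, assuming perfect scaling."""
    return cores * ipc_per_core


# (E-core IPC, P-core IPC) for the two situations discussed above.
scenarios = {
    "compute-bound": (4.0, 6.0),
    "memory-stalled": (0.1, 0.1),
}

if __name__ == "__main__":
    for name, (e_ipc, p_ipc) in scenarios.items():
        e_total = aggregate_ipc(E_CORES_PER_P_CORE_AREA, e_ipc)
        p_total = aggregate_ipc(1, p_ipc)
        print(f"{name}: 4 E-cores {e_total:.1f} IPC vs 1 P-core {p_total:.1f} IPC "
              f"({e_total / p_total:.1f}x per unit area)")
```

In both cases the same silicon area does more total work as E-cores, provided the workload actually has enough threads to keep them busy.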
Obviously it is easier to write any program as a single sequential thread, because you do not need to think about the dependencies between program statements. When you append a statement, you assume that all previous statements have been already executed, so the new statement can access without worries any data it needs.
The problem is that the speed of a single thread is limited and there exists no chance to increase it by significant amounts.
As long as we continue to use silicon, there will be negligible increases in clock frequency. Switching to other semiconductors might bring us double the clock frequency 10 years from now, but there will never again be a decade like that from 1993 to 2003, when clock frequencies increased 50 times.
The slow yearly increase in instructions per clock cycle is obtained by making the hardware do more and more of the work that has not been done by the programmer or the compiler, i.e. by extracting from the sequential program the separate chains of dependent instructions that should have been written as distinct threads, in order to execute them concurrently.
This division of a single instruction sequence into separate threads is extremely inefficient when done at runtime by hardware. Because of this the CPU cores with very high IPC have lower and lower performance per area and per power with the increase of the IPC. Low performance per area and per power means low multithreaded performance.
So the CPU cores with very good single-threaded performance, like Intel Lion Cove or Arm Cortex-X925 have very poor multi-threaded performance and using many of them in a CPU would be futile, because in the same limited area one could put many more small CPU cores, achieving a much higher total performance.
This is why such big CPU cores that are good for single-threaded applications must be paired with smaller CPU cores, like Intel Skymont or Arm Cortex-X4, in order to obtain a good multi-threaded performance.
Writing the program as a single thread is easy and of course it should always be done so when the achieved performance is good enough on the current big superscalar CPU cores.
On the other hand, whenever the performance is not sufficient, there is no way to increase it a lot otherwise than by decomposing the work that must be done into multiple concurrent activities.
The easy case is that of iterations, which frequently provide large amounts of work that can be done concurrently. Moreover, with iterations there are many tools that can create concurrent threads automatically, like OpenMP or NVIDIA CUDA.
Where there are no iterations, one may need to do much more work to identify the dependencies between activities, in order to determine which may be executed concurrently, because they do not have functional dependencies between them.
However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
Nevertheless even in such cases one can question whether the specification of the program is truly what the end user needs. For instance, when computing a hash over a file, the actual goal is normally not the computation of the hash, but to verify whether the file is the same as another (where the other file may be a past version of the same file, to detect modification, or an apparently distinct file coming from another source, when deduplication is desired). In such cases, it does not really matter which hash function is used, so it may be acceptable to replace the hash algorithm with another that allows concurrent computation, solving the performance problem.
Similar reformulations of the problem that must be solved may help in other cases where initially it appears that it is not possible to decompose the workload into concurrent tasks.
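As a sketch of that kind of reformulation: instead of one sequential hash over the whole file, hash fixed-size chunks independently and then hash the list of chunk digests. This is an ad-hoc two-level scheme made up for illustration, not a standard construction, and its output is not the plain SHA-256 of the data. Both sides of a comparison must use the same chunk size. It relies on hashlib releasing the GIL for larger buffers so that threads can overlap.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB; an arbitrary choice for this sketch


def _hash_chunk(chunk: bytes) -> bytes:
    """Digest of one chunk; chunks have no dependencies on each other."""
    return hashlib.sha256(chunk).digest()


def chunked_digest(data: bytes, chunk_size: int = CHUNK_SIZE, workers: int = 4) -> str:
    """Two-level digest: hash every chunk concurrently, then hash the chunk digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result is deterministic.
        leaf_digests = list(pool.map(_hash_chunk, chunks))
    root = hashlib.sha256()
    root.update(len(data).to_bytes(8, "big"))  # bind the total length into the result
    for digest in leaf_digests:
        root.update(digest)
    return root.hexdigest()


if __name__ == "__main__":
    payload = b"example data " * 500_000
    print(chunked_digest(payload))
```

The only sequential step left is hashing the short list of chunk digests, which is tiny compared with the file itself.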
> However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
Even in that case, you would probably benefit from having many cores because the user is probably running other things on the same machine, or the program is running on a runtime with eg garbage collector threads etc. I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
> I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
But that happens all the time in video game code.
Video games may have many threads running, but there's usually a single-thread bottleneck. To the point that P-cores and massively huge Zen5 cores are so much better for video games.
Javascript (ie: rendering webpages) is single-threaded bound, which is probably why the Phone makers have focused so much on making bigger cores as well. Yes, there's plenty of opportunities for parallelism in web browsers and webpages. But most of the work is in the main Javascript thread called at the root.
>A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Nah, Blender programmer will prefer one core with AVX-512 instead of 4 without it.
That's just more parallelism, they'll take their parallelism wherever they can get it.
It's to be seen if the future is more SIMD or more smaller general processors. Arguably the latter are more flexible but maybe not as efficient as the former.
I mean, eventually yeah.
I like Zen5 as much as the next guy, but it should be noted that even today's most recent version of Blender is AVX (256-bit) only. That means E-cores remain the optimal core to work with for a lot of Blender stuff.
Hopefully AMD Zen5 AVX512 becomes more popular. Maybe it'd become more popular as Intel rolls out AVX10 (somewhat compatible instruction set)
Would blender benefit from the bits of AVX-512 other than the width? I would think the approximate sqrt instructions might be useful.
AVX512 is one of the best instruction sets I've seen. No joke.
There's all kinds of things AVX512 would help out in Blender. But those ways are incompatible with older AVX2 or SSE code. The question is if Blender will be willing to support SSE, AVX, and AVX512 code paths. Each new codepath is more maintenance and more effort.
AVX512 has more registers: not just 32x 512-bit registers (AVX normally has 16x 256-bit registers), but also the kmask registers (64-bits that take the place of old boolean logic that used to be done on the 256-bit registers). This alone should give far more optimizations for the compiler to automatically find.
There's also VPCOMPRESSB and VPEXPANDB, Conflict-detection, and other instructions that make new SIMD data-structures far more efficient to implement. But this requires deep understanding that very few programmers have yet.
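For anyone who hasn't seen them, here is a scalar emulation of what compress and expand do with zero-masking. It only illustrates the semantics; the real instructions operate on a whole vector register under a kmask in a single instruction, and the function names here are made up.

```python
def compress(values, mask, fill=0):
    """Pack the elements whose mask bit is set into the low positions; fill the rest."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    kept = [v for v, m in zip(values, mask) if m]
    return kept + [fill] * (len(values) - len(kept))


def expand(packed, mask, fill=0):
    """Inverse of compress: scatter the low elements of packed to the set mask positions."""
    if len(packed) != len(mask):
        raise ValueError("packed and mask must have the same length")
    source = iter(packed)
    return [next(source) if m else fill for m in mask]


if __name__ == "__main__":
    lanes = [10, 20, 30, 40, 50, 60, 70, 80]
    keep = [1, 0, 1, 1, 0, 0, 1, 0]
    packed = compress(lanes, keep)
    print(packed)
    print(expand(packed, keep))
```

Stream filtering ("keep only the elements that pass a test, packed contiguously") is the classic use: without compress it takes a shuffle-table lookup or a scalar loop.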
> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.
Don’t they really want GPU threads for that? You wouldn’t get by with just weaker cores.
Cloth physics in Blender are stored in RAM (as scenes and models can grow very large, too large for a GPU).
Figuring out which vertices for a physics simulation need to be sent to the GPU would be time, effort, and PCIe traffic _NOT_ running the cloth physics.
Furthermore, once all the data is figured out and blocked out in cache, it's... cached. Cloth physics only interacts with a small number of close, nearby objects. Yeah, you _could_ calculate this small portion and send it to the GPU, but CPU is really good at just automatically traversing trees and storing the most recently used stuff in L1, L2, and L3 caches automatically (without any need of special code).
All in all, I expect something like Cloth physics (which is a calculation Blender currently does on CPU-only), is best done CPU only. Not because GPUs are bad at this algorithm... but instead because PCIe transfers are too slow and cloth physics is just too easily cached / benefited by various CPU features.
It'd be a lot of effort to translate all that code to GPU and you likely won't get large gains (like Raytracing/Cycles/Rendering gets for GPU Compute).
NVIDIA's PhysX has its own cloth physics abstractions: https://docs.nvidia.com/gameworks/content/gameworkslibrary/p..., so I'm sure it is a thing we do on GPUs already, if only for games. These are old demos anyways:
https://www.youtube.com/watch?v=80vKqJSAmIc
I wonder what the difference is between the cloth physics you are talking about and the one NVIDIA has been doing for I think more than a decade now? Is it scale? It sounds like, at least, there are alternatives that do it on the GPU and there are questions if Blender will do it on the GPU:
https://blenderartists.org/t/any-plans-to-make-cloth-simulat...
Cloth / Hair physics in those games were graphics-only physics.
They could collide with any mesh that was inside of the GPU's memory. But those calculations cannot work on any information stored on CPU RAM. Well... not efficiently anyway.
---------
When the Cloth simulator in Blender runs, it generates all kinds of information the CPU needs for other steps. In effect, Blender's cloth physics serves as an input to animation frames, which is all CPU-side information.
Again: I know cloth physics executes on GPUs very well in isolation. But I'd be surprised if BLENDER's specific cloth physics would ever be efficient on a GPU. Because as it turns out, calculations kind of don't matter in the big-picture. There's a lot of other things you need to do after those calculations (animations, key frames, and other such interactions). And if all that information is stored randomly in 100GB of CPU RAM, it'd be very hard to untangle that data and get it to a GPU (and back).
In a Video Game PhysX setting, you just display the cloth physics to the screen. In Blender, a 3d animation program, you have to do a lot more with all that information and touch many other data-structures.
PCIe is very slow compared to RAM.
> I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).
AMD did something similar before. Doesn't anyone remember the Bulldozer cores sharing resources between pairs of cores?
In Bulldozer, two cores shared a single Floating Point Unit. A bet was made that the majority of calculations would be done in integers. Unfortunately, the real world requires decimal places, and languages like Javascript only have floating point numbers. Bulldozer flopped hard.
AMD's modern SMT implementation is just better than Bulldozer's decoder sharing.
Modern Zen5 has very good SMT implementations. Back 10 years ago, people mostly talked about Bulldozer's design to make fun of it. It always was intriguing to me but it just never seemed to be practical in any workflow.
I see "ROP" and immediately think of Return Oriented Programming and exploits...
Lulz, I got some wires crossed. The CPU resource I meant to say was ROB: Reorder Buffer.
I don't know why I wrote ROP. You're right, ROP means return oriented programming. A completely different thing.
"Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies."
@dragontamer solid point. Consider an in-memory ring shared between two threads. There's a huge difference in throughput and latency if the threads share L2 (on the same core) versus running on different cores, all down to the relative slowness of L3.
Are there other cpus (arm, graviton?) that have similarly shared L2 caches?
Hyperthreading actually shares L1 caches between two threads (after all, two threads are running in the same L1 cache and core).
I believe SMT4 and SMT8 cores from IBM Power10 also have L1 caches shared (8 threads on one core), and thus benefit from communication speeds.
But you're right in that this is a very obscure performance quirk. I'm honestly not aware of any practical code that takes advantage of this. E-cores are perhaps the most "natural" implementation of high-speed core-to-core communications though.
What we desperately need before we get too deep into this is stronger support in languages for heterogeneous cores in an architecture agnostic way. Some way to annotate that certain threads should run on certain types of cores (and close together in memory hierarchy) without getting too deep into implementation details.
I don't think so. I don't trust software authors to make the right choice, and even the most tilted examples of where a thread will usually need a bigger core can afford to wait for the scheduler to figure it out.
And if you want to be close together in the memory hierarchy, does that mean close to the RAM that you can easily get to? And you want allocations from there? If you really want that, you can use numa(3).
> without getting too deep into implementation details.
Every microarchitecture is a special case about what you win by being close to things, and how it plays with contention and distances to other things. You either don't care and trust the infrastructure, or you want to micromanage it all, IMO.
I'm talking about close together in the cache. If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be allocated on the same l2 cache. And no matter what, you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.
NUMA aware threading is somewhat rare but it does exist.
It's just reaching into the high arts of high-performance that fewer-and-fewer programmers know about. I myself am not an HPC expert, I just like to study this stuff on the side as a hobby.
So NUMA-awareness is when your code knows that &variable1 is located in one physical location, while &variable2 is somewhere else.
This is possible because NUMA-aware allocators (numa_alloc_onnode from libnuma on Linux, VirtualAllocExNuma on Windows) can take parameters that request an allocation within a particular NUMA zone.
Now that you know certain variables are tied together in physical locations, you can also tie threads together with affinity to those same NUMA locations. And with a bit of effort, you can ensure that threads that are in one workpool share the same NUMA zones.
---------
Now code-awareness of shared caches is less common. But following the same models of "abstracted work pools of thread-affinity + NUMA awareness of data", programmers have been able to ensure Zen1 cores to be working together with the same L3 cache.
L2 cache with E-cores is new, but not a new concept in general. (IE: the same mechanisms and abstractions we used for thread-affinity checks on Zen cores sharing L3 cache, or multi-socket CPUs being NUMA Aware... all would still work for L2 cache).
I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
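On Linux at least, the raw information is already exposed, so a work pool could build "threads that share an L2" groups itself. A minimal sketch, assuming the usual sysfs cache layout (`cache/index*/level` and `shared_cpu_list` under each CPU directory); the function names are made up, and `os.sched_setaffinity` is Linux-only.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l2_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sets of CPUs that share an L2 cache, sorted by lowest CPU."""
    groups = set()
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index*", "level")
    for level_path in glob.glob(pattern):
        with open(level_path) as f:
            if f.read().strip() != "2":
                continue
        shared_path = os.path.join(os.path.dirname(level_path), "shared_cpu_list")
        with open(shared_path) as f:
            cpus = parse_cpu_list(f.read())
        if cpus:
            groups.add(frozenset(cpus))
    return sorted(groups, key=min)


def pin_to_group(cpus):
    """Restrict the caller to one L2 group (Linux only; pid 0 means the caller)."""
    os.sched_setaffinity(0, cpus)


if __name__ == "__main__":
    for group in l2_groups():
        print(sorted(group))
```

On a machine with an E-core cluster this should print one 4-CPU group per cluster, and one group per P-core (2 CPUs with hyperthreading). Whether pinning actually helps is a separate question that only measurement answers.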
> I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
Intel can declare in ACPI a set of nodes, the distances between nodes, and then Linux/libnuma/etc pick it up.
So, e.g. in AMD's SLIT tables, the local node is 10; within the same partition are 11; within the same socket are 12; distant sockets are >=20.
There's fancier, more detailed tables (e.g. HMAT) and some code out there that uses them, but it's kind of beyond the scope of libnuma.
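To make the SLIT numbers concrete, here is a small sketch that ranks nodes by distance from a given node. On Linux the rows can be read from `/sys/devices/system/node/nodeN/distance`; the 4-node matrix below is a made-up layout that just reuses the 10/11/12 values above plus an arbitrary 32 for the far socket.

```python
def parse_distance_row(text):
    """Parse one sysfs distance row such as '10 11 12 32' into a list of ints."""
    return [int(field) for field in text.split()]


def nodes_by_distance(distances, node):
    """Return all node ids ordered from nearest to farthest as seen from `node`."""
    row = distances[node]
    return sorted(range(len(row)), key=lambda other: (row[other], other))


# Hypothetical machine: nodes 0 and 1 share a partition, node 2 is elsewhere
# on the same socket, node 3 sits on a different socket.
EXAMPLE_SLIT = [
    [10, 11, 12, 32],
    [11, 10, 12, 32],
    [12, 12, 10, 32],
    [32, 32, 32, 10],
]

if __name__ == "__main__":
    for node in range(len(EXAMPLE_SLIT)):
        print(node, nodes_by_distance(EXAMPLE_SLIT, node))
```

The values are relative costs, not latencies in any unit, so they are only good for ordering candidates, which is roughly how an allocator falls back when the preferred node is full.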
> you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.
I trust systems to do better based on observed behavior rather than a software engineer's guess of how it will be scheduled. Who knows if, in a given use case, the program is a "small" part of the system or a "large" part that should get preferential placement and scheduling.
> If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be allocated on the same l2 cache.
And so this is kind of a weird thing: we know we're going to be performance critical and we need things to be forced to be adjacent... but we don't know the exact details of the hardware we're running on. (Else, just numa_bind and be done...)
The beauty is that you don't care what hardware you run on, all you're annotating are very useful but generic properties such as which threads are sharing a lot of memory, or perhaps that a thread should have highest performance priority so that internally it stays on p cores instead of the more scalable e cores. Very simple optional hints.
> should have highest performance priority so that internally it stays on p cores
Everything will decide that it wants P cores; it's not punished for battery or energy impact, and wants to win over other applications for users to have a better experience with it.
And even if not made in bad faith, it doesn't know what else is running on the system.
Also these decisions tend to be unduly influenced by microbenchmarks and then don't apply to the real system.
> which threads are sharing a lot of memory
But if they're not super active, should the scheduler really change what it's doing? And doesn't the size of that L2 matter? It doesn't matter if e.g. the stuff is going to get churned out before there's a benefit from that sharing.
In the end, if you don't know pretty specific details of the environment you'll run on: what the hardware is like, what loading is like, what data set size is like, and what else will be running on the machine -- it is probably better to leave this decision to the scheduler.
If you do know all those things, and it's worth tuning this stuff in depth-- odds are you're HPC and you know what the machine is like.
To clarify, what gets scheduled is up to the OS or runtime, all you're doing is setting relative priority. If everything is all the same priority, then it's just as likely to all run on e cores.
And then, what's the point?
A system that encourages everyone to jack everything up is pointless.
A system to tell the OS that the developer anticipates that data is shared and super hot will be mostly lied to (by accident or on purpose).
There's the edge cases: database servers, HPC, etc, where you believe that the system has a sole occupant that can predict loading.
But libnuma, and the underlying ACPI SRAT/SLIT/HMAT tables are a pretty good fit for these use cases.
If you lie about the nature of your application, you'll only hurt performance in this configuration. You're not telling the OS what cores to run on, you're simply giving hints as to how the program behaves. It's no different than telling the threadpool manager how many threads to create or if a thread is long lived. It's a platform agnostic hint to help performance. And remember, this is all optional, just like the threadpool example that already exists in most major languages. Are you going to argue that programs shouldn't have access to core count information on the cpu too? They'll just shoot their foots as you said.
Again, there's already explicit ways for programs to show fine control; this stuff is already declared in ACPI and libnuma and higher level shims exist over it. But generally you want to know both how the entire machine is being used and pretty detailed information about working set sizes before attempting this.
Most things that have tried to set affinities have ended up screwing it up.
There's no need to put an easier user interface on the footgun or to make the footgun cross-platform. These interfaces provide opportunities for small wins (generally <5%) and big losses. If you're in a supercomputing center or a hyperscaler running your own app, this is worth it; if you're writing a DBMS that will run on tens of thousands of dedicated machines, it may be worth it. But usually you don't understand the way you'll be employed well enough to know if this is a win.
In the context of the future of heterogeneous computing, where your standard pc will have thousands of cores of various capabilities and locality, I very much disagree.
> where your standard pc will have thousands of cores
Thousands of non-GPU cores, intended to run normal tasks? I doubt it.
Thousands of special purpose cores running different programs like managing power, managing networks, managing RGB lighting around? Maybe, but that doesn't really benefit from this.
Thousands of cores including GPU cores? What you're talking about in labelling locality isn't sufficient to address this problem, and isn't really even a significant step towards its solution.
CPUs are trending towards heterogeneous many core implementations. 16 core was considered server exclusive a few decades ago, now we're at heterogeneous 24 core on an Intel 14900K cpu. The biggest limit right now is on the software side, hence my original comment. I wouldn't be surprised if someday the cpu and gpu become combined to overcome the memory wall, with many different types of specialized cores depending on the use case.
The software side is limited, somewhat intrinsically (there tend to be a lot of things we want to do in order--- Amdahl's law wins).
And even when you aren't intrinsically limited by that, optimal placement doesn't reduce contention that much (assuming you're not ping-ponging a single cache line every operation or something dumb like that).
But the hardware side, too: we're not getting transistors that quickly anymore, and we don't want anything too much smaller than an Intel E-core. Even if we stack 3D, all that net wafer area is not cheap and isn't cheapening quickly.
OpenMP, Intel's TBB and other libraries/tools are clearly moving in this direction.
The main issue is that Intel is... well Intel. Even if they write a good library, there's probably 0% chance it'd work well on ARM systems, their competitor. (And only a small chance that it'd be optimized for AMD).
------
Microsoft did put a lot of work into ConcRT, but it doesn't look very successful. It's a very clean model of task-based scheduling, but I'm not seeing too much buzz about it or too many blog posts marketing the benefits.
The other problem Intel has is that they are apparently a horrible factional mess of a company. The fact that the P and E cores are completely separate architectures that sometimes don't even agree on what instruction set they are supporting (e.g. avx-512) is kind of crazy.
AMD had Bulldozer and Bobcat back in the day. Two teams with two different goals is fine, as long as they work together at the end.
And P-cores and E-cores do seem like they are working together well in the "Ultra" series.
using AMD in the bulldozer era as a comparison to Intel is a really bad sign.
Intel advertising the fact that their schedulers can keep MS Teams confined to the efficiency cores... what a sad reflection of how bloated Teams is.
We make a single Electron-like app grow, cancer-like, to do everything from messaging and videoconferencing to shared drive browsing and editing, and as a result we have to contain it.
It can run in your browser too. The electron part isn't the bloat but the web part. Web devs keep using frameworks on top of frameworks and the bloat is endless. Lack of good native UX kits forces devs to use web-based kits. Qt has a nice idea with qml but aside from some limitations, it is mostly C++ (yes, pyqt, etc. exist).
Native UI kits should be able to do better than web-based kits. But I suspect just as with the web, the problem is consistency. The one thing the web does right is deliver consistent UI experience across various hardware with less dev time. It all comes down to which method has least amounts of friction for devs? Large tech companies spent a lot of time and money in dev tooling for their web services, so web based approaches to solve problems inherently have to be taken for even trivial apps (not that teams is one).
Open source native UX kits that work consistently across platforms and languages would solve much of this. Unfortunately, the open source community is stuck on polishing gtk and qt.
Not just consistency. Microsoft themselves don't have a modern, stable UI toolkit anymore. Linux is a mess. Only MacOS has something decent.
Then there is the fact that with native you need separate native app devs, familiar with the tooling and environment, because they're totally different, so costs balloon. A lot. Not to mention the difficulty of hiring, compared to Electron.
There are practical reasons why Electron won, people are just ignorant of those reasons. If they poured their hate into solving the problem instead, we might have something decent already. But it's easier to complain. So here we are.
Personally, I'm annoyed, but understand why it is like it is.
There are a bunch of new native UI toolkits as well, such as Slint [https://slint.dev]
Sometimes I find myself wistfully thinking of Java Swing applets with native theme settings.
Same here. The write once run everywhere eventually did win but that was the web. And comparatively speaking, JVM is so much better today than it was 20 years ago.
I sometimes wonder if Chromium actually does any specific optimisation for Electron-related usage.
They say nostalgia is always deceptive, but I miss ~java2 era UX.
I don't think it was Java itself, but many operating systems simply had a much stronger set of UX that was cohesively being followed.
Yet here we are, in an era where you can encounter multiple choices and you don't know whether it's a single select versus a multi-select tick box. And then there's something with a boolean state, and it's not clear which color means it's currently active. Then you hit alt-F for the File menu in order to quit your browser in frustration, but the web page blocks you because it has decided that means that you're going to "Favorite" whatever you're looking at.
> The electron part isn't the bloat but the web part.
The bloat part is the bloat. Web apps can be made to be perfectly performant if you are diligent, and native apps can be made to be bloated and slow if you're not.
>But I suspect just as with the web, the problem is consistency.
That is indeed the "problem" at its core. People are lazy, operating systems aren't consistent enough.
Operating systems came about to abstract all the differences of countless hardware away, but that is no longer good enough. Now people want to abstract away that abstraction: Chrome.
Chrome is the abstraction layer to Windows, MacOS, iOS, Linux, Android, BSD, C++, HTML, PHP, Ruby, Rust, Python, desktops, laptops, smartphones, tablets, and all the other things. You develop for Chrome and everyone uses Chrome and everyone on both sides gets the same thing for a singular effort.
If I were to step away from all I know and care about computers and see as an uncaring man, I have to admit: It makes perfect sense. Fuck all that noise. Code for Chrome and use by Chrome. The world can't be simpler.
These days you need a CPU, a GPU, a NPU and a TPU (not Tensor, but Teams Processing Unit).
In my case, the TPU is a separate Mac that also does Outlook, and the real work gets done on the Linux laptop. I refer to this as the airgap to protect my sanity.
Teams is technically WebView2 and not Electron… with a bunch of native platform code.
It’s amazing what computers can do today and how absolutely inefficiently they are used.
Intel and Windows have been birds of a feather for 40 years, and it frustrates me that to this day Intel is still designing its CPUs around Windows. Yes, Windows is “#1” in market share. It’s just sad that the world accepts the ads in Windows 11, Teams being an electron app, x86 when ARM is the clear winner (at least for laptops), all the other Windows nonsense.
Slightly off topic, but if I'm aiming to get the fastest 'make -jN' for some random C project (such as the kernel) should I set N = #P [threads] + #E, or just the #P, or something else? Basically, is there a case where using the E cores slows a compile down? Or is power management a factor?
I timed it on the single Intel machine I have access to with E-cores and setting N = #P + #E was in fact the fastest, but I wonder if that's a general rule.
In my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.
On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.
For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.
For software project compilation, enabling SMT is always a win, which is not always the case for other applications, i.e. for those where the execution time is dominated by optimized loops.
Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.
EDIT: I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.
Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.
So your results are the same as what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, where the SMT cores support multiple threads.
Only with SMT disabled does an extra thread over the number of cores become beneficial.
The reason for this difference in behavior is that with SMT disabled, any thread that is stalled waiting for I/O leaves a CPU core idle, slowing the progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs one thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate for the additional overhead caused by an extra thread.
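To make the rule of thumb above concrete, here is a minimal Python sketch. The function name and parameters are made up for illustration, and extending the "+1 with SMT disabled" rule to hybrid P/E parts is an assumption; the comment above only measured it on a 5900X.

```python
def suggested_make_jobs(p_cores, e_cores=0, smt_enabled=True, threads_per_p_core=2):
    """Return a starting value for N in `make -jN`.

    Hypothetical helper encoding the rule of thumb from the thread:
    - SMT enabled:  one job per hardware thread.
    - SMT disabled: one job per core, plus one spare to cover I/O stalls.
    E-cores are assumed to run one thread each.
    """
    if smt_enabled:
        # P-cores contribute their SMT threads, E-cores one thread each.
        return p_cores * threads_per_p_core + e_cores
    # Assumption: the "+1" rule carries over unchanged when E-cores are present.
    return p_cores + e_cores + 1


# 12C/24T 5900X, as in the example above.
print(suggested_make_jobs(12, smt_enabled=True))   # make -j24
print(suggested_make_jobs(12, smt_enabled=False))  # make -j13
```

Treat the result as a starting point to measure against, not as a guarantee that it is the fastest setting on a given machine.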
Power management is a factor because the cores "steal" power from each other. However the E-cores are more efficient so slowing down P-cores and giving some of their power to the E-cores increases the efficiency and performance of the chip overall. In general you're better off using all the cores.
I suggest this depends on the exact model you are using. On Alder/Raptor Lake, the E-cores run at 4.5GHz which is completely futile, and in doing so they heat their adjacent P-cores, because 2x E-core clusters can easily draw 135W or more. This can significantly cut into the headroom of the nearby P. Arrow Lake-S has rearranged the E-cores.
Did you test at least +1 if not *1.5 or something? I would expect you to occasionally get blocked on disk I/O and would want some spare work sitting hot to switch in.
Let me test that now. Note I only have 1 Intel machine so any results are very specific to this laptop.
Machine: 13th Gen Intel(R) Core(TM) i7-1365U; 2 x P-cores (4 threads), 8 x E-cores
Your processor has two P cores, and ten cores total, not twelve. The HyperThreading (SMT) does not make the two P cores into four cores. Your experiment with 4 threads will most likely result in using both P cores and two E cores, as no sane OS would double up threads on the P cores before the E cores were full with one thread each.
I am sure rwmj was smart enough to use `taskset` to make this experiment meaningful.
Hehe, if only :-( However I do want to know what's best with the default Linux scheduler and just using 'make' rather than more complicated commands.
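For anyone who wants to repeat this kind of measurement with plain 'make' and the default scheduler, a rough Python sketch of a timing sweep follows. The helper names are hypothetical, and it assumes the project has a working `make clean` target so that every run starts from the same state.

```python
import subprocess
import time


def time_build(jobs, make_cmd=("make",), clean_cmd=("make", "clean")):
    """Time one clean build with `make -j<jobs>` and return seconds elapsed.

    Assumes it is run from the project directory and that `make clean`
    resets the tree; adjust both commands for the project at hand.
    """
    subprocess.run(list(clean_cmd), check=True)
    start = time.perf_counter()
    subprocess.run([*make_cmd, f"-j{jobs}"], check=True)
    return time.perf_counter() - start


def sweep(job_counts, timer=time_build):
    """Run the timer once per job count and return {jobs: seconds}."""
    return {n: timer(n) for n in job_counts}


def fastest(timings):
    """Return the job count with the lowest measured time."""
    return min(timings, key=timings.get)
```

On the 2P (4 threads) + 8E laptop above, `fastest(sweep([4, 10, 12, 13, 18]))` would compare the thread count, the core count, the hardware thread count, +1 and *1.5. A single run per setting is noisy, so repeating the sweep a few times is advisable before drawing conclusions.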
The hyperthreading should cover up memory latency, since the workload (compiling qemu) might not fit into L3 cache. Although I take your point that it doesn't magically create two core-equivalents.
“Hyperthreading” is a write pipe hack.
If the core stalls on a write then the other thread gets run.
It's much more than that. It also allows one thread to make progress while the other is waiting for memory loads, or filling in instruction slots while the other thread is recovering from a branch mispredict.
Compilers tend to do a lot of pointer chasing and branching, so it's expected that they would benefit decently from hyperthreading.
Sounds like a nice core, but Intel's gone from "fries itself" broken in Raptor Lake to merely a broken cache/memory architecture in Lunar Lake.
None of the integer improvements make it onto the screen, as it were.
edit: I got Lunar and Arrow Lake's issues mixed up, but still there should be a lot more integer uplift on paper.
What do you consider broken about Lunar Lake? It looks good to me. The E-cores are on a separate island to allow the ring to power off and that does lead to lower performance but it's still good enough IMO.
I still don’t quite get how the cpu knows what is low priority or background. Or is that steered at OS level a bit like cpu pinning ?
When the P/E model was introduced by Intel, there was a fairly long transition period where both Windows and Linux performed unpredictably poorly for most compute-intensive work loads, to the point where the advice was to disable the E cores entirely if you were gaming or doing anything remotely CPU-intensive or if your OS was never going to be updated (Win 7/8, many LTS Linux).
It's not entirely clear to me why it took a while to add support on Linux because the kernel already supported big.LITTLE and from the scheduler's point of view it's the same thing as Intel's P/E cores. I guess the patch must've been simple but it just took very long to trickle down to common distributions?
Not very surprising, but IME when running VMs you still want to pin (at least on Linux).
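As a sketch of what pinning can look like from user space on Linux, the Python below parses a kernel-style CPU list and restricts the current process to it. The sysfs paths in the usage note are an assumption about hybrid Intel systems on recent kernels, and the helper names are made up; `os.sched_setaffinity` is a real but Linux-only call.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel CPU list such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_cpus(path):
    """Read a CPU list file; return an empty set if the file is absent."""
    p = Path(path)
    return parse_cpulist(p.read_text()) if p.exists() else set()


def pin_current_process(cpus):
    """Restrict the calling process (pid 0 means "self") to the given CPUs."""
    os.sched_setaffinity(0, cpus)
```

Assuming the kernel exposes `/sys/devices/cpu_core/cpus` and `/sys/devices/cpu_atom/cpus`, something like `pin_current_process(read_cpus("/sys/devices/cpu_core/cpus"))` would keep a process on the P-cores. Check that the returned set is non-empty first, since the files will not exist on non-hybrid machines.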
https://en.m.wikipedia.org/wiki/Scheduling_(computing)
Right so how does the scheduler know what’s low priority?
I thought this was actually a good question, so I have tried to look it up. If I understand correctly the CPU tells the scheduler! I could not find exactly how though, maybe an MSR?
(2024)