I think this is an important step, but it skips over the fact that a 'fault tolerant routing architecture' means you're spending die space on routing instead of transistors. This is exactly analogous to spending bits in your storage on error correction instead of data (a rough sketch of that overhead follows after the footnotes).
That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage, it benefits from every core being identical and from not needing to reach every core directly (pin limiting).
In the early 2000s I was looking at a wafer-scale startup that had the same idea, but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch: "we don't have to have all of our GLUs[2] work because the built-in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs, so they were pin limited overall, but the idea is sound.
While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.
[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.
[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.
Of course many people are going to collectively lose trillions. AI is a highly hyped industry, with people racing into it without an intellectual edge, and any temporary achievement by one company will be quickly replicated and undercut by another using the same tools. Economic success for the individuals swarming on a new technology is not guaranteed, nor is it an indicator of the technology's impact.
Just like the dotcom bubble, AI is gonna hit, make a few companies stinking rich, and bankrupt the vast majority of companies (both AI-chasing and legacy). And it's gonna rewire the way everything else operates too.
> Xilinx was still aggressively suing people who put SERDES ports on FPGAs
This so isn't important to your overall point, but where would I begin to look into this? Sounds fascinating!
Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me a ~6% yield as most trials had active core counts in the high 800,000s
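For what it's worth, a quick Monte Carlo of that reading reproduces roughly what you describe. Every parameter here is an assumption on my part: a square ~985x985 core grid (~970k cores), a Poisson-distributed defect count with mean 46 (my guess at what the 46/970000 figure means), and the rule that a defect kills its entire row and column:

```python
import numpy as np

rng = np.random.default_rng(0)

# All parameters are assumptions for illustration, not Cerebras data.
ROWS = COLS = 985        # ~970k cores in a square grid
MEAN_DEFECTS = 46        # assumed mean defects per wafer (Poisson-distributed)
TARGET = 900_000         # active cores needed to call the wafer good
TRIALS = 10_000

def active_cores():
    """Scatter defects uniformly over the grid, disable the whole row and
    column containing each one, and return the surviving core count."""
    n = rng.poisson(MEAN_DEFECTS)
    dead_rows = np.unique(rng.integers(0, ROWS, n))
    dead_cols = np.unique(rng.integers(0, COLS, n))
    return (ROWS - dead_rows.size) * (COLS - dead_cols.size)

counts = np.array([active_cores() for _ in range(TRIALS)])
print(f"median active cores: {np.median(counts):,.0f}")
print(f"wafers reaching {TARGET:,} active cores: {np.mean(counts >= TARGET):.1%}")
```

Under those assumptions the median lands in the high 800,000s and only a small minority of trials clear 900,000, broadly consistent with the low yield you saw, so either the row/column-kill assumption is too pessimistic or they hold cores back for other reasons (spares, routing, binning).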
"While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage"
Can you please explain more about why you think so?
Thank you.
So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, that is far outweighed by the fact that they only get 46,222 sq mm of usable area out of the wafer, versus 56,247 for the H100: because they use a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm (quick arithmetic below).
Not sure how that's a win.
Unless the rest of the wafer is useable for some other customer?
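To make the trade explicit with the figures quoted above (the usable-area and defect-loss numbers are from the blog/comment; the raw-wafer area is my own back-of-envelope for a 300 mm wafer):

```python
import math

# Usable-area and defect-loss figures as quoted above (mm^2);
# the raw wafer area is a back-of-envelope number for a 300 mm wafer.
wafer_area      = math.pi * 150**2   # ~70,686 mm^2 of raw silicon
h100_usable     = 56_247             # total good H100 die area per wafer
cerebras_usable = 46_222             # one big square wafer-scale die
defect_loss_h100, defect_loss_wse = 361, 2.2

print(f"packing penalty of one big square die: {h100_usable - cerebras_usable:,} mm^2")
print(f"defect-loss saving: {defect_loss_h100 - defect_loss_wse:.1f} mm^2")
print(f"fraction of raw wafer used: H100 {h100_usable / wafer_area:.0%}, WSE {cerebras_usable / wafer_area:.0%}")
```

So on raw silicon utilization alone the single square die gives back far more than the defect savings; presumably any win has to come from what the integration buys (on-wafer interconnect instead of packages, boards, and network hops), as footnote [1] above suggests.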
Neat. What about power density?
An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2, that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that implies a Cerebras TDP of 39.8 kW.
That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?
amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams
energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ
time = 154 kJ / 39.8 kW = 3.9 seconds
This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
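Putting the same arithmetic in one place (the only assumption, as above, is that the wafer-scale part matches the H100 SXM5's power density; real figures may well differ):

```python
# Assumption from the comment above: the wafer-scale chip runs at the
# H100 SXM5's power density. Actual WSE power figures may differ.
h100_tdp_w    = 700
h100_area_mm2 = 814
wse_area_mm2  = 46_225

power_density = h100_tdp_w / h100_area_mm2      # ~0.86 W/mm^2
wse_power_w   = power_density * wse_area_mm2    # ~39.8 kW

water_g  = (wse_area_mm2 / 100) * 1.0 * 1.0     # area in cm^2 * 1 cm deep * 1 g/cm^3
energy_j = 4.186 * (100 - 20) * water_g         # specific heat of water, J/(g*K)
seconds  = energy_j / wse_power_w

print(f"power density:      {power_density:.2f} W/mm^2")
print(f"implied chip power: {wse_power_w / 1000:.1f} kW")
print(f"time to bring {water_g:.0f} g of water to 100 C: {seconds:.1f} s")
```

Note this is only the time to bring the water up to 100 C; actually boiling it away would take several times longer because of the latent heat of vaporization (~2,260 J/g), but the point about the heat flux stands.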
I live in a small city/large town that has a large number of craft breweries. I always marveled at how these small operations were able to churn out so many different varieties. Turns out they are actually trying to make their few core recipes but the yield is so low they market the less consistent results as...all that variety I was so impressed with.
To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant.
Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.
> Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.
That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.
I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?
Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?
TSMC also has a manufacturing process, used for Tesla's Dojo, where you can cut up the chips, throw away the defective ones, and then reassemble the working ones into a sort of wafer-scale device (5x5 chips for Dojo). Seems like a more logical design to me.
I assume people are aware, but Cerebras has a web demo and API which are open to try; it runs at about 2,000 tokens per second for Llama 3.3 70B and 1,000 tokens per second for Llama 3.1 405B.
https://cerebras.ai/inference
Understanding that there's inherent bias since they compete with the other companies, this article still seems to make some stretches. If you told me you had an 8% core defect rate and reduced it 100x, I'd assume you got to close to 99% enablement. The table at the end shows... otherwise.
They also keep flipping between cores, SMs, dies, and maybe other block sizes. At the end of the day I'm not very impressed. They seemingly have marginally better yields despite all that effort.
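The arithmetic behind that expectation, using the figures as quoted (the 8% and "100x" numbers come from the comment above, not from me):

```python
# Figures as quoted in the parent comment: an 8% core defect rate, reduced 100x.
baseline_defect_rate = 0.08
improved_rate = baseline_defect_rate / 100      # 0.08% of cores defective
print(f"naively expected enablement: {1 - improved_rate:.2%}")  # ~99.92%
```

If the shipped enablement figure is meaningfully below that, the gap presumably comes from cores held back deliberately (spares, routing, binning headroom) rather than from defects alone.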
I have a dumb question. Why isn't silicon sold in cubes instead of cylinders?
Ever heard the old joke about an American buyer who told a Japanese manufacturer how many incorrectly made bolts were acceptable per lot of a thousand? Maybe 2 or 3 in 1,000?
The Japanese process didn't produce any incorrectly made bolts, so they just added two or three bad ones to every batch to please the Americans.
Very interesting. Am I correct in saying that fault tolerance here is with respect to 'static' errors that occur during manufacturing and are straightforward to detect before reaching the customer? Or can these failures potentially occur later on (and be tolerated) during the normal life of the chip?
How do these much smaller cores compare in computing power to the bigger ones? They seem to implicitly claim that a core is a core is a core, but surely one gets something extra out of the much bigger one?
56K mm2 vs 46K mm2. I wonder why they wouldn't use the smart routing/etc. to use a more fitting shape than a square and thus use more of the wafer.
When I was a kid, I used to get intel keychains with a die in acrylic - good job to whoever thought of that to sell the fully defective chips.
Bear case on Cerebras: https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...
Note: This author is heavily invested in Nvidia.
IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to laser out bad cores.
Looking at the H100 on the left, why is the chip yield (72) based on a circular layout/constraint? Why do they discard all of the other chips that fall outside the circle?
It's not surprising that they found a solution to the yield problem. Maybe they could elaborate more on the power distribution and dissipation problem?
I would like a workstation with 900k cores. lmk when these things are on ebay.
My biggest question is who are the buyers?
Does anyone have pictures of what it looks like inside these servers?
A well written, easy to understand article.
What's yield?