I've often thought that one of the places AMD could distinguish itself from NVIDIA is bringing significantly higher amounts of VRAM (or memory systems that are as performant as what we currently know as VRAM) to the consumer space.
A card with a fraction of the FLOPS of cutting-edge graphics cards (and ideally proportionally less power consumption), but with 64-128GB VRAM-equivalent, would be a game-changer for letting people experiment with large multi-modal models, and seriously incentivize researchers to build the next generation of tensor abstraction libraries for both CUDA and ROCm/HIP. And for gaming, you could break new ground on high-resolution textures. AMD would be back in the game.
Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front, so let's pop on over and see what's happening in this article...
> An Infinity Cache hit has a load-to-use latency of over 140 ns. Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing Infinity Cache of course drives latency up even higher, to a staggering 227 ns. HBM stands for High Bandwidth Memory, not low latency memory, and it shows.
> Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front
It is close to VRAM*, just not close to DRAM on a conventionally designed CPU. This thing is effectively just a GPU that fits in a CPU slot and has CPU cores bolted to the side. This approach has the downside of worse CPU performance and the upsides of orders of magnitude faster CPU<->GPU communication, simpler programming since coherency is handled for you, and access to substantial amounts of high bandwidth memory (up to 512GB with 4 MI300As).
* https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4...
Assuming we are comparing chips that use the latest-generation, high-density memory modules, a wider bus width is required for larger memory counts, which is expensive when it comes to silicon area. So if AMD is willing to boost memory capacity as a competitive advantage, they may as well consider spending that die space on more logic gates instead. It's a set of trade-offs and an optimization problem to some degree.
That said, when an incumbent has a leadership advantage, one of the obvious ways to boost profit is to slash the memory bus width, and then a competitor can come in, bring it back up a bit, and have a competitive offering. The industry has certainly seen this pattern many times. But AMD coming in and using gigantic memory counts as a competitive advantage? You have to keep the die space constraints in mind.
Well over a decade ago - I think it was R600 - AMD did take this approach, and it was fairly disastrous because the logic performance of the chip wasn't good enough while the die was too big and hot and yields were too low. They didn't strike the right balance and sacrificed too much for a 512-bit memory bus.
AMD has also tried to sidestep some of these limitations with HBM back when it was new technology, but that didn't work out for them either. They actually would have been better off just increasing bus width and continuing to use the most optimized and cost efficient commodity memory chips in that case.
Data center and such may have a bit more freedom for innovation but the consumer space is definitely stuck on the paradigm of GPU plus nearby mem chips, and going outside of that fence is a huge latency hit.
> a wider bus width is required for larger memory counts
Most video cards wire up 32 data pins to each memory chip. But GDDR chips already have full support for running 16 pins to each chip. And DDR commonly goes down to 4 data pins per chip.
The latest GDDR7 chips are 24Gbit, and at 16 bits each you could fit 48GB onto a nice easy 256-bit bus, with a speed of at least 1TB/s. If you go to 384 bits and/or run 8 bits to each chip, you can cram in so many chips that the limit becomes physically fitting them all around the GPU.
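To make that arithmetic concrete, here's a rough sketch of the capacity/bandwidth math (the 32 Gbps/pin rate and the example configurations are illustrative assumptions, not a specific product):

    # Capacity and peak bandwidth for a hypothetical GDDR-style setup.
    def gddr_config(bus_width_bits, bits_per_chip, chip_density_gbit, pin_rate_gbps=32.0):
        n_chips = bus_width_bits // bits_per_chip
        capacity_gb = n_chips * chip_density_gbit / 8        # gigabits -> gigabytes
        bandwidth_gbs = bus_width_bits * pin_rate_gbps / 8   # Gbit/s -> GB/s
        return n_chips, capacity_gb, bandwidth_gbs

    print(gddr_config(256, 16, 24))  # (16 chips, 48.0 GB, 1024.0 GB/s)
    print(gddr_config(384, 8, 24))   # (48 chips, 144.0 GB, 1536.0 GB/s)

Note that total bandwidth depends only on bus width and pin rate, not on how many chips share it: running fewer data pins to each chip buys capacity, not speed.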
> a wider bus width is required for larger memory counts, which is expensive when it comes to silicon area
I find this constraint to be rather odd. An extra, say, three address bits would add very little space (or latency in a serial protocol) to a memory bus, and the actual problem seems to be that the current generation of memory chips are intended for point-to-point connection.
It seems to me that, if the memory vendors aren’t building physically larger, higher capacity chips, then any of the major players (AMD, Nvidia, Intel, whoever else is in this field right now) could kludge around it with a multiplexer. A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.
So my assumption is this is mostly an economic issue. The vendors don’t think it’s worthwhile to do this.
GDDR has been point-to-point since... I dunno, probably 2000? Because, all else being equal, you can't really have an actual bus when you chase maximum bandwidth. Even the double-sided layouts (like T-layout, with <2mm stubs) typically incur a reduction in data rate. These chips also dissipate a fair amount of heat - you're looking at around 5-8 W per chip (~6 pJ/bit) - so it's not like you can just stack a bunch of those dies.
> A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.
I don't know what you're basing that on. We're talking about 32 Gbps serdes here. Yes, there's multiplexers even for that. But what good is deciding which memory chip you want to use on boot-up?
Not multiplexed on boot — multiplexed at run time. Build a chip that speaks the GDDR protocol to the host GPU and has 2-4 GDDR channels coming out the other end, aggregating the attached memory at the cost of some latency, some power, and an extra chip. As far as the GPU is concerned, it’s an extra-large GDDR chip, and it would allow a GPU vendor to squeeze in more RAM without adding more pins to the GPU or needing to route more memory channels directly to it.
(Compare to something like Apple’s designs or “Project Digits”. Current- and next-gen GPUs have considerably higher memory bandwidth but considerably less memory capacity. Mostly my point is that I think Nvidia or AMD could make a desktop-style GPU with 2-4x the RAM, somewhat worse latency, but otherwise equivalent performance without needing Samsung or another vendor to build higher capacity GDDR chips than currently exist.)
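As a purely illustrative toy model of what such an aggregator would do (real GDDR signalling, training, and timing are far more involved than this), the routing logic is just: high address bits pick the downstream chip, low bits pick the location inside it:

    # Toy model: several small memories presented as one larger one.
    class AggregatedMemory:
        def __init__(self, n_chips, chip_capacity):
            self.chip_capacity = chip_capacity
            self.chips = [bytearray(chip_capacity) for _ in range(n_chips)]

        def _route(self, addr):
            # High bits select the downstream chip, low bits the local address.
            return self.chips[addr // self.chip_capacity], addr % self.chip_capacity

        def read(self, addr):
            chip, local = self._route(addr)
            return chip[local]

        def write(self, addr, value):
            chip, local = self._route(addr)
            chip[local] = value

    # Four tiny "chips" look like one device with 4x the capacity.
    mem = AggregatedMemory(n_chips=4, chip_capacity=1024)
    mem.write(2500, 0xAB)          # lands on the third chip, offset 452
    print(hex(mem.read(2500)))     # 0xab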
The bus widths they are talking about are multiples of 128. I think Apple M-series chips are good examples: they go from 128 to 256 to 512 bits, which happens to be roughly the bandwidth in gigabytes per second.
They are. Strix Halo is going after that same space of Apple M4 Pro/Max where it is currently unchallenged. Pairing it with two 64GB LPCAMM2 modules will get you there.
Edit: The problem with AMD is less the hardware offerings and more that their compute software stack has historically hand-waved consumer GPU support or been very slow to deliver it — even more so for their APUs. Maybe the advent of MI300A will change the equation, maybe not.
I don't know of any non-soldered memory Strix Halo devices, but both HP and Asus have announced 128GB SKUs (availability unknown).
For LLM inference, basically everything works w/ ROCm on RDNA3 now (well, Flash Attention is via Triton and doesn't have support for SWA and some other stuff; also I mostly test on Linux, although I did check that the new WSL2 support works). I've tested some older APUs w/ basic benchmarking as well. Notes here for those interested: https://llm-tracker.info/howto/AMD-GPUs
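If you just want a quick smoke test that a ROCm build of PyTorch actually sees your RDNA3 card or APU before diving into a full inference stack, something like this is usually enough (on ROCm builds, HIP devices show up through the torch.cuda namespace; exact version strings will vary per install):

    import torch

    print("HIP runtime:", torch.version.hip)        # None on CUDA-only builds
    print("device available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        dev = torch.device("cuda")                  # maps to the ROCm/HIP device
        x = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
        y = x @ x                                   # exercise the GPU matmul path
        torch.cuda.synchronize()
        print("ok:", tuple(y.shape), torch.cuda.get_device_name(0))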
Thanks for that link. I'm interested in either getting the HP Mini Z1 G1a or an NVidia Digits for LLM experimentation. The obvious advantage for the Digits is the CUDA ecosystem is much more tried & true for that kind of thing. But the disadvantage is trying to use it as a replacement for my current PC as well as the fact that it's going to run an already old version of Ubuntu (22.04) and you're dependent on Nvidia for updates.
Yeah, I think anyone w/ old Jetsons knows what it's like to be left high and dry by Nvidia's embedded software support. Older models are basically just ewaste. Since the Digits won't be out until May, I guess there's enough time to wait and see - at least to get a sense of what the actual specs are. I have a feeling the FP16 TFLOPS and the MBW are going to be much lower than what people have been hyping themselves up for.
Sadly, my feeling is that the big Strix Halo SKUs (which have no scheduled release dates) aren't going to be competitively priced (they're likely to be at a big FLOPS/real-world performance disadvantage, and there's still the PITA factor), but there is something appealing about the do-it-all aspect of it.
DIGITS looks like a serious attempt, but they don’t have too much of an incentive to have people developing for older hardware. I wouldn’t expect them to support it for more than five years. At least the underlying Ubuntu will last longer than that and provide a viable work environment long after the hardware itself gets boring.
Getting their kernel mods upstreamed is very unlikely, but they might provide just enough you can build a new kernel with the same major version number.
Who said anything about Ubuntu 22.04? I mean sure that's the newest release current jetpack comes with, but I'd be surprised if they shipped digits with that.
> Pairing it with two 64GB LPCAMM2 modules will get you there.
It gets you closer for sure. But while ~250GB/s is a whole lot better than SO-DIMMs at ~100GB/s, the new mid-tier GPUs are probably more like 640-900GB/s.
I wholeheartedly agree. Nvidia is intentionally suppressing the amount of memory on their consumer GPUs to prevent data centers from using consumer cards rather than their far more expensive counterparts. The fact that they used to offer the 3060 with 12GB, but have now pushed pricing higher and limited many cards to 8GB, is a testament to that. I don’t need giga-TOPS with 8-16GB of memory; I’d be perfectly happy with half that speed but 64GB of memory or more. Even slower memory would be fine. I don’t need 1000t/s, but being able to load a reasonably intelligent model even at 50t/s would be great.
Getting to 50 tok/s for a big model requires not just memory, but also memory bandwidth. Currently, 1TB/s of MBW will get a 70B Q4 (~40GB) model to about 20-25 tok/s. The good thing is models continue to get smarter - today's 20-30B models beat out last year's 70B models on most tasks, and the biggest open models like DeepSeek-V3 may have lots of weights but activate a relatively reasonable number of parameters per pass.
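The back-of-envelope version of that, if you want to plug in your own hardware numbers (the efficiency factor is an assumption; real systems land somewhere below the theoretical peak):

    # Decode speed: each generated token streams the (active) weights once,
    # so tok/s is roughly bandwidth / model size in memory.
    def decode_tok_per_s(model_gb, mbw_gb_s, efficiency=0.85):
        return efficiency * mbw_gb_s / model_gb

    print(decode_tok_per_s(40, 1000))  # 70B @ Q4 (~40 GB) at 1 TB/s    -> ~21 tok/s
    print(decode_tok_per_s(40, 546))   # same model at M4 Max-class MBW -> ~12 tok/s
    print(decode_tok_per_s(40, 250))   # same model at 250 GB/s         -> ~5 tok/s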
You can test the "half the speed but 64GB or more of memory" scenario with the latest Macs, AMD Strix Halo, or the upcoming Nvidia Digits, though. I suspect by the middle of the year there will be a bunch of options in the ~$3K range. Personally, I think I'd rather go for 2 x 5090s for 64GB of memory at 1.7TB/s than 96 or 128GB w/ only 250GB/s of MBW.
A Mac with that memory will have closer to 500GB/s but your point still stands.
That said, if you just want to play around, having more memory will let you do more interesting things. I’d rather have that option over speed since I won’t be doing production inference serving on my laptop.
Yeah, the M4 Max actually has pretty decent MBW - 546 GB/s (cheapest config is $4.7K on a 14" MBP atm, but maybe there will be a Mac Studio at some point). The big weakness for the Mac is actually the lack of TFLOPS on the GPU - the beefiest maxes out at ~34 FP16 TFLOPS. It makes a lot of use cases super painful, since prefill/prompt processing can take minutes before token generation starts.
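Rough prefill math for anyone curious why it's minutes rather than seconds (dense-model approximation of ~2 FLOPs per parameter per token; the 50% utilization figure is an assumption for illustration):

    # Prompt processing is compute bound: ~2 * params * prompt_tokens FLOPs.
    def prefill_seconds(params_b, prompt_tokens, tflops, utilization=0.5):
        flops = 2 * params_b * 1e9 * prompt_tokens
        return flops / (tflops * 1e12 * utilization)

    print(prefill_seconds(70, 8_000, 34))    # 70B, 8K prompt, ~34 TFLOPS   -> ~66 s
    print(prefill_seconds(70, 32_000, 34))   # 32K prompt                   -> ~4.4 min
    print(prefill_seconds(70, 32_000, 200))  # same prompt, ~200 TFLOPS GPU -> ~45 s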
128G of unified memory. $3K. Throw ollama and ComfyUI on that sucker and things could get interesting. The question is how much slower than a 5090 this is gonna be. The memory bandwidth isn't going to match a 512-bit bus.
That's actually a good thing. That's how you get a ton of DRAM without it costing a fortune. The M2 Ultra is able to get GPU-like 800GB/sec with LPDDR5. From that it follows that a specialized chip could get a respectable 1 TB/sec quite easily with LPDDR5, provided you're willing to support a ton of memory channels (i.e., a very wide memory bus). In fact, I'm baffled that such devices don't already exist outside Apple's product line. Seems like a rather obvious thing to do, and Apple has a "proof of concept" already. I can think of at least four companies off the top of my head that could do it quite easily, besides Apple.
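The bus-width arithmetic behind that (the data rates are nominal LPDDR5/LPDDR5X figures; everything else is a simplifying assumption):

    # How many LPDDR data pins you need for a target bandwidth.
    def lpddr_pins_needed(target_gb_s, mt_per_s_per_pin):
        bits_per_s_needed = target_gb_s * 8e9
        bits_per_s_per_pin = mt_per_s_per_pin * 1e6
        return bits_per_s_needed / bits_per_s_per_pin

    print(lpddr_pins_needed(800, 6400))    # ~1000 -> a 1024-bit LPDDR5-6400 bus (M2 Ultra class)
    print(lpddr_pins_needed(1000, 8533))   # ~938  -> a 1024-bit LPDDR5X-8533 bus clears 1 TB/s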
The market for prosumer cards with high VRAM and low FLOPS would be negligibly small. The data center market is massive on one end and the gaming market is big on the other. Casual consumers who just want a lot of VRAM are such a small minority of people that it doesn’t matter to the bottom line.
It also wouldn’t be financially advantageous to divert RAM chips away from data center production. We don’t have a surplus of chips waiting to be installed, so building out high VRAM but affordable cards would only take away from higher margin products in the datacenter space.
> The market for prosumer cards with high VRAM and low FLOPS would be negligibly small. The data center market is massive on one end and the gaming market is big on the other. Casual consumers who just want a lot of VRAM are such a small minority of people that it doesn’t matter to the bottom line.
I'm sure this is also what AMD is thinking, and it's also why they will never catch up to NVidia in ecosystem and software support.
It's not for the casual consumers, and it's not supposed to make money directly! You want these high VRAM SKUs to attract enthusiasts and researchers. I have read a staggering number of research papers where the authors used some random consumer NVidia GPU. Do you know how many I've read which used AMD GPUs? Big fat ZERO! You want to incentivize these people to use your hardware? You want to get devs to support your platform? Give them a unique value proposition that the competition won't.
I'm currently waiting for the 5090 to be available, and I'm going to buy two of them. If AMD had released a GPU at a fair price, with reasonable performance and double the VRAM that NVidia offers, do you know what I would do? I would buy two AMD cards instead, port my software to them, and contribute PRs to any upstream software that I use so that it works with these cards. But alas, here we are.
> You want these high VRAM SKUs to attract enthusiasts and researchers. I have read a staggering number of research papers where the authors used some random consumer NVidia GPU. Do you know how many I've read which used AMD GPUs? Big fat ZERO!
I'm just sitting here wondering how you think this affects anything? Enterprise doesn't buy DC cards based on research papers so why does it matter if research papers are or aren't written against one brand or the other.
However, that target audience - hobby enthusiasts, hobby developers, and university labs with low budgets - are the people who will develop the future open source frameworks, and ultimately they are the people who can have a big impact on brand recognition and the open source ecosystem around the hardware. Those people shape future trends.
So looking only at the market, at how many units you would sell here, totally ignores the indirect impact this might have in the future.
> However, that target audience - hobby enthusiasts, hobby developers, and university labs with low budgets - are the people who will develop the future open source frameworks,
No they're not. Y'all are deluded. There's a reason why there are only two real DNN frameworks and both of them are developed at the two biggest tech companies in the world.
Actually there's a lot of demand in the AI data center space for such a card, such as for running large mixture of experts (MoE) models -- e.g. DeepSeek v3, which is one of the best LLMs in the world today.
Although AMD would need to greatly improve their entire software stack to make running AI models on AMD an attractive proposition.
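Rough numbers for why MoE is exactly the "lots of memory, modest bandwidth" case: you have to hold every expert, but each token only reads the active ones. The parameter counts below are DeepSeek-V3's published figures; FP8 weights and the efficiency factor are simplifying assumptions:

    # Memory footprint vs. per-token bandwidth for a mixture-of-experts model.
    def moe_footprint(total_params_b, active_params_b, bytes_per_param=1,
                      mbw_gb_s=1000, efficiency=0.85):
        capacity_gb = total_params_b * bytes_per_param     # must all fit in memory
        per_token_gb = active_params_b * bytes_per_param   # streamed per token
        return capacity_gb, efficiency * mbw_gb_s / per_token_gb

    cap, tps = moe_footprint(671, 37)   # DeepSeek-V3: 671B total, 37B active
    print(cap, "GB to hold,", round(tps), "tok/s at 1 TB/s")   # ~671 GB, ~23 tok/s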
The problem with only providing VRAM is that some AI workloads, like real-time audio processing, underperform significantly because the card does not have the equivalent of tensor cores to keep up. There are LLMs that won't run for the same reason. You will have more than enough VRAM but not enough tensor cores. AMD isn't able to compete.
If, by the grace of tech Jesus, amd gave us such systems at volumes Nvidia would notice, Nvidia would simply then do the same but with a better ecosystem.
The biggest problem for AMD is not that the majority of people want to use AMD. It is that the majority of people want AMD to be more competitive so that Nvidia will be forced to drop prices so that people can afford Nvidia products.
Until this pattern changes, AMD has a big uphill battle. Same for Intel, except Intel is at least seemingly doing great gen/gen improvements in mid/low range consumer GPUs and bringing healthy vram along for the ride.
The difference between AMD and Intel when Zen launched is that AMD launched a product that utterly destroyed Intel’s lineup in productivity workloads. Zen 1 launched with double the cores of the competing Intel chip at the same price point. The benchmarks were a bloodbath, and Intel struggled to respond with a competitive product for 4 years. Arguably they still haven’t caught up. AMD just brutally out-executed Intel.
Core-wise, Intel had the advantage until the last generation or two. The same can be true for gpus, just add a ton more memory and watch them fly off the shelves.
Intel can match or outperform Zen 5 in many benchmarks (X3D still trashes them in games) and they are trading blows now; they just have to use double the power envelope to do it.
Arc and Battlemage are not very competitive designs compared with AMD, going by die size and transistor count relative to the performance numbers they're posting. Battlemage pricing, however, is quite good on price-to-performance, but it again suffers on efficiency, where AMD has them beat by quite a margin.
> The same can be true for gpus, just add a ton more memory and watch them fly off the shelves.
Yeah... for datacenters and people attempting to jump on the AI hype train. Meanwhile your everyday regular gamer has zero chance competing for GPUs with the infinite money coffers from AI.
Seriously, the sooner this crazy bubble bursts the better. I thought the shitcoin mining days were bad but at least everyone back then knew the game for GPUs was over once the first Bitcoin ASIC was released, but now? No end in sight and frankly I'm pissed.
> If, by the grace of tech Jesus, amd gave us such systems at volumes Nvidia would notice, Nvidia would simply then do the same but with a better ecosystem.
Not if they have "a better ecosystem" -- they would continue to charge a premium for that.
Which creates a dilemma for Nvidia. If they matched AMD's pricing, they'd be losing all the money they could get by charging more, which is a ton. Whereas if they charge more, they get more today from the people who pay the premium, but some people are more price sensitive than others, so there would still be a lot of people who would buy "lots of VRAM for less money" from AMD. And soon AMD has a lot of users, improves their software support, and the difference disappears entirely.
Forcing the larger competitor into that dilemma is very much to the advantage of the smaller competitor.
For traditional LLMs this might be true (especially large MoEs at bs=1), but I highly disagree with the "multi-modal models" phrase, since most of the models that output in other modalities are generally compute bound. That means fewer FLOPS will make the experience much worse (imagine waiting a couple of minutes for an image and hours for videos).
So the 300A is an accelerator coupled with a full 24-core EPYC and 128GB of HBM all on a single chip (or, packaged chiplets, whatever).
Why is it I can't buy a single one of these, on a motherboard, in a workstation format case, to use as an insane workstation? Assuming you could program for the accelerator part, there is an entire world of x86-fixed CAD, engineering, and entertainment industry (rendering, etc.) where people want a single desktop machine with 128GB+ of fast RAM to number crunch.
There are Blender artists out there that build dual and quad RTX4090 machines with Threadrippers for $20k+ in components all day, because their render jobs pay for it.
There are engineering companies that would not bat an eye at dropping $30k on a workstation if it meant they could spin around 80-gigabyte CATIA models of cars or aircraft loaded in RAM quicker. I know this at least because I sure as hell did, with several HP Z-series machines costing whole-Toyota-Corolla prices over the years...
But these combined APU chips are relegated to these server units. In the end is this a driver problem? Just a software problem? A chicken and egg problem where no one is developing the support because there isn't the hardware on the market, and there isn't the hardware on the market because AMD thinks there is no use case?
Edit: and note my use cases mentioned don't really rely on latency the way videogamers need to hit framerates. The cache miss latency mentioned in the article doesn't matter as much for these types of compute applications, where the main problem is just loading and unloading the massive amount of data. Things like offline renders and post-processing CFD simulations. Not necessarily a video output framerate.
(I run a company that buys MI300x.)
> Why is it I can't buy a single one of these, on a motherboard, in a workstation format case, to use as an insane workstation?
AMD doesn't have the resources to support end users for something like this. They are a public company, look at their spend. They are pouring everything they've got into trying to keep up with the Nvidia release cycle for AI chips.
These chips are cutting edge, they are not perfect. They are still working through the hardware and software issues. It is hard enough to deal with all the public opinion on things as it is. Why would they add another layer of potential abuse?
The people who buy stuff like that are professionals. They often know something about the tools they're using and if there are any problems, provide bug reports that actually describe what's happening instead of some non-descriptive mush like "I have your GPU and Windows crashes sometimes". That is extremely helpful if you're trying to get rid of those bugs.
This is the same reason software shops have found it useful to support Linux, even if not many people use it. The people who do will make your product suck less, which in turn makes it easier to sell to the mass market, which will get upset and think unfavorably of you if they have the same problem but not be as good at telling you about it.
Our users give them plenty of feedback. They just RMA'd a whole bunch of our GPUs over this issue so that they could take them back to the mothership and figure out what's up...
https://github.com/ROCm/ROCm/issues/4021
It takes a lot of coordination across ourselves (with customers), our DC, AMD, and Dell to make that happen.
It's not that you don't get bug reports from data center customers, it's that data center customers have scale in a bad way. They buy thousands of GPUs, they do whatever they're going to do with them, they have a problem, they report the bug. One bug report across thousands of GPUs, because they're all being used for the same thing by that customer so they only have the problems you have when you try to do that. Another data center buys thousands of GPUs and they're doing something else which is extremely common and well supported, so they don't have any issues and you get zero bug reports from them.
Compare this to, you sell a thousand GPUs to a thousand professionals and 10% of them have some problem, but each a different one. You get 100 bug reports, you fix 100 bugs instead of just one, things improve much faster.
We have 136 of these things. Not thousands. AMD is intentionally keeping their number of providers limited [0](bottom of page).
No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
If you're now also suggesting that AMD release another product that is easier for developers to get their hands on and deploy, then you've totally lost me. You're trying to exponentially increase the amount of work and money they spend - for what? Some feedback?
I think you underestimate the people here when you throw around things like "it costs as much as an expensive Ferrari". A lot of us work with systems like these, so we understand why they cost so much and what they can do. On Reddit this works; here, I feel it's pretty condescending.
"Intentionally limiting" is just koolaid. It's ok to drink it, it's your business, but it's koolaid. You think if AWS wanted to deploy a couple hundred thousand of these systems, AMD would be sad? I bet Lisa would be happy.
I tried renting a system, and putting in a credit card is not enough. That's a red flag for me. I don't want to email, chat with sales, etc, just put in a card number. This works for even GH200 systems over at lambda.
As for number of SKUs, for Blackwell there are a lot, if you believe Jensen, and why wouldn't you? He stated at CES that almost every DC they go into is a bit bespoke with modifications.
AMD seems unable to execute on this, which is reflected in its share price.
That's a number within an order of magnitude, and you're presumably not the largest provider.
> No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
If you own something and you're having problems with it, you're more inclined to try to solve them. If you're renting something and you have problems with it, you're more inclined to rent something else instead.
> These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
Making only 4-socket systems was a choice.
You're also acting like multiple SKUs are something weird. Start offering Ryzen APUs with some on-package GDDR or HBM. Make something that fits in the Threadripper socket and uses PCIe power connectors for extra power. People would buy these things.
The point is to create lots of systems in the hands of lots of people that use the same general hardware architecture so that you're improving its software support.
> provide bug reports that actually describe what's happening
Doesn’t matter if the bug reports are good or bad. Supporting low volume applications is a bad business move when the alternative is 9-figure data center contracts.
The data center business is orders of magnitude larger. Trying to support individual developers would be a huge business mistake when they already can’t keep up with data center.
It's the same hardware running the same software. You want the bug reports so you can fix them and then your data center customers don't encounter them when they're evaluating your product.
What they can keep up with is basically a matter of how much capacity they order from TSMC. If they underestimated demand for some generation, that's the sort of thing you fix with the next contract or you're just throwing money away.
Data center orders are high volume and allow long lead times. You can collect orders, collect money, and then agree when to deliver huge batches of product.
Selling one-off chips isn’t attractive at all in this context. Selling a couple of parts to the rare Blender artist is nothing relative to data center buildouts with billion dollar budgets.
Every one-off part you sell takes resources and inventory away from landing those big contracts.
Supermicro sells them: https://www.supermicro.com/en/accelerators/amd. Other companies probably do too. I'm excited about the ~100W class APUs just announced at CES, hoping for one in a VESA mount format.
>Still, core to core transfers are very rare in practice. I consider core to core latency test results to be just about irrelevant to application performance. I’m only showing test results here to explain the system topology.
How exactly are "applications" developed for this? Or is that all proprietary knowledge? TinyBox has resorted to writing their own drivers for 7900 XTX
ROCm is the stack that people write code against to talk to AMD hardware.
George wrote some incomplete, non-performant drivers for a consumer grade product. Certainly not an easy task, but it also isn't something that most people would use. George just makes loud noises to get attention, but few in the HPC industry pay any attention to him.
Yes, ROCm is for the GPU, but the MI300A also includes 4 clusters of CPUs connected by Infinity Fabric. Generally this kind of thing is handled by the OS, but there is no OS for this product.
AMD has been doing IF-connected CCDs/chiplets for a while now - since Zen 1, released in 2017. All the x86 OSes work fine on each iteration.
Application authors who care about wringing out the last drop of performance need to be mindful about how they manage processes and cache lines on this hardware - as they would on any other architecture.
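For the CPU side, that mostly means keeping each process and its memory on one CCX/NUMA node. A minimal Linux-only sketch (the core ranges here are made up for illustration; on a real box you'd read them from lscpu or numactl --hardware):

    import os

    # Hypothetical mapping of NUMA nodes/CCXs to core IDs on this machine.
    NUMA_NODE_CORES = {0: range(0, 6), 1: range(6, 12),
                       2: range(12, 18), 3: range(18, 24)}

    def pin_to_node(node):
        """Restrict the current process to the cores of one NUMA node."""
        os.sched_setaffinity(0, set(NUMA_NODE_CORES[node]))
        return sorted(os.sched_getaffinity(0))

    print(pin_to_node(0))   # keep this process's memory traffic local to one quadrant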
What do you mean no OS? You log into whatever Linux distribution someone put on it, that's one of the better magic tricks from having a collection of x64 cores on the same chip. Or I suppose you roll a unikernel style system if you want to.
Nobody cares what the HPC industry has to say; until recently, they have happily been jerking off Monte Carlo simulations on overpriced nation-grade supercomputer NUMA clusters and didn't know what a "GPU" was anyway! Also, please stop spreading "consumer grade product" propaganda. I have used AMD Instinct MI50s, supposedly datacenter-grade hardware, and have faced the exact same problems as George. Except in my case there was no call-line at Lisa's.
Guess what, the AI industry has spoken: hyperscalers will buy NVIDIA, or rather design their own silicon. Anything, anyhow, but nothing to do with AMD.
Also: if your business is doing so great, how come you're constantly in all these Hacker News threads talking and talking and talking, but not actually releasing products of any kind, of any breed, that any of the hackers on here could use?
From the looks of it, YOU ARE the product. That is, manufacturing the optics of a "partner" and "distributor" ecosystem for AMD. And on borrowed time, too.
Please don't be salty; the only person here who may embarrass you is yourself. I'm happy that you like to think of yourself as a CEO, but perhaps it's worth reflecting that you might be doing a better job if you spent less time on Hacker News and more time figuring out how to get Hacker News excited about your product. So far you have pledged allegiance to AMD every chance you got, and spun tall tales of great capability, with not much to show for it besides "partners." You know nobody has trained a thing with your GPUs yet? That would be a great place to start for a CEO: to make something people would use. To justify it to us; as AMD themselves have clearly justified your existence, there's no work there!
It's just tough words from a nobody, don't worry you'll be fine!
That is quite a thing. I've been out of the 'design loop' for chips like this for a while, so I don't know if they still do full-chip simulations prior to tapeout, but woah, trying to simulate that thing would take quite the compute complex in itself. Hats off to AMD for getting it out the door.
You basically cannot do anything worthwhile in this space without violating someone's patents. It's beneficial to patent and corporate lawyers, but it's detrimental to innovation. As an engineer you are asked not to look up existing techniques or designs, as this will taint you legally.
"Tainting" isn't a thing in patent law. All engineers worldwide are tainted the moment the patent is published; that's why parallel reinvention is not a defense to patent infringement.
But you pay triple damages if you knowingly vs unknowingly violate a patent (35 U.S.C. § 284). Of course, everything is patented, so, engineers are just told to not read patents.
MI300 is an insanely good GPU. There is nothing that Nvidia sells that even comes close. The H100 only has 80GB of memory, whereas MI300 has 192GB. If you are training large models, AMD is the way to go.
Funny you say that, because nobody serious about AI is actually using Nvidia unless they're already locked in with CUDA.
Highest performing inference engines all use Vulkan, and are either faster per dollarwatt on the CDNA3 cards or (surprisingly) the RDNA3 cards, not Lovelace.
Meta has an in-house accelerator that the Triton inference engine supports (which they use almost exclusively for their fake content/fake profiles project). Triton is legacy software and, afaik, does not have a Vulkan backend, so Meta may be locked out of better options until it does.
That doesn't stop Meta's Llama family of models running on anything and everything _outside_ of Meta, though. Llama.cpp works on everything, for example, but Meta doesn't use it.
CUDA lock-in is not what it once was. I do a lot of stable diffusion and I was pleasantly surprised that I could just run the same code on AMD with no changes.
Yeah, the labour involved in running non Nvidia equipment is the elephant in the room.
Nvidia GPU: spin up OS, run your sims or load your LLM, gather results.
AMD GPU: spin up OS, grok driver fixes, try and run your sims, grok more driver fixes, can't even gather results until you can verify software correctness of your fixes. Yeah, sometimes you need someone with specialized knowledge of numerical methods to help tune your fixes.
... What kind of maddening workflows are these? It's literally negative work: you are busy, you barely get anywhere, and you end up having to do more.
In light of that, the Nvidia tax doesn't look so bad.
I've often thought that one of the places AMD could distinguish itself from NVIDIA is bringing significantly higher amounts of VRAM (or memory systems that are as performant as what we currently know as VRAM) to the consumer space.
A card with a fraction of the FLOPS of cutting-edge graphics cards (and ideally proportionally less power consumption), but with 64-128GB VRAM-equivalent, would be a gamechanger for letting people experiment with large multi-modal models, and seriously incentivize researchers to build the next generation of tensor abstraction libraries for both CUDA and ROCm/HIP. And for gaming, you could break new grounds on high-resolution textures. AMD would be back in the game.
Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front, so let's pop on over and see what's happening in this article...
> An Infinity Cache hit has a load-to-use latency of over 140 ns. Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing Infinity Cache of course drives latency up even higher, to a staggering 227 ns. HBM stands for High Bandwidth Memory, not low latency memory, and it shows.
Welp. Guess my wish isn't coming true today.
> Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front
It is close to VRAM*, just not close to DRAM on a conventionally designed CPU. This thing is effectively just a GPU that fits in a CPU slot and has CPU cores bolted to the side. This approach has the downside of worse CPU performance and the upsides of orders of magnitude faster CPU<->GPU communication, simpler programming since coherency is handled for you, and access to substantial amounts of high bandwidth memory (up to 512GB with 4 MI300As).
* https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4...
I was curious because given the latencies between the CCXs, the number of NUMA domains seems small.
Assuming we are comparing chips that are using the latest generation/high density memory modules, a wider bus width is required for larger memory counts, which is expensive when it comes to silicon area. Therefore, if AMD is willing to boost up memory count as a competitive advantage, they may as well also consider using that die space for more logic gates as well. It's a set of trade-offs and an optimization problem to some degree.
That said, when an incumbent has a leadership advantage, one of the obvious ways to boost profit is to slash the memory bus width, and then a competitor can come in and bring it up a bit and have a competitive offering. The industry has certainly seen this pattern many times. But as far as AMD coming in and using gigantic memory counts as a competitive advantage? You have to keep in mind the die space constraints.
Well over a decade ago - I think it was R600 - AMD did take this approach, and it was fairly disastrous because the logic performance of the chip wasn't good enough while the die was too big and hot and yields were too low. They didn't strike the right balance and sacrificed too much for a 512-bit memory bus.
AMD has also tried to sidestep some of these limitations with HBM back when it was new technology, but that didn't work out for them either. They actually would have been better off just increasing bus width and continuing to use the most optimized and cost efficient commodity memory chips in that case.
Data center and such may have a bit more freedom for innovation but the consumer space is definitely stuck on the paradigm of GPU plus nearby mem chips, and going outside of that fence is a huge latency hit.
> a wider bus width is required for larger memory counts
Most video cards wire up 32 data pins to each memory chip. But GDDR chips already have full support for running 16 pins to each chip. And DDR commonly goes down to 4 data pins per chip.
The latest GDDR7 chips are 24Gbit, and at 16 bits each you could fit 48GB onto a nice easy 256 bit bus, with a speed of at least 1TB/s. If you use 384 bits and/or send 8 to each chip, you can cram in so many chips it becomes a matter of fitting everything.
> a wider bus width is required for larger memory counts, which is expensive when it comes to silicon area
I find this constraint to be rather odd. An extra, say, three address bits would add very little space (or latency in a serial protocol) to a memory bus, and the actual problem seems to be that the current generation of memory chips are intended for point-to-point connection.
It seems to me that, if the memory vendors aren’t building physically larger, higher capacity chips, then any of the major players (AMD, Nvidia, Intel, whoever else is in this field right now) could kludge around it with a multiplexer. A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.
So my assumption is this is mostly an economic issue. The vendors don’t think it’s worthwhile to do this.
GDDR has been point-to-point since... I dunno, probably 2000? Because cet par you can't really have an actual bus when you chase maximum bandwidth. Even the double-sided layouts (like T-layout, with <2mm stubs) typically incur a reduction in data rate. These also dissipate a fair amount of heat, you're looking at around 5-8 W per chip (~6 pJ/bit), it's not like you can just stack a bunch of those dies.
> A multiplexer would need to be somewhat large, but its job would be simple enough that it should be doable with an older, cheaper process and without using entirely unreasonable amounts of power.
I don't know what you're basing that on. We're talking about 32 Gbps serdes here. Yes, there's multiplexers even for that. But what good is deciding which memory chip you want to use on boot-up?
Not multiplexed on boot — multiplexed at run time. Build a chip that speaks the GDDR protocol to the host GPU and has 2-4 GDDR channels coming out the other end and aggregates the attached memory at the cost of an extra chip, some latency, some power, and an extra chip. As far as the GPU is concerned, it’s an extra large GDDR chip, and it would allow a GPU vendor to squeeze in more RAM without adding more pins to the GPU or needing to route more memory channels directly to it.
(Compare to something like Apple’s designs or “Project Digits”. Current- and next-gen GPUs have considerably higher memory bandwidth but considerably less memory capacity. Mostly my point is that I think Nvidia or AMD could make a desktop-style GPU with 2-4x the RAM, somewhat worse latency, but otherwise equivalent performance without needing Samsung or another vendor to build higher capacity GDDR chips than currently exist.)
Bus width they are talking about are multiples of 128. I think Apple m series chips are good examples. They go from 128 to 256 to 512 bits which just happens to be roughly about the megabytes per second bandwidth.
They are. Strix Halo is going after that same space of Apple M4 Pro/Max where it is currently unchallenged. Pairing it with two 64GB LPCAMM2 modules will get you there.
Edit: The problem with AMD is less the hardware offerings, but more that their compute software stack historically tends to handwave or be very slow with consumer GPU support — even more so with their APUs. Maybe the advent of MI300A will change the equation, maybe not.
I don't know of any non-soldered memory Strix Halo devices, but both HP and Asus have announced 128GB SKUs (availability unknown).
For LLM inference, basically everything works w/ ROCm on RDNA3 now (well, Flash Attention is via Triton and doesn't have support for SWA and some other stuff; also I mostly test on Linux, although I did check that the new WSL2 support works). I've tested some older APUs w/ basic benchmarking as well. Notes here for those interested: https://llm-tracker.info/howto/AMD-GPUs
Thanks for that link. I'm interested in either getting the HP Mini Z1 G1a or an NVidia Digits for LLM experimentation. The obvious advantage for the Digits is the CUDA ecosystem is much more tried & true for that kind of thing. But the disadvantage is trying to use it as a replacement for my current PC as well as the fact that it's going to run an already old version of Ubuntu (22.04) and you're dependent on Nvidia for updates.
Yeah, I think anyone w/ old Jetsons knows what it's like to be left high and dry by Nvidia's embedded software support. Older models are basically just ewaste. Since the Digits won't be out until May, I guess there's enough time to wait and see - at least to get a sense of what the actual specs are. I have a feeling the FP16 TFLOPS and the MBW are going to be much lower than what people have been hyping themselves up for.
Sadly, my feeling is that the big Strix Halo SKUs (which have no scheduled release dates) aren't going to be competitively priced (they're likely to be at a big FLOPS/real-world performance disadvantage, and there's still the PITA factor), but there is something appealing about about the do-it-all aspect of it.
DIGITS looks like a serious attempt, but they don’t have too much of an incentive to have people developing for older hardware. I wouldn’t expect them to supor it for more than five years. At least the underlying Ubuntu will last more than that and provide a viable work environment far beyond the time it gets really boring.
If only they could get their changes upstreamed to Ubuntu (and possible kernel mods upstreamed), then we wouldn't have to worry about it.
Getting their kernel mods upstreamed is very unlikely, but they might provide just enough you can build a new kernel with the same major version number.
Who said anything about Ubuntu 22.04? I mean sure that's the newest release current jetpack comes with, but I'd be surprised if they shipped digits with that.
Doesn’t DGX OS use the latest LTS version? Current should be 24.04.
I wouldn't know. I only work with workstation or jetson stuff.
The DGX documentation and downloads aren't public afaik.
Edit: Nevermind, some information about DGX is public and they really are on 22.04, but oh well, the deep learning stack is guaranteed to run.
https://docs.nvidia.com/base-os/too
> Pairing it with two 64GB LPCAMM2 modules will get you there.
It gets you closer for sure. But while ~250GB/s is a whole lot better than SO-DIMMs at ~100GB/s, the new mid-tier GPUs are probably more like 640-900GB/s.
I wholeheartedly agree. Nvidia is intentionally suppressing the amount of memory on their consumer GPUs to prevent data centers from using consumer cards rather than their far more expensive counterparts. The fact that they used to offer the 3060 with 12GB, but have now pushed the pricing higher and limited many cards to 8GB is a testament to the fact they are. I don’t need giga-TOPS with 8-16gb of memory, I’d be perfectly happy with half that speed but with 64gb of memory or more. Even slower memory would be fine. I don’t need 1000t/s, but being able to load a reasonable intelligent model even at 50t/s would be great.
Getting to 50 tok/s for a big model requires not just memory, but also memory bandwidth. Currently, 1TB/s of MBW will get a 70B Q4 (~40GB) model to about 20-25 tok/s. The good thing is models continue to get smarter - today's 20-30B models beat out last years 70B models on most tasks and the biggest open models like DeepSeek-v3 might have lots of weights, but actually a relatively reasonable # of activations/pass.
You can test out your half the speed but w/ 64GB or more of memory w/ the latest Macs, AMD Strix Halo, or the upcoming Nvidia Digits, though. I suspect by the middle of the year there will be a bunch of options in the ~$3K range. Personally, I think I'd rather go for 2 x 5090s for 64GB of memory at 1.7TB/s than 96 or 128GB w/ only 250GB/s of MBW.
A Mac with that memory will have closer to 500GB/s but your point still stands.
That said, if you just want to play around, having more memory will let you do more interesting things. I’d rather have that option over speed since I won’t be doing production inference serving on my laptop.
Yeah, the M4 Max actually has pretty decent MBW - 546 GB/s (cheapest config is $4.7K on a 14" MBP atm, but maybe there will be a Mac Studio at some point). The big weakness for the Mac is actually the lack of TFLOPS on the GPU - the beefiest maxes out at ~34 FP16 TFLOPS. It makes a lot of use cases super painful, since prefill/prompt processing can take minutes before token generation starts.
You're not the only one thinking that: https://www.nvidia.com/en-us/project-digits/
128G of unified memory. $3K. Throw ollama and ComfyUI on that sucker and things could get interesting. The question is how much slower than a 5090, is this gonna be? The memory bandwidth isn't going to match a 512 bit bus.
It's going to be waaay slower than a 5090. We're looking at something like 60W TDP for the entire system vs 600W for a 5090 GPU.
It's going to be very energy efficient, it will get plenty of flops, but they won't be able to cheat physics.
AFAIK this uses even slower memory.
And a fraction of the 5090 cores.
I think digits is STARTS AT $3k. We'll see.
It's LPDDR5.
That's actually a good thing. That's how you get a ton of DRAM without it costing a fortune. M2 Ultra is able to get GPU-like 800GB/sec with DDR4. From that it follows that if you can design a specialized chip, you can get a respectable 1 TB/sec quite easily with LPDDR5, provided that you're willing to design a chip to support a ton of memory channels (and potentially also a wider memory bus). In fact, I'm baffled that such devices don't already exist outside Apple's product line. Seems like a rather obvious thing to do, and Apple has a "proof of concept" already. I can think of at least four companies off the top of my head that could do it quite easily, besides Apple.
> AMD would be back in the game.
The market for prosumer cards with high VRAM and low FLOPS would be negligibly small. The data center market is massive on one end and the gaming market is big on the other. Casual consumers who just want a lot of VRAM are such a small minority of people that it doesn’t matter to the bottom line.
It also wouldn’t be financially advantageous to divert RAM chips away from data center production. We don’t have a surplus of chips waiting to be installed, so building out high VRAM but affordable cards would only take away from higher margin products in the datacenter space.
> The market for prosumer cards with high VRAM and low FLOPS would be negligibly small. The data center market is massive on one end and the gaming market is big on the other. Casual consumers who just want a lot of VRAM are such a small minority of people that it doesn’t matter to the bottom line.
I'm sure this is also what AMD is thinking, and it's also why they will never catch up to NVidia in ecosystem and software support.
It's not for the casual consumers, and it's not supposed to make money directly! You want these high VRAM SKUs to attract enthusiast and researchers. I have read a staggering amount of research papers where the authors used some random consumer NVidia GPU. Do you know how many I've read which used AMD GPUs? Big fat ZERO! You want to incentivize these people to use your hardware? You want to get devs to support your platform? Give them a unique value proposition that the competition won't.
I'm currently waiting for the 5090 to be available, and I'm going to buy two of them. If AMD would have released a GPU at a fair price, with reasonable performance and double the VRAM that NVidia offers, do you know what would I do? I would buy two AMD cards instead, port my software to it, and contribute PRs to any upstream software that I use so that it works with these cards. But alas, here we are.
> You want these high VRAM SKUs to attract enthusiast and researchers. I have read a staggering amount of research papers where the authors used some random consumer NVidia GPU. Do you know how many I've read which used AMD GPUs? Big fat ZERO!
I'm just sitting here wondering how you think this affects anything? Enterprise doesn't buy DC cards based on research papers so why does it matter if research papers are or aren't written against one brand or the other.
[dead]
You might be true for the market.
However, that target audience, those hobby enthusiasts, hobby developers, also university labs with low budget, those are the people who will develop the future open source frameworks, and ultimately/implicitly those are the people who can have a quite big impact on the future development of brand recognition and the open source ecosystem around the hardware. Those people can shape the future trends.
So, only looking at the market, how much units you would sell here, that totally ignores the impact this might have indirectly in the future.
> However, that target audience, those hobby enthusiasts, hobby developers, also university labs with low budget, those are the people who will develop the future open source frameworks,
No they're not. Y'all are deluded. There's a reason why the are only two real DNN frameworks and both of them are developed at the two biggest tech companies in the world.
> The market for prosumer cards with high VRAM and low FLOPS would be negligibly small.
I don't agree. I regularly get VSCode crashing because it ran out of VRAM.
8GB VRAM starts to feel cramped when you have to composite multiple web browsers (aka Electron apps) onto your 4K monitor screen.
nVidia not offering 16GB on consumer level cards is purely a market segmentation strategy and AMD should make them pay for it.
Actually there's a lot of demand in the AI data center space for such a card, such as for running large mixture of experts (MoE) models -- e.g. DeepSeek v3, which is one of the best LLMs in the world today.
Although AMD would need to greatly improve their entire software stack to make running AI models on AMD an attractive proposition.
Totally normal latencies for a GPU though.
The problem with only providing VRAM is that some AI things like real time audio processing under preform significantly because it does not have the equivalent of tensor cores to keep up. There are LLM's that won't run for the same reason. You will have more than enough VRAM but not enough tensor cores. AMD isn't able to compete.
If, by the grace of tech Jesus, amd gave us such systems at volumes Nvidia would notice, Nvidia would simply then do the same but with a better ecosystem.
The biggest problem for AMD is not that the majority of people want to use AMD. It is that the majority of people want AMD to be more competitive so that Nvidia will be forced to drop prices so that people can afford Nvidia products.
Until this pattern changes, AMD has a big uphill battle. Same for Intel, except Intel is at least seemingly doing great gen/gen improvements in mid/low range consumer GPUs and bringing healthy vram along for the ride.
The same could ba said for CPUs from Intel and AMD 5 years ago. Now people, myself included, buy AMD because it is simply the better choice.
The difference with AMD and Intel when zen launched is that AMD launched a product that utterly destroyed Intel’s line up in productivity workloads. Zen 1 launched with double the cores of the competing intel chip at the same price point. The benchmarks were a bloodbath and intel struggled to respond with a competitive product for 4 years. Arguably they still haven’t caught up. AMD just brutally out executed Intel.
Doing that to nvidia would be a tall order
Core wise Intel had the advantage until the last generation or two. The same can be true for gpus, just add a ton more memory and watch them fly off the shelves.
Intel can match or outperform Zen 5 in many benchmarks (X3D still trashes them in games) and are trading blows now, they just have to use double the power envelope to do it.
Arc and Battlemage are not very competitive designs with AMD going off die size and transistor count compared to the performance numbers they're posting. Battlemage pricing however is quite good on price to performance, but again suffers from efficiency where AMD has them beat by quite a margin.
Intel P cores still do well against amd zen5. But their stacking cache is chef's kiss.
> The same can be true for gpus, just add a ton more memory and watch them fly off the shelves.
Yeah... for datacenters and people attempting to jump on the AI hype train. Meanwhile your everyday regular gamer has zero chance competing for GPUs with the infinite money coffers from AI.
Seriously, the sooner this crazy bubble bursts the better. I thought the shitcoin mining days were bad but at least everyone back then knew the game for GPUs was over once the first Bitcoin ASIC was released, but now? No end in sight and frankly I'm pissed.
The AI bubble will burst the same way the internet bubble did:
First explosively, then never.
It can change quickly. Great example is the short domination of the ati 9700 that crushed nvidia for a short while.
> If, by the grace of tech Jesus, amd gave us such systems at volumes Nvidia would notice, Nvidia would simply then do the same but with a better ecosystem.
Not if they have "a better ecosystem" -- they would continue to charge a premium for that.
Which creates a dilemma for Nvidia. If they would match AMD's pricing, they'd be losing all the money they could get by charging more, which is a ton. Whereas if they charge more, they get more today from the people who pay the premium, but some people are more price sensitive than others, so there are still a lot of people who would buy "lots of VRAM for less money" from AMD. And soon AMD has a lot of users, improves their software support and the difference disappears entirely.
Forcing the larger competitor into that dilemma is very much to the advantage of the smaller competitor.
For traditional LLMs this might be true (especially large MoEs at bs=1) but I highly disagree with "multi-modal models" phrase since most of the models that output in other modalities are generally compute bound. Which means less flops will make the experience so much worse (imagine waiting a couple minutes for an image and hours for videos).
So the 300A is an accelerator coupled with a full 24-core EPYC and 128GB of HBM all on a single chip (or, packaged chiplets, whatever).
Why is it I can't buy a single one of these, on a motherboard, in a workstation format case, to use as an insane workstation? Assuming you could program for the accelerator part, there is an entire world of x86-fixed CAD, engineering, and entertainment industry (rendering, etc) where people want a single, desktop machine with 128GB + of fast ram to number crunch.
There are Blender artists out there who build dual- and quad-RTX 4090 machines with Threadrippers for $20k+ in components all day, because their render jobs pay for it.
There are engineering companies that would not bat an eye at dropping $30k on a workstation if it meant they could spin around 80-gigabyte CATIA models of cars or aircraft loaded in RAM quicker. I know this at least because I sure as hell did so with several HP Z-series machines costing whole-Toyota-Corolla prices over the years...
But these combined APU chips are relegated to these server units. In the end, is this a driver problem? Just a software problem? A chicken-and-egg problem where no one is developing the support because the hardware isn't on the market, and the hardware isn't on the market because AMD thinks there is no use case?
Edit: and note that the use cases I mentioned don't really depend on latency the way gamers need to hit framerates. The cache-miss latency mentioned in the article doesn't matter as much for these types of compute applications, where the main problem is just loading and unloading massive amounts of data: things like offline renders and post-processing of CFD simulations, not a video output framerate.
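A rough back-of-envelope (the bandwidth figures are assumptions for illustration) of why bulk throughput, not per-access latency, dominates those workloads:

    #include <cstdio>

    int main() {
        // Assumed figures, for illustration only.
        const double dataset_bytes = 80e9;    // an ~80 GB CAD/render dataset
        const double pcie_bw       = 64e9;    // ~PCIe 5.0 x16 host link, bytes/s
        const double hbm_bw        = 5.3e12;  // on-package HBM, bytes/s
        const double miss_latency  = 227e-9;  // the article's worst-case miss

        std::printf("initial load over PCIe: %.2f s\n", dataset_bytes / pcie_bw);
        std::printf("one full pass from HBM: %.0f ms\n", 1e3 * dataset_bytes / hbm_bw);
        std::printf("a single cache miss:    %.0f ns\n", miss_latency * 1e9);
        // Seconds spent streaming data dwarf the nanoseconds of any
        // individual miss, which is why offline renders and CFD
        // post-processing care far more about capacity and bandwidth
        // than about hit latency.
        return 0;
    }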
(I run a company that buys MI300x.)
> Why is it I can't buy a single one of these, on a motherboard, in a workstation format case, to use as an insane workstation?
AMD doesn't have the resources to support end users for something like this. They are a public company, look at their spend. They are pouring everything they've got into trying to keep up with the Nvidia release cycle for AI chips.
These chips are cutting edge, they are not perfect. They are still working through the hardware and software issues. It is hard enough to deal with all the public opinion on things as it is. Why would they add another layer of potential abuse?
The people who buy stuff like that are professionals. They often know something about the tools they're using and if there are any problems, provide bug reports that actually describe what's happening instead of some non-descriptive mush like "I have your GPU and Windows crashes sometimes". That is extremely helpful if you're trying to get rid of those bugs.
This is the same reason software shops have found it useful to support Linux, even if not many people use it. The people who do will make your product suck less, which in turn makes it easier to sell to the mass market, which will get upset and think unfavorably of you when they hit the same problem, but won't be as good at telling you about it.
Groq is a good example here:
https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware/
Our users give them plenty of feedback. They just RMA'd a whole bunch of our GPUs over this issue so that they could take them back to the mothership and figure out what's up...
https://github.com/ROCm/ROCm/issues/4021
It takes a lot of coordination across ourselves (and our customers), our DC, AMD, and Dell to make that happen.
It's not that you don't get bug reports from data center customers, it's that data center customers have scale in a bad way. They buy thousands of GPUs, they do whatever they're going to do with them, they have a problem, they report the bug. One bug report across thousands of GPUs, because they're all being used for the same thing by that customer so they only have the problems you have when you try to do that. Another data center buys thousands of GPUs and they're doing something else which is extremely common and well supported, so they don't have any issues and you get zero bug reports from them.
Compare this to, you sell a thousand GPUs to a thousand professionals and 10% of them have some problem, but each a different one. You get 100 bug reports, you fix 100 bugs instead of just one, things improve much faster.
We have 136 of these things. Not thousands. AMD is intentionally keeping their number of providers limited [0](bottom of page).
No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
If you're now suggesting that AMD also release another product that is easier for developers to get their hands on and deploy, then you've totally lost me. You're exponentially increasing the amount of work and money they'd have to spend, for what? Some feedback?
[0] https://www.amd.com/en/products/accelerators/instinct.html
I think you underestimate the people here when you throw around things like "it costs as much as an expensive Ferrari". A lot of us work with systems like these, so we understand why they cost so much and what they can do. On Reddit that framing works; here, I feel it's pretty condescending.
"Intentionally limiting" is just koolaid. It's ok to drink it, it's your business, but it's koolaid. You think if AWS wanted to deploy a couple hundred thousand of these systems, AMD would be sad? I bet Lisa would be happy.
I tried renting a system, and putting in a credit card is not enough. That's a red flag for me. I don't want to email, chat with sales, etc, just put in a card number. This works for even GH200 systems over at lambda.
As for the number of SKUs: for Blackwell there are a lot, if you believe Jensen, and why wouldn't you? He stated at CES that almost every DC they go into is a bit bespoke, with modifications.
AMD seems unable to execute on this, which is reflected in its share price.
> I feel this is pretty condescending
Apologies, not my intention.
> I bet Lisa would be happy.
I bet too! I was referring to neoclouds, not tier 1.
> I tried renting a system, and putting in a credit card is not enough.
You truly don't need to talk to anyone, CC and go: https://www.shadeform.ai/
> AMD seems unable to execute on this, which is reflected in its share price.
I agree, they haven't been doing the best job [0]. Let's hope they can show action and turn it around.
[0] https://x.com/HotAisle/status/1880679135875362839
Ok, maybe it works now just by CC. Glad that's sorted.
AMD is tone deaf unfortunately, but I liked your reply on X.
> We have 136 of these things. Not thousands.
That's a number within an order of magnitude, and you're presumably not the largest provider.
> No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
If you own something and you're having problems with it, you're more inclined to try to solve them. If you're renting something and you have problems with it, you're more inclined to rent something else instead.
> These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
Making only 4-socket systems was a choice.
You're also acting like multiple SKUs are something weird. Start offering Ryzen APUs with some on-package GDDR or HBM. Make something that fits in the Threadripper socket and uses PCIe power connectors for extra power. People would buy these things.
The point is to create lots of systems in the hands of lots of people that use the same general hardware architecture so that you're improving its software support.
> provide bug reports that actually describe what's happening
Doesn’t matter if the bug reports are good or bad. Supporting low volume applications is a bad business move when the alternative is 9-figure data center contracts.
The data center business is orders of magnitude larger. Trying to support individual developers would be a huge business mistake when they already can't keep up with data center demand.
It's the same hardware running the same software. You want the bug reports so you can fix them and then your data center customers don't encounter them when they're evaluating your product.
What they can keep up with is basically a matter of how much capacity they order from TSMC. If they underestimated demand for some generation, that's the sort of thing you fix with the next contract or you're just throwing money away.
Data center orders are high volume and allow long lead times. You can collect orders, collect money, and then agree when to deliver huge batches of product.
Selling one-off chips isn't attractive at all in this context. Selling a couple of parts to the rare Blender artist is nothing relative to the data center buildouts with billion-dollar budgets.
Every one-off part you sell takes resources and inventory away from landing those big contracts.
Supermicro sells them: https://www.supermicro.com/en/accelerators/amd. Other companies probably do too. I'm excited about the ~100W-class APUs just announced at CES; hoping for one in a VESA-mount format.
Does AMD build a DGX-like device?
https://www.nvidia.com/en-us/data-center/dgx-platform/?ncid=...
We are trialing some AMD GPUs right now, otherwise we are all NVIDIA.
It's interesting that two simultaneous and contradictory views are held by AI engineers:
- Software is over
- An impenetrable software moat protects Nvidia's market capitalization
>Still, core to core transfers are very rare in practice. I consider core to core latency test results to be just about irrelevant to application performance. I’m only showing test results here to explain the system topology.
How exactly are "applications" developed for this? Or is that all proprietary knowledge? TinyBox has resorted to writing their own drivers for the 7900 XTX.
ROCm is the stack that people write code against to talk to AMD hardware.
George wrote some incomplete, non-performant drivers for a consumer-grade product. Certainly not an easy task, but it also isn't something that most people would use. George just makes loud noises to get attention, but few in the HPC industry pay any attention to him.
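For a sense of what "writing code against ROCm" looks like at the lowest level, here is a minimal, hedged HIP sketch: a generic vector-add, not anything MI300A-specific. HIP mirrors CUDA's kernel syntax and runtime API (hipMalloc, hipMemcpy, the <<<>>> launch) and is compiled with hipcc:

    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    // Trivial element-wise kernel: c[i] = a[i] + b[i].
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

        float *da, *db, *dc;
        hipMalloc((void**)&da, n * sizeof(float));
        hipMalloc((void**)&db, n * sizeof(float));
        hipMalloc((void**)&dc, n * sizeof(float));
        hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

        // 256 threads per block, enough blocks to cover all n elements.
        vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);

        std::printf("hc[0] = %f\n", hc[0]);  // expect 3.0
        hipFree(da); hipFree(db); hipFree(dc);
        return 0;
    }

Most users never touch this layer directly; they go through libraries (rocBLAS, MIOpen) or frameworks (PyTorch's ROCm builds) that sit on top of it.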
Yes, ROCm is for the GPU, but the MI300A also includes 4 clusters of CPUs connected by Infinity Fabric. Generally this kind of thing is handled by the OS, but there is no OS for this product.
AMD has had APUs for years; the PS5 chip is an APU.
I did a quick Google search and found this presentation, which details the programming model...
https://nowlab.cse.ohio-state.edu/static/media/workshops/pre...
AMD has been doing IF-connected CCDs/chiplets for a while now - since Zen 1, released in 2017. All the x86 OSes work fine on each iteration.
Application authors who care about wringing out the last drop of performance need to be mindful about how they manage processes and cache lines on this hardware, as they would on any other architecture.
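As a hedged sketch of what "being mindful about processes" can mean in practice on a Linux-booted chiplet system like this, here is a minimal thread-pinning example (the core ID is an arbitrary placeholder; in reality you would pick cores from the NUMA topology the OS reports, e.g. via /sys or hwloc; build with g++ on Linux):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to one core so its working set stays in that
    // core's cache and, with first-touch allocation, in local memory.
    static bool pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    int main() {
        if (!pin_to_core(0)) {
            std::fprintf(stderr, "failed to set affinity\n");
            return 1;
        }
        // Allocate and first-touch large buffers from the pinned thread so
        // the pages land on the NUMA node closest to the cores using them.
        std::printf("pinned to core 0\n");
        return 0;
    }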
What do you mean, no OS? You log into whatever Linux distribution someone put on it; that's one of the better magic tricks you get from having a collection of x64 cores on the same chip. Or I suppose you could roll a unikernel-style system if you wanted to.
Nobody cares what the HPC industry has to say; until recently, they have happily been jerking off Monte Carlo simulations on overpriced nation-grade supercomputer NUMA clusters and didn't know what a "GPU" was anyway! Also, please stop spreading "consumer-grade product" propaganda. I have used AMD Instinct MI50s, supposedly datacenter-grade hardware, and faced the exact same problems as George. Except in my case there was no call line to Lisa.
Guess what: the AI industry has spoken. Hyperscalers will buy NVIDIA, or else design their own silicon. Anything, anyhow, but nothing to do with AMD.
Also: if your business is doing so great, how come you're constantly in all these Hacker News threads talking and talking and talking, but not actually releasing products of any kind, of any breed, that any of the hackers on here could use?
> but not actually releasing products of any kind, of any breed, that any of the hackers on here could use?
Our "product" is open access to a very specific type of HPC compute that previously was locked up and only available to a short list of researchers.
Thanks for asking, we just added 1 GPU / 1 minute docker container access through our excellent partners: https://shadeform.ai
1 GPU / 1 VM / 1 minute is coming soon.
From the looks of it, YOU ARE the product. That is, manufacturing the optics of a "partner" and "distributor" ecosystem for AMD. And on borrowed time, too.
> From the looks of it, YOU ARE the product.
Sweet, thanks! That's at least part of what a CEO is supposed to be.
Please don't be salty; the only person here who can embarrass you is yourself. I'm happy that you like to think of yourself as a CEO, but perhaps it's worth reflecting that you might do a better job if you spent less time on Hacker News and more time figuring out how to get Hacker News excited about your product. So far you have pledged allegiance to AMD at every chance and spun tall tales of great capability, with not much to show for it besides "partners." You know nobody has trained a thing with your GPUs yet? That would be a great place to start for a CEO: make something people would use and justify it to us. AMD has clearly already justified your existence, so there's no work to do there!
It's just tough words from a nobody, don't worry you'll be fine!
> You know nobody has trained a thing with your GPUs yet?
https://x.com/zealandic1/status/1877005338324427014
That is quite a thing. I've been out of the 'design loop' for chips like this for a while, so I don't know if they still do full-chip simulations prior to tapeout, but wow, trying to simulate that thing would take quite the compute complex in itself. Hats off to AMD for getting it out the door.
I'm curious why this space hasn't been patented to death.
It has been. All sides have a pile of patents. All sides violate all the other sides’ patents. If anyone sues, everyone goes out of business.
This is the system working as currently intended. No matter what happens, the lawyers will get rich.
If a small company comes in and doesn’t pay the lawyers, it’ll get sued for violating the patents.
Yep, it's an area denial weapon.
You basically cannot do anything worthwhile in this space without violating someone's patents. It's beneficial to patent and corporate lawyers, but it's detrimental to innovation. As an engineer, you are asked not to look up existing techniques or designs, as this will taint you legally.
"Tainting" isn't a thing in patent law. All engineers worldwide are tainted the moment the patent is published; that's why parallel reinvention is not a defense to patent infringement.
But you can be hit with up to triple damages if you knowingly, rather than unknowingly, violate a patent (35 U.S.C. § 284). Of course, everything is patented, so engineers are simply told not to read patents.
Mutually assured destruction
Where do patent-trolls fit in this analogy?
They’re the backstreet gangs that rob single missile silos from failing states and ransom you.
> If a small company comes in and doesn’t pay the lawyers, it’ll get sued for violating the patents.
This assumes the small company isn't just in it for the patents.
The MI300 is an insanely good GPU. There is nothing Nvidia sells that even comes close. The H100 only has 80GB of memory, whereas the MI300X has 192GB. If you are training large models, AMD is the way to go.
The H200 has more memory, and the B200 takes it even further with cluster-wide NVLink.
AMD has zero response to it.
1.8TB/s interconnect, check.
AWS has this on their new platform as well.
The H200 has less memory than the MI300X (141GB vs 192GB).
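For a sense of why the capacity numbers being thrown around above matter, a quick hedged arithmetic sketch (the parameter count and precision are assumptions for illustration, not a statement about any particular deployment):

    #include <cstdio>

    int main() {
        const double params          = 70e9;  // assumed 70B-parameter model
        const double bytes_per_param = 2.0;   // fp16/bf16 weights
        const double weights_gb      = params * bytes_per_param / 1e9;  // ~140 GB

        std::printf("weights alone: %.0f GB\n", weights_gb);
        std::printf("fits on one 192 GB part: %s\n", weights_gb <= 192 ? "yes" : "no");
        std::printf("fits on one  80 GB part: %s\n", weights_gb <= 80 ? "yes" : "no");
        // Training needs several times more (gradients, optimizer state,
        // activations), so big runs shard across many GPUs either way;
        // per-GPU capacity mainly changes how many you need.
        return 0;
    }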
AMD is done; no one uses their GPUs for AI because AMD was too dumb to understand the value of software lock-in the way Nvidia did with CUDA.
Funny you say that, because nobody serious about AI is actually using Nvidia unless they're already locked in with CUDA.
The highest-performing inference engines all use Vulkan, and are faster per dollar-watt on either the CDNA3 cards or (surprisingly) the RDNA3 cards than on Lovelace.
> Funny you say that, because nobody serious about AI is actually using Nvidia unless they're already locked in with CUDA.
Yeah, right, so Meta and xAI buying hundreds of thousands of Nvidia H100s was because they were not serious about AI. wtf
Meta has an in-house accelerator that the Triton inference engine supports (which they use almost exclusively for their fake content/fake profiles project). Triton is legacy software and, afaik, does not have a Vulkan backend, so Meta may be locked out of better options until it does.
That doesn't stop Meta's Llama family of models from running on anything and everything _outside_ of Meta, though. Llama.cpp works on everything, for example, but Meta doesn't use it.
CUDA lock-in is not what it once was. I do a lot of Stable Diffusion work, and I was pleasantly surprised that I could just run the same code on AMD with no changes.
More like the value of drivers that don't require one in-house team per customer to "fix" driver crashes in that customer's particular workloads.
Yeah, the labour involved in running non-Nvidia equipment is the elephant in the room.
Nvidia GPU: spin up OS, run your sims or load your LLM, gather results.
AMD GPU: spin up OS, grok driver fixes, try to run your sims, grok more driver fixes, and you can't even gather results until you've verified the software correctness of your fixes. Yeah, sometimes you need someone with specialized knowledge of numerical methods to help tune your fixes.
... What kind of maddening workflows are these? It's literally negative work: you are busy, you barely get anywhere, and you end up having to do more.
In light of that, the Nvidia tax doesn't look so bad.