Many people have been asking them for this sort of content, and it's happening. Couldn't be more excited. Also note that it's AMD, but not AMD: it's being published in the open on an individual's GitHub rather than through official channels.
Whenever I see code like this, I start to think that GPUs are uniquely unsuited for matrix multiplication.
You're pretending that each streaming multiprocessor handles independent threads, when in reality you're feeding hardware that only exists once or twice per SM. It's like independently controlling one out of 32 cars on a 32-lane highway where the cars aren't allowed to switch lanes and the controls of one car are replicated to all the others, when in reality everyone is sitting in the same bus.
I'm not sure I follow. Matrix multiplication isn't inherently 'branchy' in a way that we would expect to cause inefficient execution on SIMT (i.e. branch divergence).
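To make that concrete, here's a toy CUDA kernel (mine, hypothetical, not from the article) with the kind of data-dependent branch that does hurt under SIMT; a dense matmul inner loop simply doesn't contain anything like it:

```cuda
// Toy illustration: a warp of 32 threads shares one instruction stream, so a
// data-dependent branch forces the hardware to run both sides one after the
// other, with part of the warp masked off each time.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)              // lanes of the same warp may disagree here...
        out[i] = in[i] * 2.0f;
    else
        out[i] = -in[i];           // ...so both paths execute serially per warp
}

// A dense matmul inner loop is just the same multiply-accumulate in every
// lane, which is exactly what the lockstep model is good at.
```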
I think the remark is more about how Tensor Cores (or Matrix Cores, in AMD lingo) are distributed per SM, rather than sitting off on an interconnect and being individually programmable. So on the same SM you have your classical warps (CUDA cores) AND the tensor units, and switching between one and the other can be confusing.
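Rough sketch of what that mix looks like from one warp's point of view (CUDA/wmma, my own toy code, not the article's HIP; the kernel name and tile sizes are made up; the AMD-side equivalent would be MFMA via rocWMMA):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with one warp (32 threads); needs sm_70+ for wmma.
// The same warp does ordinary per-lane work on the CUDA cores and then issues
// a warp-wide 16x16x16 tile multiply that runs on the SM's Tensor Cores.
__global__ void mixed_units(const half *A, const half *B, float *C)
{
    int lane = threadIdx.x;  // 0..31 within the warp

    // Tensor Core part: all 32 lanes cooperate on a single tile operation.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);               // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);                     // whole warp, Tensor Core
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
    __syncwarp();

    // "Classical" SIMT epilogue on the CUDA cores: each lane independently
    // touches a few of the 256 output elements.
    for (int i = lane; i < 16 * 16; i += 32)
        C[i] *= 2.0f;
}
```

Same SM, same warp, two very different execution units, which I think is where the confusion comes from.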
My mental model of SMs has always been "assume AVX512 is the default ISA" and "tensor cores are another layer alongside that" (kind of like AMX), and you have this heterogeneous "thing" to program. Don't know if that helps. The CUDA programming model hides a lot, and looking at the PTX in nsight-compute is most enlightening.
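Roughly what I mean, as a toy sketch (the saxpy example and function names are mine, just to make the analogy concrete; the host part assumes an AVX-512 capable compiler and CPU):

```cuda
#include <immintrin.h>

// "SIMD view" (host): one instruction, 16 float lanes per AVX-512 register.
void saxpy_avx512(float a, const float *x, float *y, int n)
{
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 xv = _mm512_loadu_ps(x + i);
        __m512 yv = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(_mm512_set1_ps(a), xv, yv));
    }
}

// "SIMT view" (device): the same operation written per lane; the warp
// supplies the width (32 lanes), and the hardware keeps them in lockstep.
__global__ void saxpy_warp(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(a, x[i], y[i]);   // one FMA per lane
}
```

Tensor cores then sit on top of that, much as AMX sits beside AVX-512: a separate matrix-tile unit driven from the same code.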
Glad to see more articles out there using AMD hardware acceleration, especially for matrix math. More diversity in this space is welcome.