Show HN: Nvidia's CUDA libraries are generic and not optimized for LLM inference

1 point | by venkat_2811 11 hours ago

1 comment

venkat_2811 11 hours ago
With so many improvements in LLM inference kernels, inter-GPU communication is becoming the bottleneck. Introducing my project, YALI: Yet Another Low-Latency Implementation.
It is a custom CUDA kernel library that provides ultra-low-latency primitives for inter-GPU communication collectives. It achieves 80-85% speed-of-light (SoL) software efficiency on p2p all_reduce_sum over NVLink on 2x A100 GPUs.
It outperforms NVIDIA NCCL by 2.4x and delivers over 50x more stable tail latency.
https://venkat-systems.bearblog.dev/yali-vs-nvidia-nccl/
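
For readers unfamiliar with the speed-of-light efficiency figure quoted above, here is a rough sketch of how such a number is usually computed: achieved bandwidth divided by the theoretical peak of the link. The function name, the 300 GB/s per-direction peak, and the message size and latency are all assumptions for illustration; they are not taken from YALI or from the linked write-up, which may define the metric differently.

```python
def sol_efficiency(bytes_on_wire: float, latency_s: float,
                   peak_bytes_per_s: float) -> float:
    """Fraction of an assumed peak link bandwidth that was achieved.

    bytes_on_wire: bytes sent in one direction during the collective.
    latency_s: measured wall-clock time of the collective, in seconds.
    peak_bytes_per_s: theoretical one-direction link bandwidth.
    """
    if latency_s <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("latency and peak bandwidth must be positive")
    achieved_bytes_per_s = bytes_on_wire / latency_s
    return achieved_bytes_per_s / peak_bytes_per_s


if __name__ == "__main__":
    # Hypothetical inputs, NOT measurements from the post.
    assumed_peak = 300e9            # assumed per-direction NVLink peak, bytes/s
    message = 64 * 1024 * 1024      # 64 MiB sent per direction
    latency = 270e-6                # 270 microseconds
    print(f"SoL efficiency: {sol_efficiency(message, latency, assumed_peak):.1%}")
```

The same helper can be applied to NCCL and YALI timings for an identical message size to compare the two on equal terms, provided the same peak is used for both.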