OP here. This is Part 2 of my reproduction series. I scaled the experiment from 10M params (MacBook) to 1.7B params (8x H100s) to test DeepSeek's instability claims.
The paper reported 3,000x signal amplification. I found 10,924x.
The "Instability Bomb" findings:
- The Scaling Law: The amplification gets strictly worse with scale. 10M → 9x, 1.7B → ~10,000x.
- The Culprit: Layer 0. The first mixing matrix consumes raw embeddings with no LayerNorm in front of it and immediately amplifies them (see the measurement sketch after this list).
- The Twist: Despite 10,000x amplification, the model didn't diverge. It kept learning, likely saved by gradient clipping.
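For anyone who wants to reproduce the per-layer measurement: here's a minimal sketch of how amax tracking with forward hooks can work. This is not my actual training code; it assumes a PyTorch model, and the module names in the usage comment are placeholders.

```python
import torch
import torch.nn as nn

def register_amax_hooks(model: nn.Module):
    """Attach forward hooks that record the max-abs activation per leaf module."""
    amax_log = {}

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                amax_log[name] = output.detach().abs().max().item()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if len(list(m.children())) == 0]  # leaf modules only
    return amax_log, handles

# Usage sketch: run one batch, then compare Layer 0's amax to the embedding amax.
# amax_log, handles = register_amax_hooks(model)
# _ = model(batch)
# ratio = amax_log["layers.0.mixing"] / amax_log["embed"]  # hypothetical names
```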
I’ve posted the full logs and Amax graphs in the post. Happy to answer questions about the H100 cluster setup or the Sinkhorn projection math.
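Since a few people asked about the Sinkhorn part: the core of the projection is just alternating row/column normalization to push a positive score matrix toward doubly stochastic (Sinkhorn-Knopp). Below is a bare-bones sketch of that iteration, not the exact code from the run; a production version typically works in log-space for numerical stability.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8):
    """Project a square score matrix onto an (approximately) doubly stochastic
    matrix by alternately normalizing rows and columns (Sinkhorn-Knopp)."""
    P = torch.exp(logits - logits.max())  # positive matrix, crudely stabilized
    for _ in range(n_iters):
        P = P / (P.sum(dim=-1, keepdim=True) + eps)  # rows sum to 1
        P = P / (P.sum(dim=-2, keepdim=True) + eps)  # columns sum to 1
    return P
```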