How Much GPU Memory Does NexusQuant Actually Save?

Source: DEV Community
KV cache compression numbers like "10x" sound impressive in a paper. But what does that mean in practice, for a real GPU, serving real users? Let me give you a concrete memory calculator so you can answer this for your own setup.

The KV Cache Formula

For any transformer, the KV cache size is:

    KV_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

The 2 is for keys AND values. bytes_per_element is 2 for FP16, 4 for FP32. Note that for models with grouped-query attention, num_kv_heads is the number of KV heads, not the (larger) number of query heads.

For Mistral-7B (32 layers, 8 KV heads, head_dim = 128, FP16):

    KV_bytes = 2 × 32 × 8 × 128 × seq_len × 2 = 131,072 × seq_len bytes ≈ 128 KB per token

At 128K tokens: 131,072 bytes × 131,072 tokens = 16 GiB (≈ 17.2 GB) just for the KV cache.

GPU Memory Table

Here's what that means on real hardware:

| GPU | VRAM | Max KV tokens (FP16, no NQ) | With NexusQuant 10x | With NexusQuant 17x | With NexusQuant 33x |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A10G | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A100 40GB | 40 GB | ~256K | ~2.6M | ~4.4M | ~8.5M |
| A100 80GB | 80 GB | ~512K | ~ | | |
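The formula and the table above can be turned into a small calculator. This is a sketch under the article's assumptions; the function names (`kv_bytes_per_token`, `max_kv_tokens`) are illustrative, not from any library, and the max-token figure is a raw VRAM budget that does not subtract model weights or activation memory.

```python
# Sketch of a KV-cache memory calculator based on the formula above.
# Helper names here are illustrative, not from any library.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache per token: 2 (K and V) x layers x KV heads x head_dim x dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def max_kv_tokens(vram_bytes: int, per_token: int, compression: float = 1.0) -> int:
    """How many cached tokens fit in a raw VRAM budget at a given compression ratio.

    Does NOT account for model weights or activations, so real limits are lower.
    """
    return int(vram_bytes * compression // per_token)

# Mistral-7B: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes/element)
per_tok = kv_bytes_per_token(32, 8, 128, 2)
print(per_tok)                            # 131072 bytes = 128 KB per token
print(per_tok * 131_072 / 2**30)          # 16.0 GiB for a 128K-token context
print(max_kv_tokens(24 * 2**30, per_tok)) # raw ceiling for a 24 GB card, no compression
```

Plug in your own GPU's VRAM and a compression ratio to reproduce (roughly) the table rows; the published ~150K figure for 24 GB cards is lower than the raw ceiling, presumably leaving headroom for weights and activations.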