4-bit Quant Showdown: Finding the Sweet Spot for Qwen3 Models
People: David
Idea: To set up our Kumubot cluster, I went down a rabbit hole benchmarking different quantization methods for Qwen3 8B and 32B models to see which ones actually deliver the best accuracy-to-speed tradeoff in real-world use
Details:
• ExLlamaV3-4bpw surprisingly beat even BF16 on LiveBench accuracy (60.0 vs 58.2) - turns out quantization can sometimes act like a regularizer
• AWQ-4bit is your speed demon for interactive use - it had the fastest single-request latency across the board (see the latency sketch after this list)
• NVFP4 pulls ahead once you're juggling 10+ concurrent requests, making it the throughput champion (see the concurrency sketch after this list)
• GGUF formats (Q4_K_M, UD-Q4_K_XL) are accuracy monsters but run like molasses - great for offline batch processing (see the batch sketch after this list)
• Different quants can excel at different tasks: in this testing, ExLlama crushes coding/math, Q4_K_M owns instruction following, UD-Q4_K_XL dominates reasoning
• The 8B model tells a different story - GGUF formats win on accuracy while AWQ still takes the speed crown
• Apple folks: MLX 4-bit DWQ is your best bet, beating both standard MLX-4bit and MXFP4-MLX (see the mlx-lm sketch after this list)
• Wall-clock times are brutal - BF16 takes 5.5x longer than NVFP4 to run the same LiveBench suite
• The "just use FP8" advice doesn't hold up - it's middle-of-the-pack on both speed and accuracy (given the current software/kernel situation for Blackwell cards)
• Bottom line: pick ExLlama for max accuracy at 4-bit, AWQ for snappy chat, and NVFP4 for Blackwell-card experiments (it may eventually end up being the best of the bunch)
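
If you want to poke at the single-request latency claim yourself, a minimal sketch with vLLM's offline API is below. This is illustrative rather than the exact harness used for the numbers above; the model id and prompt are placeholders, so swap in whichever AWQ checkpoint you're testing.

```python
# Minimal single-request latency probe with vLLM (assumed setup, not the
# original benchmark harness). Model id below is a placeholder.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B-AWQ", quantization="awq")
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
out = llm.generate(["Explain KV-cache paging in two sentences."], params)
elapsed = time.perf_counter() - start

tokens = len(out[0].outputs[0].token_ids)
print(f"{elapsed:.2f}s end-to-end, {tokens / elapsed:.1f} tok/s single-request")
```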
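For the 10+ concurrent-request case, something like the asyncio probe below against an OpenAI-compatible endpoint (e.g. a local `vllm serve` instance) gives an aggregate tokens/sec figure. The endpoint URL and served model name are placeholders, and again this is a sketch of the idea, not the exact setup behind the numbers above.

```python
# Rough aggregate-throughput probe: fire N requests concurrently at an
# OpenAI-compatible server. URL and model name are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="qwen3-32b-nvfp4",  # placeholder served-model name
        messages=[{"role": "user", "content": f"Summarize topic #{i} in 50 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 16) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} requests, {sum(counts)} completion tokens, "
          f"{sum(counts) / elapsed:.1f} tok/s aggregate")

asyncio.run(main())
```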
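For the GGUF "offline batch" use case, a minimal llama-cpp-python sketch looks like the following; the .gguf path is a placeholder and the settings are just reasonable defaults, not the configuration used for the benchmark.

```python
# Offline batch sketch for a GGUF quant via llama-cpp-python.
# model_path is a placeholder; point it at your local Q4_K_M file.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-32b-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload every layer that fits on the GPU
    n_ctx=8192,
)

prompts = [
    "Prove that sqrt(2) is irrational.",
    "Write a bash one-liner that removes duplicate lines from a file.",
]
for p in prompts:
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": p}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"][:200])
```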
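And for the Apple-silicon crowd, a minimal mlx-lm sketch; the repo id is a guess at an mlx-community DWQ conversion, so substitute whichever 4-bit DWQ checkpoint you actually use.

```python
# Quick MLX generation sketch (assumed setup); repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit-DWQ")  # placeholder repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain speculative decoding briefly.",
    max_tokens=256,
    verbose=True,  # prints tokens/sec, handy for the speed comparison
)
print(text)
```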