4-bit Quant Showdown: Finding the Sweet Spot for Qwen3 Models
People: David
Idea: To set up our Kumubot cluster, I went down a rabbit hole benchmarking different quantization methods for Qwen3 8B and 32B models to see which ones actually deliver the best accuracy-to-speed tradeoff in real-world use
Details:
• ExLlamaV3-4bpw surprisingly beat even BF16 on LiveBench accuracy (60.0 vs 58.2) - turns out quantization can sometimes act like a regularizer
• AWQ-4bit is your speed demon for interactive use - it had the fastest single-request latency across the board (see the latency sketch after this list)
• NVFP4 pulls ahead once you're juggling 10+ concurrent requests, making it the throughput champion (see the concurrency sketch after this list)
• GGUF formats (Q4_K_M, UD-Q4_K_XL) are accuracy monsters but run like molasses - great for offline batch processing (see the batch sketch after this list)
• Different quants can excel at different tasks: in this testing, ExLlama crushes coding/math, Q4_K_M owns instruction following, UD-Q4_K_XL dominates reasoning
• The 8B model tells a different story - GGUF formats win on accuracy while AWQ still takes the speed crown
• Apple folks: MLX 4-bit DWQ is your best bet, beating both standard MLX-4bit and MXFP4-MLX (see the mlx-lm sketch after this list)
• Wall-clock times are brutal - BF16 takes 5.5x longer than NVFP4 to run the same LiveBench suite
• The "just use FP8" advice doesn't hold up - it's middle-of-the-pack on both speed and accuracy (given the current software/kernel situation for Blackwell cards)
• Bottom line: pick ExLlama for max accuracy at 4-bit, AWQ for snappy chat, and NVFP4 for Blackwell-card experiments (it may eventually end up being the best of the bunch)
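
If you want to poke at the single-request latency claim yourself, a minimal sketch with vLLM's offline API is below. This is illustrative rather than the exact harness used for the numbers above; the model id and prompt are placeholders, so swap in whichever AWQ checkpoint you're testing.

```python
# Minimal single-request latency probe with vLLM (assumed setup, not the
# original benchmark harness). Model id below is a placeholder.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B-AWQ", quantization="awq")
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
out = llm.generate(["Explain KV-cache paging in two sentences."], params)
elapsed = time.perf_counter() - start

tokens = len(out[0].outputs[0].token_ids)
print(f"{elapsed:.2f}s end-to-end, {tokens / elapsed:.1f} tok/s single-request")
```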
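For the 10+ concurrent-request case, something like the asyncio probe below against an OpenAI-compatible endpoint (e.g. a local `vllm serve` instance) gives an aggregate tokens/sec figure. The endpoint URL and served model name are placeholders, and again this is a sketch of the idea, not the exact setup behind the numbers above.

```python
# Rough aggregate-throughput probe: fire N requests concurrently at an
# OpenAI-compatible server. URL and model name are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="qwen3-32b-nvfp4",  # placeholder served-model name
        messages=[{"role": "user", "content": f"Summarize topic #{i} in 50 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 16) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} requests, {sum(counts)} completion tokens, "
          f"{sum(counts) / elapsed:.1f} tok/s aggregate")

asyncio.run(main())
```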
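For the GGUF "offline batch" use case, a minimal llama-cpp-python sketch looks like the following; the .gguf path is a placeholder and the settings are just reasonable defaults, not the configuration used for the benchmark.

```python
# Offline batch sketch for a GGUF quant via llama-cpp-python.
# model_path is a placeholder; point it at your local Q4_K_M file.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-32b-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload every layer that fits on the GPU
    n_ctx=8192,
)

prompts = [
    "Prove that sqrt(2) is irrational.",
    "Write a bash one-liner that removes duplicate lines from a file.",
]
for p in prompts:
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": p}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"][:200])
```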
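And for the Apple-silicon crowd, a minimal mlx-lm sketch; the repo id is a guess at an mlx-community DWQ conversion, so substitute whichever 4-bit DWQ checkpoint you actually use.

```python
# Quick MLX generation sketch (assumed setup); repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit-DWQ")  # placeholder repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain speculative decoding briefly.",
    max_tokens=256,
    verbose=True,  # prints tokens/sec, handy for the speed comparison
)
print(text)
```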