Qwen3-32B on AMD's 7900XTX
People: Me
Idea: I wanted to see how well the Qwen3-32B model runs on an AMD 7900XTX across different quantization formats and inference backends. Spoiler: AWQ is not the move on this generation of consumer AMD cards.
Details:
- Tested on an Ubuntu 24.04 host with ROCm 7.1.1, using both Ollama and dockerized vllm
- AWQ quants technically work in vllm but are painfully slow: around 5 tokens/sec for a generation with ~700 input tokens and ~70 output tokens (measured along the lines of the throughput sketch after this list)
- Ollama with a Q4_K_M quant hit about 25 tokens/sec, which is respectable
- The winner was Qwen3-32B-autoround-4bit-gptq in vllm at ~35 tok/sec for a single request (loaded roughly as in the vllm sketch after this list)
- Running 3 concurrent requests pushed that to ~40 tok/sec
- Getting vllm running on AMD is still an adventure: I tried building from source with the ROCm Dockerfile, rocm-dev nightlies, AMD TheRock images, and community Docker images from /r/localllama
- Some of those setups couldn't run AWQ at all
- Also gave Vulkan a spin in Ollama for comparison
- Bottom line: if you're on AMD and want decent Qwen3-32B performance, look for autoround GPTQ quants instead of AWQ
- The ROCm ecosystem is getting better but still requires some patience and experimentation
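
For reference, a minimal sketch of loading the GPTQ quant with vllm's offline Python API and generating for a small batch of prompts. The model path, prompts, context length, and sampling settings are illustrative assumptions, not the exact configuration from these tests.

```python
# Minimal sketch: load a 4-bit GPTQ quant of Qwen3-32B with vllm's offline API
# and generate for a small batch of prompts. Model path, prompts, and sampling
# parameters below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-32B-autoround-4bit-gptq",  # local path or HF repo id of the quant
    quantization="gptq",                    # select vllm's GPTQ kernels
    max_model_len=8192,                     # cap context so the KV cache fits in 24 GB of VRAM
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Passing several prompts at once lets vllm batch them internally; this is the
# offline analogue of sending multiple concurrent requests to the server.
prompts = [
    "Summarize the tradeoffs between AWQ and GPTQ quantization.",
    "Explain what ROCm is in two sentences.",
    "List three uses for a 32B-parameter model on a single GPU.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```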
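
The tokens/sec numbers above could be reproduced with something like the following rough sketch against a vllm OpenAI-compatible endpoint. The base URL, model name, prompt, and concurrency values are assumptions for illustration, not a record of the exact harness used here.

```python
# Rough sketch: measure end-to-end generation throughput against a vllm
# OpenAI-compatible server. URL, model name, and prompt are placeholders;
# concurrency=1 mirrors the single-request numbers, concurrency=3 the
# concurrent test.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "Qwen3-32B-autoround-4bit-gptq"
PROMPT = "Replace with a prompt of roughly 700 tokens."

def one_request() -> int:
    """Send one chat completion and return the number of generated tokens."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=70,  # roughly matches the ~70-token outputs above
        temperature=0.7,
    )
    return resp.usage.completion_tokens

def measure(concurrency: int) -> float:
    """Return aggregate output tokens per second across concurrent requests."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        generated = sum(pool.map(lambda _: one_request(), range(concurrency)))
    return generated / (time.time() - start)

print(f"1 request : {measure(1):.1f} tok/sec")
print(f"3 requests: {measure(3):.1f} tok/sec")
```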