Qwen3-32B on AMD's 7900XTX

People: Me

Idea: I wanted to see how well Qwen3-32B runs on an AMD 7900XTX across different quantization formats and inference backends. Spoiler: AWQ is not the move on this generation of consumer AMD cards.

Details:

  • Tested on an Ubuntu 24.04 host with ROCm 7.1.1, using both Ollama and dockerized vLLM
  • AWQ quants technically work in vLLM but are painfully slow: around 5 tokens/sec on a ~700-token prompt with a ~70-token completion
  • Ollama with a Q4_K_M quant hit about 25 tokens/sec, which is respectable (measurement sketch after this list)
  • The winner was Qwen3-32B-autoround-4bit-gptq in vLLM at ~35 tok/sec for a single request (offline-API sketch below)
  • Running 3 concurrent requests, which vLLM batches together, pushed that to ~40 tok/sec (concurrency sketch below)
  • Getting vLLM running on AMD is still an adventure: I tried building from source with the ROCm Dockerfile, rocm-dev nightlies, AMD's TheRock images, and community Docker images from /r/LocalLLaMA
  • Some of those setups couldn't run AWQ at all
  • Also gave Vulkan a spin in Ollama for comparison
  • Bottom line: if you're on AMD and want decent Qwen3-32B performance, look for AutoRound GPTQ quants instead of AWQ
  • The ROCm ecosystem is getting better but still requires some patience and experimentation
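
For reference, here's a minimal sketch of how the Ollama number can be reproduced from the API's own counters. It assumes Ollama on its default port with the Q4_K_M build pulled under the stock qwen3:32b tag (the tag name is an assumption; adjust it to whatever you pulled):

```python
# Minimal sketch: compute decode tokens/sec from Ollama's own counters.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:32b",  # assumed tag for the Q4_K_M quant
        "prompt": "Explain KV-cache quantization in two sentences.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()

# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
tok_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"decode throughput: {tok_per_sec:.1f} tok/sec")
```

This measures pure decode speed and ignores prompt processing, which Ollama reports separately as prompt_eval_count/prompt_eval_duration.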
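
The single-request vLLM number can be sanity-checked with a sketch like the one below, using vLLM's offline Python API instead of the server. The model id, context length, and memory fraction are assumptions sized for a 24 GB card:

```python
# Minimal sketch: time a single generation with a GPTQ-format quant in vLLM.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-32B-autoround-4bit-gptq",  # local path or HF repo id (assumption)
    quantization="gptq",          # AutoRound exports GPTQ-format checkpoints
    max_model_len=8192,           # keep the KV cache modest on 24 GB of VRAM
    gpu_memory_utilization=0.95,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
start = time.time()
outputs = llm.generate(["Summarize the ROCm software stack."], params)
elapsed = time.time() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/sec")
```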
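
And the concurrency sketch: fire N requests at vLLM's OpenAI-compatible server at once and sum completion tokens across streams, since vLLM batches concurrent requests. The endpoint and served model name are assumptions; point them at whatever `vllm serve` is actually hosting:

```python
# Minimal sketch: aggregate tok/sec across concurrent requests to vLLM.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "Qwen3-32B-autoround-4bit-gptq"  # served model name (assumption)
N = 3  # concurrent requests, matching the run above

def one_request(i: int) -> int:
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Write a haiku about GPU {i}."}],
        max_tokens=128,
    )
    return out.usage.completion_tokens  # tokens generated for this request

start = time.time()
with ThreadPoolExecutor(max_workers=N) as pool:
    total_tokens = sum(pool.map(one_request, range(N)))
elapsed = time.time() - start

# Aggregate throughput across all concurrent streams.
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tok/sec")
```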
