Benchmarking multilingual open-source LLMs

People: David

Idea: Based on conversations with Keao about machine-translation benchmarking, I ran a batch of LLMs through the MMLU-ProX (lite) French benchmark (the biology section plus the full 14-topic suite) to see which ones actually deliver the best mix of speed and accuracy in French. This started because we hypothesized that DeepSeek-R1 could be a better translator than OSS-120B, and because we were curious how Mistral's recent models, Magistral and Large 2 (announced alongside Pixtral Large), compared.
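For reference, a minimal sketch of the eval loop. The Hugging Face dataset ID, the column names, and the local OpenAI-compatible endpoint (e.g. a vLLM server) are all assumptions about the setup, not the exact harness used:

```python
# Minimal eval-loop sketch. Dataset ID, column names, and the local
# OpenAI-compatible endpoint are assumptions, not the exact harness used.
import re

from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumed local server
MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # placeholder: whatever the server is running

def ask(question: str, options: list[str]) -> str:
    """Pose one multiple-choice question in French; return the model's letter pick."""
    letters = "ABCDEFGHIJ"[: len(options)]  # MMLU-Pro-style questions have up to 10 options
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    lines.append("Réponds uniquement par la lettre de la bonne option.")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "\n".join(lines)}],
        temperature=0.0,
    )
    match = re.search(r"\b[A-J]\b", resp.choices[0].message.content)
    return match.group(0) if match else ""

# Column names ("category", "options", "answer") assume an MMLU-Pro-style layout.
ds = load_dataset("li-lab/MMLU-ProX-Lite", "fr", split="test")  # dataset ID assumed
bio = [row for row in ds if row["category"] == "biology"]
correct = sum(ask(row["question"], row["options"]) == row["answer"] for row in bio)
print(f"biology accuracy: {correct / len(bio):.1%}")
```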

Details:

  • Qwen3-235B-Instruct-2507 (Q2 quantization) won overall: 88.9% on biology, 80% on the full suite
  • The AWQ-4bit version of Qwen3-Next-80B was shockingly fast and pretty accurate - 12.5 minutes for all 14 topics (~0.89 min/task)
  • Quantization isn't always straightforward - Q2 beat Q4 by 2.8 points and ran 10× faster
  • DeepSeek-R1 tied for top biology score (88.9%) but crawled through the full suite at 22.5 min/task
  • The Mistral models surprisingly didn't reach strong accuracy (59-67% on the full suite), and as dense models they couldn't compete on speed either
  • Deployment stack/hardware matters enormously: the same 80B model ran 10× faster on AWQ than on MLX
  • Biology was the easiest domain (84.5% average), law was brutal (43.2% average)
  • Weirdly, longer runtimes didn't correlate with better scores - sometimes the opposite
  • The Pareto frontier is pretty clear: Llama-4-Scout if you need speed (0.5 min), Qwen3-Next-80B AWQ for balance, Qwen3-235B if you need accuracy (see the dominance-check sketch after this list)
  • Most models cluster around 75-80% on the full suite, but runtime varies wildly (12 min to 5+ hours)
  • If you're picking a model for production, look at the per-task speed - averages hide a lot
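
To make the frontier claim concrete, here's a small sketch of the dominance check over (min/task, accuracy) pairs: a model is on the frontier if no other model is both at least as fast and at least as accurate, with a strict win on one axis. The numbers in the usage example are illustrative placeholders, not the measured results:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """models maps name -> (min_per_task, accuracy); lower time and higher
    accuracy are better. Return the non-dominated model names."""
    frontier = []
    for name, (t, acc) in models.items():
        dominated = any(
            t2 <= t and acc2 >= acc and (t2 < t or acc2 > acc)
            for other, (t2, acc2) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative placeholder numbers, NOT the measured results.
runs = {
    "llama-4-scout": (0.5, 0.72),
    "qwen3-next-80b-awq": (0.89, 0.78),
    "qwen3-235b-q2": (4.0, 0.80),
    "deepseek-r1": (22.5, 0.79),
}
# With these placeholders, deepseek-r1 drops out: it's both slower and less
# accurate than qwen3-235b-q2, so the other three form the frontier.
print(pareto_frontier(runs))
```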

Read more