Unsloth AI releases MTP-optimized GGUF files for Qwen3.6-27B and Qwen3.6-35B-A3B on Hugging Face delivering 1.4 to 2.2 times faster generation
llama.cpp merged native MTP support on May 16 for Qwen3.6 models.
I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.
ICYMI, MTP is a new flavor of speculative decoding built-in to the model itself, that ~2x your tokens per sec for most use cases.
2x generation speed = Truly a game changer. 🔥
How to run it?
brew upgrade llama.cpp # or you might need to install from source until build 9200 is in your package manager: brew install llama.cpp --HEAD
Then pick either the Dense 27B or the 35B A3B MoE.
Personally I tend to stick to the Dense model where I achieve ~30 tok/sec on my machine. The MoE is of course way faster at an impressive ~100 tok/sec on my machine. Truly rapid. ⚡️
In both cases you probably want 48GB or better 64GB RAM or VRAM, though 36GB might work with more strongly-quantized versions.
# Dense:
llama-server -hf ggml-org/Qwen3.6-27B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 2
# MoE:
llama-server -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF --spec-type draft-mtp --spec-draft-n-max 3
Enjoy!

finally faster Qwen3.6 models with MTP support ⚡️
brb updating my Pi & Hermes setup 🤝
llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! https://github.com/ggml-org/llama.cpp/pull/22673
llama.cpp adds MTP for the Qwen3.6 family
This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further.
Special thanks to Aman Gupta for leading this development!