diffuse-cpp goes open source
Today I'm releasing diffuse-cpp under the Apache-2.0 license.
diffuse-cpp is a C++ inference engine for Diffusion Language Models, built on GGML. It supports two models: LLaDA-8B (Llama backbone) and Dream-7B (Qwen2.5 backbone).
Why diffusion on CPU?
Autoregressive LLMs are memory-bound on CPU. Generating each token requires streaming the entire weight matrix from memory, so memory bandwidth, not compute, is the bottleneck. Adding more CPU cores doesn't help because they all share the memory bus.
Diffusion LLMs flip this. They denoise all token positions in parallel, turning each step into matrix-matrix multiplication, which is compute-bound. Thread scaling reaches 7.4x at 12 cores, compared to 2.4x for an autoregressive baseline.
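A toy sketch of the distinction (illustrative only, not diffuse-cpp internals): an autoregressive step is one matrix-vector product per token, reading every weight once per output, while a diffusion step processes all L positions as one matrix-matrix product, reusing each loaded weight L times.

```cpp
#include <vector>
#include <cstddef>

// Toy illustration: W is a d_out x d_in weight matrix (row-major),
// x is one token's activation, X holds a whole sequence of L activations.

// Autoregressive step: one matrix-vector product per generated token.
// Each weight is loaded and used for a single multiply-add: memory-bound.
std::vector<float> matvec(const std::vector<float>& W,
                          const std::vector<float>& x,
                          std::size_t d_out, std::size_t d_in) {
    std::vector<float> y(d_out, 0.0f);
    for (std::size_t i = 0; i < d_out; ++i)
        for (std::size_t j = 0; j < d_in; ++j)
            y[i] += W[i * d_in + j] * x[j];
    return y;
}

// Diffusion step: all L positions at once, a matrix-matrix product.
// Each loaded weight is reused across L columns, so work per byte of
// memory traffic grows with L: compute-bound, and the outer loop over
// output rows can be split across threads without fighting over the bus.
std::vector<float> matmul(const std::vector<float>& W,   // d_out x d_in
                          const std::vector<float>& X,   // d_in x L
                          std::size_t d_out, std::size_t d_in, std::size_t L) {
    std::vector<float> Y(d_out * L, 0.0f);
    for (std::size_t i = 0; i < d_out; ++i)              // parallelizable
        for (std::size_t j = 0; j < d_in; ++j) {
            float w = W[i * d_in + j];                   // loaded once, used L times
            for (std::size_t l = 0; l < L; ++l)
                Y[i * L + l] += w * X[j * L + l];
        }
    return Y;
}
```

The same arithmetic-intensity argument is why batched prefill is fast even for autoregressive models; diffusion generation simply gets that shape for decoding too.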
The numbers
On a 12-core AMD EPYC with Q4_K_M quantization:
The key innovations are entropy_exit (the model decides how many denoising steps it needs, stopping early once its predictions are confident) and an inter-step KV cache (a 1.6x speedup from reusing K and V for positions that stay stable between steps).
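The entropy-exit idea can be sketched as follows. This is a minimal illustration with hypothetical names (`mean_token_entropy`, the threshold `tau`, and the loop's model calls are all assumptions, not the actual diffuse-cpp API): after each denoising step, measure the mean Shannon entropy of the per-position token distributions, and stop once the model is confident everywhere instead of running a fixed step count.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an entropy-based exit criterion (hypothetical names; the
// real entropy_exit heuristic in diffuse-cpp may differ in detail).
// probs: L x V row-major matrix of per-position token probabilities.
// Returns the mean Shannon entropy (in bits) across the L positions.
double mean_token_entropy(const std::vector<double>& probs,
                          std::size_t L, std::size_t V) {
    double total = 0.0;
    for (std::size_t t = 0; t < L; ++t) {
        double h = 0.0;
        for (std::size_t v = 0; v < V; ++v) {
            double p = probs[t * V + v];
            if (p > 0.0) h -= p * std::log2(p);
        }
        total += h;
    }
    return total / static_cast<double>(L);
}

// Inside the denoising loop (illustrative, hypothetical API):
//   for (int step = 0; step < max_steps; ++step) {
//       auto probs = model.forward(tokens);          // per-position softmax
//       if (mean_token_entropy(probs, L, V) < tau)   // tau: exit threshold
//           break;                                   // confident: stop early
//       tokens = unmask_most_confident(probs);       // reveal some positions
//   }
```

A confident (near one-hot) distribution contributes ~0 bits, a uniform one contributes log2(V), so the mean drops toward zero as denoising converges.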
What's next
As far as I know, this is the first C++ inference engine for diffusion LLMs. There's a lot of room to build:
Contributions welcome.