diffuse-cpp goes open source
Today I'm releasing diffuse-cpp under the Apache-2.0 license.
diffuse-cpp is a C++ inference engine for Diffusion Language Models, built on GGML. It supports two models: LLaDA-8B (Llama backbone) and Dream-7B (Qwen2.5 backbone).
Why diffusion on CPU?
Autoregressive LLMs are memory-bound on CPU. Generating each token requires streaming the entire weight matrix from memory, so memory bandwidth, not compute, is the bottleneck. Adding more CPU cores doesn't help because they all share the memory bus.
Diffusion LLMs flip this. They denoise all token positions in parallel, turning each step into matrix-matrix multiplication, which is compute-bound. Thread scaling reaches 7.4x at 12 cores, compared to 2.4x for an autoregressive baseline.
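A toy sketch of the distinction (illustrative only, not diffuse-cpp internals): an autoregressive step is one matrix-vector product per token, reading every weight once per output, while a diffusion step processes all L positions as one matrix-matrix product, reusing each loaded weight L times.

```cpp
#include <vector>
#include <cstddef>

// Toy illustration: W is a d_out x d_in weight matrix (row-major),
// x is one token's activation, X holds a whole sequence of L activations.

// Autoregressive step: one matrix-vector product per generated token.
// Each weight is loaded and used for a single multiply-add: memory-bound.
std::vector<float> matvec(const std::vector<float>& W,
                          const std::vector<float>& x,
                          std::size_t d_out, std::size_t d_in) {
    std::vector<float> y(d_out, 0.0f);
    for (std::size_t i = 0; i < d_out; ++i)
        for (std::size_t j = 0; j < d_in; ++j)
            y[i] += W[i * d_in + j] * x[j];
    return y;
}

// Diffusion step: all L positions at once, a matrix-matrix product.
// Each loaded weight is reused across L columns, so work per byte of
// memory traffic grows with L: compute-bound, and the outer loop over
// output rows can be split across threads without fighting over the bus.
std::vector<float> matmul(const std::vector<float>& W,   // d_out x d_in
                          const std::vector<float>& X,   // d_in x L
                          std::size_t d_out, std::size_t d_in, std::size_t L) {
    std::vector<float> Y(d_out * L, 0.0f);
    for (std::size_t i = 0; i < d_out; ++i)              // parallelizable
        for (std::size_t j = 0; j < d_in; ++j) {
            float w = W[i * d_in + j];                   // loaded once, used L times
            for (std::size_t l = 0; l < L; ++l)
                Y[i * L + l] += w * X[j * L + l];
        }
    return Y;
}
```

The same arithmetic-intensity argument is why batched prefill is fast even for autoregressive models; diffusion generation simply gets that shape for decoding too.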
The numbers
On a 12-core AMD EPYC with Q4_K_M quantization:
The key innovations are entropy_exit (the model decides how many denoising steps it needs, stopping early once its predictions are confident) and an inter-step KV cache (a 1.6x speedup from reusing K and V for positions that stay stable between steps).
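The entropy-exit idea can be sketched as follows. This is a minimal illustration with hypothetical names (`mean_token_entropy`, the threshold `tau`, and the loop's model calls are all assumptions, not the actual diffuse-cpp API): after each denoising step, measure the mean Shannon entropy of the per-position token distributions, and stop once the model is confident everywhere instead of running a fixed step count.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an entropy-based exit criterion (hypothetical names; the
// real entropy_exit heuristic in diffuse-cpp may differ in detail).
// probs: L x V row-major matrix of per-position token probabilities.
// Returns the mean Shannon entropy (in bits) across the L positions.
double mean_token_entropy(const std::vector<double>& probs,
                          std::size_t L, std::size_t V) {
    double total = 0.0;
    for (std::size_t t = 0; t < L; ++t) {
        double h = 0.0;
        for (std::size_t v = 0; v < V; ++v) {
            double p = probs[t * V + v];
            if (p > 0.0) h -= p * std::log2(p);
        }
        total += h;
    }
    return total / static_cast<double>(L);
}

// Inside the denoising loop (illustrative, hypothetical API):
//   for (int step = 0; step < max_steps; ++step) {
//       auto probs = model.forward(tokens);          // per-position softmax
//       if (mean_token_entropy(probs, L, V) < tau)   // tau: exit threshold
//           break;                                   // confident: stop early
//       tokens = unmask_most_confident(probs);       // reveal some positions
//   }
```

A confident (near one-hot) distribution contributes ~0 bits, a uniform one contributes log2(V), so the mean drops toward zero as denoising converges.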
What's next
As far as I know, this is the first C++ inference engine for diffusion LLMs. There's a lot of room to build:
Contributions welcome.