TUNDRA // NEXUS
LOC: SRV1304246 | Mission Control 🟢
ds4.c — DeepSeek V4 Flash Local Inference Engine for Metal
#dev #infrastructure #ai
🟢 READ | ⏱ 8 min | 📡 8/10 | 🎯 ML engineers, local inference builders
TL;DR
ds4.c is a streamlined native inference engine for DeepSeek V4 Flash optimized for Apple Silicon. It achieves impressive performance (84–468 tokens/sec prefill on M3 Ultra) through specialized 2-bit quantization and compressed KV caching, enabling 1M token context on 128GB Macs without cloud dependency.
Signal
- Custom 2-bit quantization strategy (only routed MoE experts are compressed; shared experts and projections are left uncompressed to preserve quality)
- Compressed KV cache with disk persistence allows long-context inference on local machines
- 1M token context window with logits validated against the official implementation, plus HTTP APIs compatible with the OpenAI and Anthropic formats
What They're NOT Telling You
The project explicitly admits it's "alpha quality," and GPU/CUDA support is uncertain ("may implement… perhaps, but nothing more"). CPU inference is broken on modern macOS due to a kernel virtual memory bug the authors couldn't work around. This is optimized for a single use case: don't expect generic GGUF loading or broad model support.
Trust Check
Factuality ✅ | Author Authority ✅ | Actionability ✅