Running DeepSeek-V4-Flash at 700 tokens/s on 2x RTX Pro 6000
Run DeepSeek-V4-Flash on a 2x RTX Pro 6000 (96GB each) workstation using the voipmonitor/vllm:lucifer Docker image, a Blackwell-targeted vLLM fork with sm_120 kernels, FP8 KV cache, and MTP speculative decoding.
DeepSeek-V4-Flash is an open-weights (MIT-licensed) model from the Chinese lab DeepSeek, released in April 2026. It is a Mixture-of-Experts model with 284B total parameters but only 13B active per token, and it supports a 1M-token context. Despite the small active footprint, it punches well above its weight: it scores around 40 on the Artificial Analysis Intelligence Index, versus a median of about 24 for comparable open-weights models. That puts it level with GPT-5.4 mini (xhigh) and ahead of frontier-lab models like Grok 4.3 (high) and Claude 4.5 Haiku, all while running locally on two GPUs.
On a single 2-GPU workstation (2x RTX Pro 6000, TP2), DeepSeek-V4-Flash serves at about 210 tokens/sec on a single stream and scales to roughly 700 tokens/sec aggregate across 10 concurrent requests, with sub-second time-to-first-token all the way to 10 streams and prefill saturating near 10,000 tokens/sec. These are the original DeepSeek-V4-Flash weights, not a quantized version. The full benchmark is at the end of this article.
I got these numbers with voipmonitor/vllm:lucifer, a Blackwell-targeted fork of vLLM built from the lucifer branch of local-inference-lab/vllm. The image ships a CUDA 13.2 / PyTorch 2.12 stack with pinned FlashInfer, DeepGEMM, CUTLASS, and a patched NCCL, all compiled for sm_120a, the compute capability of the RTX Pro 6000 (Blackwell).
The steps below are based on the original reference notes in local-inference-lab/rtx6kpro, specifically its Standard Lucifer Cutlass path. That page also documents a faster B12X build and full benchmark numbers if you want to go deeper.
What the fork changes for the RTX Pro 6000
- sm_120 kernels. SPARSE_MLA_SM120 attention and a FlashInfer/CUTLASS MoE path (flashinfer_cutlass) are compiled for Blackwell rather than running from generic PTX.
- FP8 where it matters. FP8 KV cache and DeepGEMM UE8M0 FP8 GEMMs fit the 262K context and improve throughput within 2x 96 GB.
- MTP speculative decoding. DeepSeek’s multi-token-prediction head gives a single-stream decode speedup.
- PCIe-aware TP2. NCCL is tuned (NCCL_P2P_LEVEL=SYS) for the no-NVLink, PCIe-host-bridge topology of these workstation cards.
- Cached JIT/autotune. FlashInfer autotune, DeepGEMM warmup, and CUDA-graph capture are written to a /cache volume, so the ~5-minute first-run warmup happens once.
Hardware and assumptions
GPU: 2x NVIDIA RTX Pro 6000 (96 GB VRAM each, sm_120, no NVLink)
NVIDIA driver: recent, CUDA 13 capable
Docker: with the NVIDIA container runtime
Disk: ~160 GB free
The server runs tensor-parallel across both GPUs (TP2).
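As a back-of-envelope check on why this fits, the per-GPU memory budget can be sketched as below. It assumes the ~160 GB download approximates the resident weight size, that TP2 splits the weights evenly across both cards, and that the 0.90 value is the `--gpu-memory-utilization` setting from the launch command in step 3. The function name is hypothetical and the result is an estimate, not a measurement.

```python
def per_gpu_headroom_gb(vram_gb: float, utilization: float,
                        weights_gb: float, tp_size: int) -> float:
    """Rough memory left per GPU for KV cache, activations, and CUDA graphs.

    Assumes weights are split evenly across tensor-parallel ranks, which is
    an approximation (some tensors may be replicated on every rank).
    """
    budget = vram_gb * utilization          # share of VRAM the server may use
    weights_per_gpu = weights_gb / tp_size  # even TP split (assumption)
    return budget - weights_per_gpu


headroom = per_gpu_headroom_gb(vram_gb=96, utilization=0.90,
                               weights_gb=160, tp_size=2)
print(f"~{headroom:.1f} GB per GPU left after weights")
```

Under these assumptions only a few GB per GPU remain once the weights are loaded, which is consistent with the fork's emphasis on an FP8 KV cache for the 262K context.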
1. Pull the runtime image
docker pull voipmonitor/vllm:lucifer
2. Download the model (~160 GB)
pip install -U "huggingface_hub[cli]"
hf download deepseek-ai/DeepSeek-V4-Flash \
--local-dir /models/DeepSeek-V4-Flash
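Before starting the server it can be worth confirming that the download completed. The sketch below uses only the standard library to check for a config.json and to total the size of the weight shards. The function name is hypothetical, and the layout it expects (config.json plus *.safetensors files) is an assumption about this repository rather than something stated above.

```python
from pathlib import Path


def summarize_checkpoint(model_dir: str) -> dict:
    """Return basic facts about a downloaded checkpoint directory.

    Assumes a Hugging Face style layout: config.json plus *.safetensors
    shards (an assumption; adjust if the repo is laid out differently).
    """
    root = Path(model_dir)
    if not root.is_dir():
        return {"has_config": False, "num_shards": 0, "total_gb": 0.0}
    shards = sorted(root.rglob("*.safetensors"))
    total_bytes = sum(p.stat().st_size for p in shards)
    return {
        "has_config": (root / "config.json").is_file(),
        "num_shards": len(shards),
        "total_gb": total_bytes / 1e9,  # decimal gigabytes
    }


info = summarize_checkpoint("/models/DeepSeek-V4-Flash")
print(info)
if not info["has_config"] or info["num_shards"] == 0:
    print("Download looks incomplete; consider re-running hf download.")
```

A total well below the expected ~160 GB suggests an interrupted download; re-running the same `hf download` command should pick up the missing files.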
3. Start the server
This runs in the foreground. Ctrl+C stops and removes the container. Port 8000 is exposed on the host.
docker run --rm -it --init --name ds4 \
--gpus all --runtime nvidia --ipc host --shm-size 32g --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v /models:/models \
-v /root/.cache/lucifer:/cache \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e CUTE_DSL_ARCH=sm_120a \
-e HF_HUB_OFFLINE=1 \
-e NCCL_P2P_LEVEL=SYS -e NCCL_PROTO=LL,LL128,Simple -e NCCL_IB_DISABLE=1 \
voipmonitor/vllm:lucifer \
/bin/bash -lc 'unset NCCL_GRAPH_FILE NCCL_GRAPH_DUMP_FILE VLLM_CACHE_DIR; \
exec vllm serve /models/DeepSeek-V4-Flash \
--served-model-name DeepSeek-V4-Flash --trust-remote-code \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 --block-size 256 \
--gpu-memory-utilization 0.90 \
--max-model-len 262144 --max-num-seqs 64 --max-num-batched-tokens 8192 \
--max-cudagraph-capture-size 192 \
--compilation-config="{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\",\"custom_ops\":[\"all\"]}" \
--async-scheduling --no-scheduler-reserve-full-isl \
--enable-chunked-prefill --enable-prefix-caching --enable-flashinfer-autotune \
--attention-backend SPARSE_MLA_SM120 \
--kernel-config.moe_backend flashinfer_cutlass \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--default-chat-template-kwargs.thinking=true \
--default-chat-template-kwargs.reasoning_effort=high \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 2 \
--speculative-config.draft_sample_method probabilistic'
The first launch JIT-compiles kernels (~5-6 min) into /cache. Reuse the same /cache volume so restarts are fast.
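Because the first launch can take several minutes, scripts that depend on the server may want to wait until it answers. Here is a minimal sketch that polls the /v1/models endpoint until it returns HTTP 200. The function names, the 15-minute default timeout, and the 5-second interval are illustrative choices, not part of the image.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, HTTP error, ...


def wait_until_ready(url: str, timeout_s: float = 900.0,
                     interval_s: float = 5.0, probe=http_ok) -> bool:
    """Poll url until probe(url) succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (commented out so that importing this file does not block):
# wait_until_ready("http://localhost:8000/v1/models")
```

The `probe` parameter exists only so the loop can be exercised without a live server; in normal use the default is enough.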
4. Test
curl -s http://localhost:8000/v1/models | jq
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"DeepSeek-V4-Flash","messages":[{"role":"user","content":"Hello"}],"max_tokens":64}' | jq
The endpoint is OpenAI-compatible at http://<host>:8000/v1.
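The same request can be made from Python with only the standard library. This sketch mirrors the curl call above and adds a small helper for parsing streamed `data:` lines. The helper names are hypothetical, and the `delta.content` layout is assumed from the OpenAI-style streaming format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 64,
                       stream: bool = False) -> dict:
    """Build the JSON body for /v1/chat/completions."""
    return {
        "model": "DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }


def parse_sse_line(raw: bytes):
    """Return the text delta carried by one streamed line, or None."""
    line = raw.decode("utf-8", "ignore").strip()
    if not line.startswith("data:"):
        return None
    payload = line[5:].strip()
    if payload == "[DONE]":
        return None
    try:
        choices = json.loads(payload).get("choices") or []
    except json.JSONDecodeError:
        return None
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def chat(prompt: str) -> str:
    """Send one non-streaming request and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"] or ""
```

Calling `chat("Hello")` against the running server returns the reply text. The launch command enables thinking by default, so a small `max_tokens` may be spent largely on reasoning. The benchmark script at the end of the article shows how to turn thinking off per request with `chat_template_kwargs`.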
Benchmark on 2x RTX Pro 6000 (TP2)
I benchmarked the live server through the OpenAI streaming API, not the raw engine. Each request uses a unique ~889-token prompt to defeat prefix caching, forces 256 generated tokens with ignore_eos, runs with MTP speculative decoding on, FP8 KV cache, and thinking off. Streaming lets me separate time-to-first-token (prefill) from the decode rate. I warm up every concurrency level first so the CUDA-graph capture cost is not counted in the measured numbers, then sweep 1 to 10 concurrent streams.
The chart below plots aggregate throughput as concurrency rises. Prefill and decode use separate Y axes because their scales differ by more than 10x.
| Streams | TTFT (s) | Prefill /stream (tok/s) | Prefill aggregate (tok/s) | Decode /stream (tok/s) | Decode aggregate (tok/s) |
|---|---|---|---|---|---|
| 1 | 0.10 | 8,471 | 8,471 | 209.9 | 209.9 |
| 2 | 0.19 | 4,862 | 9,713 | 167.5 | 313.5 |
| 3 | 0.21 | 5,162 | 8,852 | 139.5 | 374.7 |
| 4 | 0.27 | 4,264 | 9,224 | 105.9 | 387.6 |
| 5 | 0.37 | 3,189 | 10,213 | 108.4 | 465.2 |
| 6 | 0.42 | 3,026 | 9,948 | 113.2 | 580.7 |
| 7 | 0.49 | 2,742 | 10,149 | 95.9 | 566.4 |
| 8 | 0.62 | 2,092 | 10,357 | 104.0 | 668.7 |
| 9 | 0.70 | 1,850 | 10,411 | 87.5 | 635.3 |
| 10 | 0.71 | 2,035 | 10,465 | 87.7 | 697.3 |
Prompt ~889 tokens/request, 256 generated tokens/request, thinking off, ignore_eos.
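To make the column definitions concrete, here is a small sketch of how the per-stream and aggregate rates can be derived from per-request timestamps. It follows the same definitions as the benchmark script at the end of the article. The `Sample` type and the two-request example values are made up for illustration and are not measured data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t_start: float      # request sent (seconds)
    t_first: float      # first streamed token received
    t_last: float       # last streamed token received
    prompt_tokens: int
    gen_tokens: int


def summarize(samples):
    n = len(samples)
    ttft = [s.t_first - s.t_start for s in samples]
    # Per-stream rates: compute each request's rate, then average.
    prefill_per_stream = sum(
        s.prompt_tokens / t for s, t in zip(samples, ttft)) / n
    decode_per_stream = sum(
        s.gen_tokens / (s.t_last - s.t_first) for s in samples) / n
    # Aggregate rates: total tokens divided by the wall-clock window.
    prefill_window = (max(s.t_first for s in samples)
                      - min(s.t_start for s in samples))
    decode_window = (max(s.t_last for s in samples)
                     - min(s.t_first for s in samples))
    return {
        "mean_ttft": sum(ttft) / n,
        "prefill_per_stream": prefill_per_stream,
        "prefill_aggregate": sum(s.prompt_tokens for s in samples) / prefill_window,
        "decode_per_stream": decode_per_stream,
        "decode_aggregate": sum(s.gen_tokens for s in samples) / decode_window,
    }


# Two hypothetical concurrent requests (illustrative numbers only).
demo = [
    Sample(t_start=0.0, t_first=0.20, t_last=2.20, prompt_tokens=900, gen_tokens=256),
    Sample(t_start=0.0, t_first=0.25, t_last=2.25, prompt_tokens=900, gen_tokens=256),
]
stats = summarize(demo)
print(stats)
```

Because the aggregate decode window runs from the earliest first token to the latest last token, staggered streams stretch it. That is one reason the aggregate column is lower than the per-stream rate multiplied by the stream count.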
Takeaways
- Prefill saturates the two GPUs at roughly 10,000 tok/s aggregate from about 5 concurrent streams onward, and TTFT stays under a second through 10 streams.
- Decode scales sub-linearly: ~210 tok/s single-stream up to ~697 tok/s aggregate at 10 streams (about 3.3x). Per-stream decode degrades gracefully from 210 to ~88 tok/s as load increases.
- Single-stream decode (210 tok/s) lines up with the reference Lucifer TP2 MTP-probabilistic figure (~207 tok/s), so the box is performing as expected.
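The scaling figures in these takeaways can be reproduced from the table with a few lines. The data is copied from the Decode aggregate column; the "efficiency" definition (speedup divided by stream count) is my own framing for illustration.

```python
# Decode aggregate (tok/s) by stream count, copied from the benchmark table.
DECODE_AGG = {1: 209.9, 2: 313.5, 3: 374.7, 4: 387.6, 5: 465.2,
              6: 580.7, 7: 566.4, 8: 668.7, 9: 635.3, 10: 697.3}


def scaling(decode_agg):
    """Return {streams: (speedup vs 1 stream, speedup / streams)}."""
    base = decode_agg[1]
    return {n: (rate / base, rate / base / n)
            for n, rate in decode_agg.items()}


for n, (speedup, efficiency) in scaling(DECODE_AGG).items():
    print(f"{n:>2} streams: {speedup:4.2f}x speedup, {efficiency:6.1%} efficiency")
```

At 10 streams this gives a speedup of about 3.3x over a single stream, matching the figure quoted above.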
These numbers are end-to-end over the streaming API, so they are slightly conservative compared to the raw-engine, generation-only benchmarks in the reference notes. Those go up to 64 concurrent streams and include the faster B12X build, so the headline aggregate figures there are higher and not directly comparable to this 1-to-10 stream, Lucifer-image run. Decode aggregate also has mild run-to-run noise (about 5%).
Benchmark script bench_ds4.py
import json, time, random, string, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "DeepSeek-V4-Flash"
GEN_TOKENS = 256      # forced generation length per stream
PROMPT_WORDS = 700    # ~900-1000 prompt tokens, unique per request
WORDS = ["alpha","river","copper","lunar","quartz","meadow","cipher","tangent","ember","willow",
         "harbor","nimbus","pixel","cobalt","syntax","fathom","zephyr","granite","oracle","velvet"]

def make_prompt():
    # A random nonce at the start makes every prompt unique, defeating prefix caching.
    nonce = "".join(random.choices(string.ascii_lowercase, k=12))
    body = " ".join(random.choice(WORDS) for _ in range(PROMPT_WORDS))
    return f"[{nonce}] Read this token list then write a long neutral description. {body}"

def one_request():
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": make_prompt()}],
        "max_tokens": GEN_TOKENS,
        "temperature": 0.7,
        "ignore_eos": True,
        "stream": True,
        "stream_options": {"include_usage": True},
        "chat_template_kwargs": {"thinking": False},
    }
    data = json.dumps(payload).encode()
    req = urllib.request.Request(URL, data=data, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    t_first = None; t_last = None
    prompt_tokens = gen_tokens = 0
    with urllib.request.urlopen(req, timeout=600) as resp:
        for raw in resp:
            line = raw.decode("utf-8", "ignore").strip()
            if not line.startswith("data:"):
                continue
            p = line[5:].strip()
            if p == "[DONE]":
                break
            try:
                d = json.loads(p)
            except Exception:
                continue
            ch = d.get("choices") or []
            if ch:
                delta = ch[0].get("delta", {})
                txt = delta.get("content") or delta.get("reasoning") or delta.get("reasoning_content")
                if txt:
                    # Record the arrival time of the first and the latest text chunk.
                    now = time.perf_counter()
                    if t_first is None:
                        t_first = now
                    t_last = now
            # The final chunk carries token usage when include_usage is set.
            u = d.get("usage")
            if u:
                prompt_tokens = u.get("prompt_tokens", prompt_tokens)
                gen_tokens = u.get("completion_tokens", gen_tokens)
    if t_first is None:
        t_first = t_last = time.perf_counter()
    return {
        "ttft": t_first - t0,
        "decode_time": max(t_last - t_first, 1e-6),
        "prompt_tokens": prompt_tokens,
        "gen_tokens": gen_tokens,
        "t_first": t_first,
        "t_last": t_last,
    }

def run_level(n):
    # Fire n requests at once and wait for all of them to finish.
    with ThreadPoolExecutor(max_workers=n) as ex:
        results = list(ex.map(lambda _: one_request(), range(n)))
    mean_ttft = sum(r["ttft"] for r in results) / n
    mean_prefill = sum(r["prompt_tokens"] / r["ttft"] for r in results) / n
    mean_decode = sum(r["gen_tokens"] / r["decode_time"] for r in results) / n
    # Aggregate decode: all generated tokens over the first-token-to-last-token window.
    win = max(r["t_last"] for r in results) - min(r["t_first"] for r in results)
    win = max(win, 1e-6)
    agg_decode = sum(r["gen_tokens"] for r in results) / win
    # Aggregate prefill: all prompt tokens over the earliest-send-to-latest-first-token window.
    pf_win = max(r["t_first"] for r in results) - min(r["t_first"] - r["ttft"] for r in results)
    agg_prefill = sum(r["prompt_tokens"] for r in results) / max(pf_win, 1e-6)
    return mean_ttft, mean_prefill, agg_prefill, mean_decode, agg_decode, results[0]["prompt_tokens"]

print("Warming up all concurrency levels 1..10 (capturing CUDA graphs)...", flush=True)
for n in range(1, 11):
    run_level(n)

print(f"{'N':>2} | {'TTFT(s)':>8} | {'prefill/str':>11} | {'prefill agg':>11} | {'decode/str':>10} | {'decode agg':>10}")
print("-" * 74)
for n in range(1, 11):
    mt, mp, ap, md, ad, ptok = run_level(n)
    # Column widths match the header above.
    print(f"{n:>2} | {mt:8.2f} | {mp:11.0f} | {ap:11.0f} | {md:10.1f} | {ad:10.1f}")
print(f"\n(prompt ~{ptok} tokens/req, {GEN_TOKENS} generated tokens/req, thinking off, ignore_eos)")