Running DeepSeek-V4-Flash at 700 tokens/s on 2x RTX Pro 6000

Run DeepSeek-V4-Flash on a 2x RTX Pro 6000 (96GB each) workstation using the voipmonitor/vllm:lucifer Docker image, a Blackwell-targeted vLLM fork with sm_120 kernels, FP8 KV cache, and MTP speculative decoding.

Posted Jun 20, 2026 Updated Jun 20, 2026

By Ovidiu Dan

DeepSeek-V4-Flash is an open-weights (MIT-licensed) model from the Chinese lab DeepSeek, released in April 2026. It is a Mixture-of-Experts model with 284B total parameters but only 13B active per token, and it supports a 1M-token context. Despite the small active footprint, it punches well above its weight class: it scores around 40 on the Artificial Analysis Intelligence Index, versus a median of about 24 for comparable open-weights models. That puts it level with GPT-5.4 mini (xhigh) and ahead of frontier-lab models like Grok 4.3 (high) and Claude 4.5 Haiku, all while running locally on two GPUs.

On a single 2-GPU workstation (2x RTX Pro 6000, TP2), DeepSeek-V4-Flash serves about 210 tokens/sec on a single stream and scales to roughly 700 tokens/sec aggregate across 10 concurrent requests, with sub-second time-to-first-token all the way to 10 streams and prefill saturating near 10,000 tokens/sec. These are the original DeepSeek-V4-Flash weights, not a quantized version. The full benchmark is at the end of this article.

I got these numbers with voipmonitor/vllm:lucifer, a Blackwell-targeted fork of vLLM built from the lucifer branch of local-inference-lab/vllm. The image ships a CUDA 13.2 / PyTorch 2.12 stack with pinned FlashInfer, DeepGEMM, CUTLASS, and a patched NCCL, all compiled for sm_120a, the compute capability of the RTX Pro 6000 (Blackwell).
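Because the image is compiled specifically for sm_120a, it is worth confirming that both cards report compute capability 12.0 before pulling anything. The sketch below is one way to do that, assuming a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; the helper names are mine, not part of the image.

```python
import subprocess

def parse_compute_caps(csv_text):
    """Parse `name, compute_cap` CSV rows into (name, (major, minor)) tuples."""
    gpus = []
    for row in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas survive.
        name, cap = (field.strip() for field in row.rsplit(",", 1))
        major, minor = cap.split(".")
        gpus.append((name, (int(major), int(minor))))
    return gpus

def all_sm120(gpus):
    """True when at least one GPU is listed and every GPU reports 12.0."""
    return bool(gpus) and all(cap == (12, 0) for _, cap in gpus)

def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_caps(out)
```

On the workstation, `all_sm120(query_gpus())` should return `True`. If it does not, the sm_120a kernels in this image are unlikely to load.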

The steps below are based on the original reference notes in local-inference-lab/rtx6kpro, specifically its Standard Lucifer Cutlass path. That page also documents a faster B12X build and full benchmark numbers if you want to go deeper.

What the fork changes for the RTX Pro 6000

sm_120 kernels. SPARSE_MLA_SM120 attention and a FlashInfer/CUTLASS MoE path (flashinfer_cutlass) are compiled for Blackwell rather than running from generic PTX.
FP8 where it matters. An FP8 KV cache and DeepGEMM UE8M0 FP8 GEMMs let the 262K context fit within 2x 96 GB and improve throughput.
MTP speculative decoding. DeepSeek’s multi-token-prediction head gives a single-stream decode speedup.
PCIe-aware TP2. NCCL is tuned (NCCL_P2P_LEVEL=SYS) for the no-NVLink, PCIe-host-bridge topology of these workstation cards.
Cached JIT/autotune. FlashInfer autotune, DeepGEMM warmup, and CUDA-graph capture are written to a /cache volume, so the ~5-minute first-run warmup happens once.
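For intuition on why MTP helps single-stream decode, here is a rough back-of-the-envelope model rather than a measurement. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The acceptance rates are hypothetical; I did not measure them on this setup.

```python
def expected_tokens_per_step(acceptance, num_speculative_tokens):
    """Expected tokens emitted per target-model step.

    Assumes each drafted token is accepted independently with probability
    `acceptance` and that the first rejection discards the remaining drafts.
    The target model always contributes one token, hence the i = 0 term.
    """
    return sum(acceptance ** i for i in range(num_speculative_tokens + 1))

# Hypothetical acceptance rates with 2 speculative tokens, as in the launch command.
for a in (0.5, 0.7, 0.9):
    print(f"acceptance={a:.1f}: {expected_tokens_per_step(a, 2):.2f} tokens/step")
```

This is an upper bound on the decode speedup, since it ignores the cost of running the draft head and verifying the drafts.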

Hardware and assumptions

GPU: 2x NVIDIA RTX Pro 6000 (96 GB VRAM each, sm_120, no NVLink)
NVIDIA driver: recent, CUDA 13 capable
Docker: with the NVIDIA container runtime
Disk: ~160 GB free

The server runs tensor-parallel across both GPUs (TP2).
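To see how tight the fit is, here is a rough memory budget using only numbers from this article. It is a sketch under loud assumptions: it treats the ~160 GB download as the resident weight size, ignores the GB versus GiB distinction, and uses the 0.90 value passed to `--gpu-memory-utilization` in step 3.

```python
def memory_budget(gpus=2, vram_per_gpu_gb=96.0, utilization=0.90,
                  weights_gb=160.0, total_params_b=284.0):
    """Rough VRAM budget; every input is an approximation from the article."""
    total_vram = gpus * vram_per_gpu_gb
    usable = total_vram * utilization  # what vLLM is allowed to claim
    return {
        "total_vram_gb": total_vram,
        "usable_gb": usable,
        # Whatever remains must hold KV cache, activations, and CUDA graphs.
        "left_after_weights_gb": usable - weights_gb,
        # GB divided by billions of parameters gives bytes per parameter.
        "bytes_per_param": weights_gb / total_params_b,
    }

for key, value in memory_budget().items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions only about 13 GB is left after the weights, which is consistent with the FP8 KV cache being needed to fit a long context on this box.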

1. Pull the runtime image

docker pull voipmonitor/vllm:lucifer

2. Download the model (~160 GB)

  
pip install -U "huggingface_hub[cli]"
hf download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /models/DeepSeek-V4-Flash
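A 160 GB download can be interrupted, so a quick completeness check before launching saves a confusing startup failure. The sketch below assumes the repository follows the usual Hugging Face sharded layout, with a `model.safetensors.index.json` file whose `weight_map` lists every shard. I have not confirmed the exact file names for this model, so adjust `index_name` if they differ.

```python
import json
from pathlib import Path

def missing_shards(model_dir, index_name="model.safetensors.index.json"):
    """Return shard filenames listed in the index but absent on disk."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / index_name).read_text())
    # weight_map maps tensor names to shard files; many tensors share a shard.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (model_dir / name).is_file()]
```

Calling `missing_shards("/models/DeepSeek-V4-Flash")` should return an empty list. Rerunning the same `hf download` command should fetch any files it reports.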

3. Start the server

This runs in the foreground. Ctrl+C stops and removes the container. Port 8000 is exposed on the host.

  
docker run --rm -it --init --name ds4 \
  --gpus all --runtime nvidia --ipc host --shm-size 32g --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /models:/models \
  -v /root/.cache/lucifer:/cache \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e CUTE_DSL_ARCH=sm_120a \
  -e HF_HUB_OFFLINE=1 \
  -e NCCL_P2P_LEVEL=SYS -e NCCL_PROTO=LL,LL128,Simple -e NCCL_IB_DISABLE=1 \
  voipmonitor/vllm:lucifer \
  /bin/bash -lc 'unset NCCL_GRAPH_FILE NCCL_GRAPH_DUMP_FILE VLLM_CACHE_DIR; \
  exec vllm serve /models/DeepSeek-V4-Flash \
    --served-model-name DeepSeek-V4-Flash --trust-remote-code \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8 --block-size 256 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 262144 --max-num-seqs 64 --max-num-batched-tokens 8192 \
    --max-cudagraph-capture-size 192 \
    --compilation-config="{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\",\"custom_ops\":[\"all\"]}" \
    --async-scheduling --no-scheduler-reserve-full-isl \
    --enable-chunked-prefill --enable-prefix-caching --enable-flashinfer-autotune \
    --attention-backend SPARSE_MLA_SM120 \
    --kernel-config.moe_backend flashinfer_cutlass \
    --tokenizer-mode deepseek_v4 \
    --reasoning-parser deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
    --default-chat-template-kwargs.thinking=true \
    --default-chat-template-kwargs.reasoning_effort=high \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 2 \
    --speculative-config.draft_sample_method probabilistic'

The first launch JIT-compiles kernels (~5-6 min) into /cache. Reuse the same /cache volume so restarts are fast.
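Since the first start can take several minutes, scripts that depend on the server need to wait for it. A minimal sketch, assuming the `/v1/models` route used in step 4 only answers with HTTP 200 once the engine is up; the function name and defaults are my own.

```python
import time
import urllib.request

def wait_until_ready(url="http://localhost:8000/v1/models",
                     timeout_s=900.0, interval_s=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or timed out: server still starting
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

The 15-minute default leaves headroom over the ~5-6 minute first-run compile. On later starts with a warm /cache volume it should return much sooner.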

4. Test

  
curl -s http://localhost:8000/v1/models | jq
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"DeepSeek-V4-Flash","messages":[{"role":"user","content":"Hello"}],"max_tokens":64}' | jq

The endpoint is OpenAI-compatible at http://<host>:8000/v1.
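The same request from Python, using only the standard library. This is a sketch rather than a full client: the helper names are mine, and the optional `chat_template_kwargs` field is the same one the benchmark script at the end of this article uses to turn thinking off per request.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="DeepSeek-V4-Flash", max_tokens=64, thinking=None):
    """Build a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if thinking is not None:
        # Overrides the server-side default set at launch.
        payload["chat_template_kwargs"] = {"thinking": thinking}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("Hello", thinking=False)` returns the reply text. With thinking left on and a small `max_tokens`, the content may come back empty because the budget is spent on reasoning.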

Benchmark on 2x RTX Pro 6000 (TP2)

I benchmarked the live server through the OpenAI streaming API, not the raw engine. Each request uses a unique ~889-token prompt to defeat prefix caching and forces 256 generated tokens with ignore_eos. The server runs with MTP speculative decoding on and FP8 KV cache, and each request turns thinking off. Streaming lets me separate time-to-first-token (prefill) from the decode rate. I warm up every concurrency level first so the CUDA-graph capture cost is not counted in the measured numbers, then sweep 1 to 10 concurrent streams.
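The per-stream metrics reduce to three timestamps and two token counts. The sketch below mirrors the definitions used in the benchmark script at the end of this article. The timestamps in the example are illustrative values I picked to resemble the single-stream row, not logged data.

```python
def stream_metrics(t_start, t_first, t_last, prompt_tokens, gen_tokens):
    """Split one streamed request into prefill and decode rates."""
    ttft = t_first - t_start                    # time to first streamed token
    decode_time = max(t_last - t_first, 1e-6)   # first to last streamed token
    return {
        "ttft_s": ttft,
        "prefill_tok_s": prompt_tokens / ttft,
        "decode_tok_s": gen_tokens / decode_time,
    }

# Illustrative timestamps in seconds (hypothetical, not from a real run).
m = stream_metrics(t_start=0.0, t_first=0.105, t_last=1.325,
                   prompt_tokens=889, gen_tokens=256)
print({k: round(v, 1) for k, v in m.items()})
```

Prefill rate here is simply prompt tokens divided by TTFT, so it also includes network and scheduling overhead. That is one reason end-to-end numbers come out slightly lower than raw-engine figures.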

The chart below plots aggregate throughput as concurrency rises. Prefill and decode use separate Y axes because their scales differ by more than 10x.

Streams	TTFT (s)	Prefill /stream (tok/s)	Prefill aggregate (tok/s)	Decode /stream (tok/s)	Decode aggregate (tok/s)
1	0.10	8,471	8,471	209.9	209.9
2	0.19	4,862	9,713	167.5	313.5
3	0.21	5,162	8,852	139.5	374.7
4	0.27	4,264	9,224	105.9	387.6
5	0.37	3,189	10,213	108.4	465.2
6	0.42	3,026	9,948	113.2	580.7
7	0.49	2,742	10,149	95.9	566.4
8	0.62	2,092	10,357	104.0	668.7
9	0.70	1,850	10,411	87.5	635.3
10	0.71	2,035	10,465	87.7	697.3

Prompt ~889 tokens/request, 256 generated tokens/request, thinking off, ignore_eos.

Takeaways

Prefill saturates the two GPUs at roughly 10,000 tok/s aggregate from about 5 concurrent streams onward, and TTFT stays under a second through 10 streams.
Decode scales sub-linearly: ~210 tok/s single-stream up to ~697 tok/s aggregate at 10 streams (about 3.3x). Per-stream decode degrades gracefully from 210 to ~88 tok/s as load increases.
Single-stream decode (210 tok/s) lines up with the reference Lucifer TP2 MTP-probabilistic figure (~207 tok/s), so the box is performing as expected.
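The scaling figures in these takeaways can be recomputed directly from the decode columns of the table. The function below is only a small convenience for that arithmetic.

```python
# Decode columns copied from the benchmark table:
# streams -> (per-stream tok/s, aggregate tok/s)
DECODE = {
    1: (209.9, 209.9), 2: (167.5, 313.5), 3: (139.5, 374.7), 4: (105.9, 387.6),
    5: (108.4, 465.2), 6: (113.2, 580.7), 7: (95.9, 566.4), 8: (104.0, 668.7),
    9: (87.5, 635.3), 10: (87.7, 697.3),
}

def scaling(decode):
    """Return streams -> (speedup over 1 stream, fraction of linear scaling)."""
    base = decode[1][1]
    return {n: (agg / base, agg / (n * base)) for n, (_, agg) in decode.items()}

for n, (speedup, efficiency) in scaling(DECODE).items():
    print(f"{n:>2} streams: {speedup:.2f}x speedup, {efficiency:.0%} of linear")
```

At 10 streams, 10 x 87.7 tok/s would be 877 tok/s, more than the 697.3 tok/s aggregate. The likely reason is that the script measures aggregate decode over the window from the earliest first token to the latest last token, while each per-stream rate uses only that stream's own window.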

These numbers are end-to-end over the streaming API, so they are slightly conservative compared to the raw-engine, generation-only benchmarks in the reference notes. Those go up to 64 concurrent streams and include the faster B12X build, so the headline aggregate figures there are higher and not directly comparable to this 1-to-10 stream, Lucifer-image run. Decode aggregate also has mild run-to-run noise (about 5%).

Benchmark script bench_ds4.py

  
import json, time, random, string, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "DeepSeek-V4-Flash"
GEN_TOKENS = 256          # forced generation length per stream
PROMPT_WORDS = 700        # ~890 prompt tokens as measured in this run, unique per request
WORDS = ["alpha","river","copper","lunar","quartz","meadow","cipher","tangent","ember","willow",
         "harbor","nimbus","pixel","cobalt","syntax","fathom","zephyr","granite","oracle","velvet"]

def make_prompt():
    nonce = "".join(random.choices(string.ascii_lowercase, k=12))
    body = " ".join(random.choice(WORDS) for _ in range(PROMPT_WORDS))
    return f"[{nonce}] Read this token list then write a long neutral description. {body}"

def one_request():
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": make_prompt()}],
        "max_tokens": GEN_TOKENS,
        "temperature": 0.7,
        "ignore_eos": True,
        "stream": True,
        "stream_options": {"include_usage": True},
        "chat_template_kwargs": {"thinking": False},
    }
    data = json.dumps(payload).encode()
    req = urllib.request.Request(URL, data=data, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    t_first = None  # timestamp of the first streamed token
    t_last = None   # timestamp of the most recent streamed token
    prompt_tokens = gen_tokens = 0
    with urllib.request.urlopen(req, timeout=600) as resp:
        for raw in resp:
            line = raw.decode("utf-8", "ignore").strip()
            if not line.startswith("data:"):
                continue
            p = line[5:].strip()
            if p == "[DONE]":
                break
            try:
                d = json.loads(p)
            except json.JSONDecodeError:
                continue
            ch = d.get("choices") or []
            if ch:
                delta = ch[0].get("delta", {})
                txt = delta.get("content") or delta.get("reasoning") or delta.get("reasoning_content")
                if txt:
                    now = time.perf_counter()
                    if t_first is None:
                        t_first = now
                    t_last = now
            u = d.get("usage")
            if u:
                prompt_tokens = u.get("prompt_tokens", prompt_tokens)
                gen_tokens = u.get("completion_tokens", gen_tokens)
    if t_first is None:
        t_first = t_last = time.perf_counter()
    return {
        "ttft": t_first - t0,
        "decode_time": max(t_last - t_first, 1e-6),
        "prompt_tokens": prompt_tokens,
        "gen_tokens": gen_tokens,
        "t_first": t_first,
        "t_last": t_last,
    }

def run_level(n):
    with ThreadPoolExecutor(max_workers=n) as ex:
        results = list(ex.map(lambda _: one_request(), range(n)))
    mean_ttft = sum(r["ttft"] for r in results) / n
    mean_prefill = sum(r["prompt_tokens"] / r["ttft"] for r in results) / n
    mean_decode = sum(r["gen_tokens"] / r["decode_time"] for r in results) / n
    win = max(r["t_last"] for r in results) - min(r["t_first"] for r in results)
    win = max(win, 1e-6)
    agg_decode = sum(r["gen_tokens"] for r in results) / win
    pf_win = max(r["t_first"] for r in results) - min(r["t_first"] - r["ttft"] for r in results)
    agg_prefill = sum(r["prompt_tokens"] for r in results) / max(pf_win, 1e-6)
    return mean_ttft, mean_prefill, agg_prefill, mean_decode, agg_decode, results[0]["prompt_tokens"]

print("Warming up all concurrency levels 1..10 (capturing CUDA graphs)...", flush=True)
for n in range(1, 11):
    run_level(n)
print(f"{'N':>2} | {'TTFT(s)':>8} | {'prefill/str':>11} | {'prefill agg':>11} | {'decode/str':>10} | {'decode agg':>10}")
print("-" * 74)
for n in range(1, 11):
    mt, mp, ap, md, ad, ptok = run_level(n)
    print(f"{n:>2} | {mt:8.2f} | {mp:9.0f}   | {ap:9.0f}   | {md:8.1f}   | {ad:8.1f}")
print(f"\n(prompt ~{ptok} tokens/req, {GEN_TOKENS} generated tokens/req, thinking off, ignore_eos)")

This post is licensed under CC BY 4.0 by the author.