Running MiniMax-M2.1 Locally with Claude Code on Dual RTX Pro 6000

Run Claude Code with your own local MiniMax-M2.1 model using vLLM's native Anthropic API endpoint support.

Hardware Used

Component      Specification
CPU            AMD Ryzen 9 7950X3D 16-Core Processor
Motherboard    ROG CROSSHAIR X670E HERO
GPU            Dual NVIDIA RTX Pro 6000 (96 GB VRAM each)
RAM            192 GB DDR5-5200 (the model fits entirely in VRAM and never spills into system RAM, which keeps it fast enough to be usable)
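
Before installing anything, it's worth confirming that both GPUs are visible to the driver. A quick check with nvidia-smi should report two cards with roughly 96 GB each:

# Both RTX Pro 6000 cards should appear with ~96 GB of VRAM each
nvidia-smi --query-gpu=index,name,memory.total --format=csv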

Install vLLM Nightly

Prerequisites: Ubuntu 24.04 with current NVIDIA drivers installed

mkdir ~/vllm-nightly
cd ~/vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
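
A quick way to confirm the nightly wheel installed correctly is to print the version from the activated environment:

python -c "import vllm; print(vllm.__version__)"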

Download MiniMax-M2.1

Set up a separate environment for downloading models:

mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate

pip install huggingface_hub

Download the AWQ-quantized MiniMax-M2.1 model:

mkdir /models/awq
huggingface-cli download cyankiwi/MiniMax-M2.1-AWQ-4bit \
    --local-dir /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
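
Once the download finishes, it's worth checking that the weights landed where vLLM will look for them (the exact file list depends on the repository):

# The AWQ checkpoint should be several tens of GB on disk
du -sh /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
ls /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit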

Start vLLM Server

From your vLLM environment, launch the server with the Anthropic-compatible endpoint:

cd ~/vllm-nightly
source .venv/bin/activate

vllm serve \
    /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit \
    --served-model-name MiniMax-M2.1-AWQ \
    --max-num-seqs 10 \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Key flags explained:

Flag                                          Purpose
--tensor-parallel-size 2                      Splits the model across the 2 GPUs
--enable-auto-tool-choice                     Enables tool/function calling
--tool-call-parser minimax_m2                 Uses MiniMax-specific tool-call parsing
--reasoning-parser minimax_m2_append_think    Enables thinking/reasoning output

The server exposes /v1/messages (Anthropic-compatible) at http://localhost:8000.
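
Before wiring up Claude Code, you can smoke-test the endpoint directly with curl. This is a minimal sketch of an Anthropic-style messages request; since the server was started without an API key, the key header is just a placeholder:

curl http://localhost:8000/v1/messages \
    -H "content-type: application/json" \
    -H "x-api-key: dummy" \
    -H "anthropic-version: 2023-06-01" \
    -d '{
        "model": "MiniMax-M2.1-AWQ",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
    }'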


Install Claude Code

Install Claude Code on macOS, Linux, or WSL:

curl -fsSL https://claude.ai/install.sh | bash

See the official Claude Code documentation for more details.


Configure Claude Code

Create settings.json

Create or edit ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "ANTHROPIC_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_SMALL_FAST_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "MiniMax-M2.1-AWQ"
  }
}
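
The model names here must match the --served-model-name passed to vLLM. If in doubt, the server also exposes the standard OpenAI-compatible model listing, which shows exactly what name it is serving under:

curl http://localhost:8000/v1/models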

Skip Onboarding (Workaround for Bug)

Due to a known bug in Claude Code 2.0.65+, fresh installs may ignore settings.json during onboarding. Add hasCompletedOnboarding to ~/.claude.json:

# If ~/.claude.json doesn't exist, create it:
echo '{"hasCompletedOnboarding": true}' > ~/.claude.json

# If it exists, add the field manually or use jq:
jq '. + {"hasCompletedOnboarding": true}' ~/.claude.json > tmp.json && mv tmp.json ~/.claude.json

Run Claude Code

With vLLM running in one terminal, open another and run:

claude

Claude Code will now use your local MiniMax-M2.1 model! If you also want to configure the Claude Code VS Code extension, see the extension's documentation.
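
For a quick non-interactive check that requests are actually reaching the local server, Claude Code's print mode runs a single prompt and exits; you should see the request show up in the vLLM terminal:

# -p / --print sends one prompt non-interactively and prints the response
claude -p "Reply with the single word: ready"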


This post is licensed under CC BY 4.0 by the author.