Running MiniMax-M2.1 Locally with Claude Code on Dual RTX Pro 6000

Run Claude Code with your own local MiniMax-M2.1 model using vLLM's native Anthropic API endpoint support.

Hardware Used

Component      Specification
CPU            AMD Ryzen 9 7950X3D 16-Core Processor
Motherboard    ROG CROSSHAIR X670E HERO
GPU            Dual NVIDIA RTX Pro 6000 (96 GB VRAM each)
RAM            192 GB DDR5-5200 (the model fits entirely in VRAM and never spills into system RAM, which keeps it fast enough to be usable)
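
Before installing anything, it's worth confirming that both GPUs are visible to the driver. A quick check with nvidia-smi should report two cards with roughly 96 GB each:

# Both RTX Pro 6000 cards should appear with ~96 GB of VRAM each
nvidia-smi --query-gpu=index,name,memory.total --format=csv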

Install vLLM Nightly

Prerequisites: Ubuntu 24.04 with current NVIDIA drivers installed

mkdir ~/vllm-nightly
cd ~/vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
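
A quick way to confirm the nightly wheel installed correctly is to print the version from the activated environment:

python -c "import vllm; print(vllm.__version__)"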

Download MiniMax-M2.1

Set up a separate environment for downloading models:

mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate

pip install huggingface_hub

Download the AWQ-quantized MiniMax-M2.1 model:

mkdir /models/awq
huggingface-cli download cyankiwi/MiniMax-M2.1-AWQ-4bit \
    --local-dir /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
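
Once the download finishes, it's worth checking that the weights landed where vLLM will look for them (the exact file list depends on the repository):

# The AWQ checkpoint should be several tens of GB on disk
du -sh /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
ls /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit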

Start vLLM Server

From your vLLM environment, launch the server with the Anthropic-compatible endpoint:

cd ~/vllm-nightly
source .venv/bin/activate

vllm serve \
    /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit \
    --served-model-name MiniMax-M2.1-AWQ \
    --max-num-seqs 10 \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Key flags explained:

Flag                                          Purpose
--tensor-parallel-size 2                      Splits the model across the 2 GPUs
--enable-auto-tool-choice                     Enables tool/function calling
--tool-call-parser minimax_m2                 Uses MiniMax-specific tool-call parsing
--reasoning-parser minimax_m2_append_think    Enables thinking/reasoning output

The server exposes /v1/messages (Anthropic-compatible) at http://localhost:8000.
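
Before wiring up Claude Code, you can smoke-test the endpoint directly with curl. This is a minimal sketch of an Anthropic-style messages request; since the server was started without an API key, the key header is just a placeholder:

curl http://localhost:8000/v1/messages \
    -H "content-type: application/json" \
    -H "x-api-key: dummy" \
    -H "anthropic-version: 2023-06-01" \
    -d '{
        "model": "MiniMax-M2.1-AWQ",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
    }'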


Install Claude Code

Install Claude Code on macOS, Linux, or WSL:

curl -fsSL https://claude.ai/install.sh | bash

See the official Claude Code documentation for more details.


Configure Claude Code

Create settings.json

Create or edit ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "ANTHROPIC_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_SMALL_FAST_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "MiniMax-M2.1-AWQ"
  }
}
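
The model names here must match the --served-model-name passed to vLLM. If in doubt, the server also exposes the standard OpenAI-compatible model listing, which shows exactly what name it is serving under:

curl http://localhost:8000/v1/models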

Skip Onboarding (Workaround for Bug)

Due to a known bug in Claude Code 2.0.65+, fresh installs may ignore settings.json during onboarding. Add hasCompletedOnboarding to ~/.claude.json:

# If ~/.claude.json doesn't exist, create it:
echo '{"hasCompletedOnboarding": true}' > ~/.claude.json

# If it exists, add the field manually or use jq:
jq '. + {"hasCompletedOnboarding": true}' ~/.claude.json > tmp.json && mv tmp.json ~/.claude.json

Run Claude Code

With vLLM running in one terminal, open another and run:

claude

Claude Code will now use your local MiniMax-M2.1 model! If you also want to configure the Claude Code VS Code extension, see the extension's documentation.
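
For a quick non-interactive check that requests are actually reaching the local server, Claude Code's print mode runs a single prompt and exits; you should see the request show up in the vLLM terminal:

# -p / --print sends one prompt non-interactively and prints the response
claude -p "Reply with the single word: ready"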


This post is licensed under CC BY 4.0 by the author.