vllm-metal

Install

# review the file content first
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
 
# in .bashrc
alias vllm='env HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 PATH="$HOME/.venv-vllm-metal/bin:$PATH" "$HOME/.venv-vllm-metal/bin/vllm"'

Run vllm without HF_HUB_OFFLINE=1 once to download the model to .cache/huggingface, the cache directory used for Hugging Face models.

vllm-metal sets VLLM_METAL_MEMORY_FRACTION to 0.90 by default, which results in very high memory utilization:

(EngineCore pid=84245) INFO 04-25 17:51:26 [cache_policy.py:709] Paged attention: VLLM_METAL_MEMORY_FRACTION=auto, defaulting to 0.90 for paged path
(EngineCore pid=84245) INFO 04-25 17:51:26 [cache_policy.py:567] Paged attention memory breakdown: metal_limit=51.54GB, fraction=0.90, usable_metal=46.39GB, model_memory=14.96GB, overhead=1.13GB, kv_budget=30.30GB, per_block_bytes=3604480, num_blocks=8405, max_tokens_cached=134480

Based on https://docs.vllm.ai/projects/vllm-metal/en/latest/configuration/, we could use VLLM_METAL_* environment variables to control the behavior.

You may need to reduce --max-model-len to fit the model or KV cache in memory:

ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=21.64GB, overhead=1.08GB, kv_budget=-0.54GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.
ValueError: To serve at least one request with the models's max seq len (131072), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (21.01 GiB). Based on the available memory, the estimated maximum model length is 100160. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

Gemma 4

overall model support remains experimental https://github.com/vllm-project/vllm-metal/blob/main/docs/supported_models.md

curl -o ~/.local/share/tool_chat_template_gemma4.jinja "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/tool_chat_template_gemma4.jinja"
 
export VLLM_METAL_MEMORY_FRACTION=0.65
 
vllm serve unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit \
  --max-model-len 65536 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template ~/.local/share/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt image=0,audio=0 \
  --port <port> --api-key <your key>

For Gemma 4 specifically, see https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch.

All in one Bash script:

#!/bin/bash
 
env HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 VLLM_METAL_MEMORY_FRACTION=0.65 PATH="$HOME/.venv-vllm-metal/bin:$PATH" "$HOME/.venv-vllm-metal/bin/vllm" serve unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit \
  --max-model-len 65536 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template ~/.local/share/tool_chat_template_gemma4.jinja \
  --port <port> --api-key <your key>

Configuration Tips

  • Set --max-model-len to match your actual workload. The default context length can be very large; reducing it saves memory for KV cache. Note: 1K = 1024 in model cards.
  • For text-only workloads, pass --limit-mm-per-prompt image=0,audio=0 to skip multimodal profiling entirely.
  • You can enable thinking with --default-chat-template-kwargs '{"enable_thinking": true}', but it won’t work in responses API. Even in chat completions API, you may get an empty response after extensive reasoning.
    • To workaround this, try prepending ~/.local/share/tool_chat_template_gemma4.jinja with:
{%- if enable_thinking is not defined -%}
  {%- set enable_thinking = true -%}
{%- endif -%}
  • vLLM respects the model’s generation_config.json, which is good:
[model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.

Qwen3.6

#export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=0.80
 
vllm serve mlx-community/Qwen3.6-35B-A3B-4bit \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --port <port> --api-key <your key>

For Qwen3.6 specifically, see https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html.

We recommend using the following set of sampling parameters for generation

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.

  • Avoid the dense model, which requires more RAM for KV cache. Check kv_budget=... with paged attention enabled to see how much more RAM is needed:
Qwen3.6-27B-UD-MLX-4bit

ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=26.19GB, overhead=1.16GB, kv_budget=-28.11GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.

Qwen3.6-35B-A3B-UD-MLX-4bit

ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=22.75GB, overhead=1.00GB, kv_budget=-1.57GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.
  • Paged attention should be disabled:
Hybrid model (e.g., Qwen3.5) with paged attention enabled. Using block-size translation (PR #235) to convert vLLM's large block_size to a Metal kernel-compatible size.
  Mechanism: Each vLLM block is split into multiple kernel blocks.
  Example: vLLM block_size=160 → kernel block_size=32 (ratio=5).
  The KV cache is reshaped (zero-copy) and block tables are expanded.
  This is a logical transformation — physical memory is unchanged.
  Note: The default MLX path (without paged attention) is recommended for hybrid models as it has no translation overhead.
  • But it’s broken with the same --max-model-len that works with paged attention enabled:
ValueError: To serve at least one request with the models's max seq len (32768), (0.7 GiB KV cache is needed, which is larger than the available KV cache memory (0.7 GiB). Based on the available memory, the estimated maximum model length is 32736. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

For CLI arguments, see https://docs.vllm.ai/en/latest/cli/serve/.

For authentication and security, see https://docs.vllm.ai/en/latest/usage/security/?h=api+key#api-key-authentication-limitations.

Important: Do not rely exclusively on --api-key for securing access to vLLM. Additional security measures are required for production deployments.

Configure llm CLI

pipx install llm
 
cat > "$(dirname "$(llm logs path)")"/extra-openai-models.yaml <<'EOF'
- model_id: qwen3.6
  model_name: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit
  api_base: "http://localhost:6599"
EOF
llm models default qwen3.6

References