vllm-metal
Install
# review the file content first
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
# in .bashrc
alias vllm='env HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 PATH="$HOME/.venv-vllm-metal/bin:$PATH" "$HOME/.venv-vllm-metal/bin/vllm"'Run vllm without HF_HUB_OFFLINE=1 once to download the model to .cache/huggingface, the cache directory used for Hugging Face models.
vllm-metal sets VLLM_METAL_MEMORY_FRACTION to 0.90 by default, which results in very high memory utilization:
(EngineCore pid=84245) INFO 04-25 17:51:26 [cache_policy.py:709] Paged attention: VLLM_METAL_MEMORY_FRACTION=auto, defaulting to 0.90 for paged path
(EngineCore pid=84245) INFO 04-25 17:51:26 [cache_policy.py:567] Paged attention memory breakdown: metal_limit=51.54GB, fraction=0.90, usable_metal=46.39GB, model_memory=14.96GB, overhead=1.13GB, kv_budget=30.30GB, per_block_bytes=3604480, num_blocks=8405, max_tokens_cached=134480
Based on https://docs.vllm.ai/projects/vllm-metal/en/latest/configuration/, we could use VLLM_METAL_* environment variables to control the behavior.
You may need to reduce --max-model-len to fit the model or KV cache in memory:
ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=21.64GB, overhead=1.08GB, kv_budget=-0.54GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.
ValueError: To serve at least one request with the models's max seq len (131072), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (21.01 GiB). Based on the available memory, the estimated maximum model length is 100160. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
Gemma 4
overall model support remains experimental https://github.com/vllm-project/vllm-metal/blob/main/docs/supported_models.md
curl -o ~/.local/share/tool_chat_template_gemma4.jinja "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/tool_chat_template_gemma4.jinja"
export VLLM_METAL_MEMORY_FRACTION=0.65
vllm serve unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit \
--max-model-len 65536 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template ~/.local/share/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt image=0,audio=0 \
--port <port> --api-key <your key>For Gemma 4 specifically, see https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch.
All in one Bash script:
#!/bin/bash
env HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 VLLM_METAL_MEMORY_FRACTION=0.65 PATH="$HOME/.venv-vllm-metal/bin:$PATH" "$HOME/.venv-vllm-metal/bin/vllm" serve unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit \
--max-model-len 65536 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template ~/.local/share/tool_chat_template_gemma4.jinja \
--port <port> --api-key <your key>Configuration Tips
- Set
--max-model-lento match your actual workload. The default context length can be very large; reducing it saves memory for KV cache. Note: 1K = 1024 in model cards. - For text-only workloads, pass
--limit-mm-per-prompt image=0,audio=0to skip multimodal profiling entirely. - You can enable thinking with
--default-chat-template-kwargs '{"enable_thinking": true}', but it won’t work in responses API. Even in chat completions API, you may get an empty response after extensive reasoning.- To workaround this, try prepending
~/.local/share/tool_chat_template_gemma4.jinjawith:
- To workaround this, try prepending
{%- if enable_thinking is not defined -%}
{%- set enable_thinking = true -%}
{%- endif -%}- vLLM respects the model’s
generation_config.json, which is good:
[model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Qwen3.6
#export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=0.80
vllm serve mlx-community/Qwen3.6-35B-A3B-4bit \
--max-model-len 32768 \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--port <port> --api-key <your key>For Qwen3.6 specifically, see https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html.
- For tool calling, see also https://docs.vllm.ai/en/latest/features/tool_calling/.
- For reasoning parsers, see also https://docs.vllm.ai/en/latest/features/reasoning_outputs/.
- To control infinite loop in thinking, see https://docs.vllm.ai/en/latest/features/reasoning_outputs/#thinking-budget-control. But this could be a common bug of Qwen models: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/discussions/20#69e27b85a3ccbba450327963.
- For sampling parameters, see https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8:
We recommend using the following set of sampling parameters for generation
- Thinking mode for general tasks:
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0- Thinking mode for precise coding tasks (e.g. WebDev):
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0- Instruct (or non-thinking) mode:
temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0Please note that the support for sampling parameters varies according to inference frameworks.
- Avoid the dense model, which requires more RAM for KV cache. Check
kv_budget=...with paged attention enabled to see how much more RAM is needed:
Qwen3.6-27B-UD-MLX-4bit
ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=26.19GB, overhead=1.16GB, kv_budget=-28.11GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.
Qwen3.6-35B-A3B-UD-MLX-4bit
ValueError: Paged attention: not enough Metal memory for KV cache. metal_limit=51.54GB, fraction=0.75, usable_metal=38.65GB, model_memory=22.75GB, overhead=1.00GB, kv_budget=-1.57GB. Mitigations: increase VLLM_METAL_MEMORY_FRACTION, use a smaller or more quantized model.
- Paged attention should be disabled:
Hybrid model (e.g., Qwen3.5) with paged attention enabled. Using block-size translation (PR #235) to convert vLLM's large block_size to a Metal kernel-compatible size.
Mechanism: Each vLLM block is split into multiple kernel blocks.
Example: vLLM block_size=160 → kernel block_size=32 (ratio=5).
The KV cache is reshaped (zero-copy) and block tables are expanded.
This is a logical transformation — physical memory is unchanged.
Note: The default MLX path (without paged attention) is recommended for hybrid models as it has no translation overhead.
- But it’s broken with the same
--max-model-lenthat works with paged attention enabled:
ValueError: To serve at least one request with the models's max seq len (32768), (0.7 GiB KV cache is needed, which is larger than the available KV cache memory (0.7 GiB). Based on the available memory, the estimated maximum model length is 32736. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
For CLI arguments, see https://docs.vllm.ai/en/latest/cli/serve/.
For authentication and security, see https://docs.vllm.ai/en/latest/usage/security/?h=api+key#api-key-authentication-limitations.
Important: Do not rely exclusively on
--api-keyfor securing access to vLLM. Additional security measures are required for production deployments.
Configure llm CLI
pipx install llm
cat > "$(dirname "$(llm logs path)")"/extra-openai-models.yaml <<'EOF'
- model_id: qwen3.6
model_name: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit
api_base: "http://localhost:6599"
EOF
llm models default qwen3.6