consciousness/scripts/provision-vllm.sh

#!/bin/bash
# provision-vllm.sh — Set up vllm on a RunPod GPU instance
#
# Usage: ssh into your RunPod instance and run:
#   curl -sSL https://raw.githubusercontent.com/... | bash
# Or just scp this script and run it.
#
# Expects: NVIDIA GPU with sufficient VRAM (B200: 192GB, A100: 80GB)
# Installs: vllm with Qwen 3.5 27B
# Exposes: OpenAI-compatible API on port 8000

set -euo pipefail

MODEL="${MODEL:-Qwen/Qwen3.5-27B}"
PORT="${PORT:-8000}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-262144}"
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.95}"

echo "=== vllm provisioning ==="
echo "Model: $MODEL"
echo "Port:  $PORT"
echo "Max context: $MAX_MODEL_LEN"
echo ""

# --- Install vllm ---
echo "Installing vllm..."
pip install --upgrade vllm --break-system-packages 2>&1 | tail -3

# --- Use persistent storage ---
export HF_HOME=/workspace/huggingface

# --- Verify GPU ---
echo ""
echo "GPU status:"
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
echo ""

# --- Download model (cached under $HF_HOME, i.e. /workspace/huggingface) ---
echo "Downloading model (this may take a while on first run)..."
pip install --upgrade huggingface_hub --break-system-packages -q 2>/dev/null
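# Pass the model name as an argument rather than interpolating it into Python source.
python3 -c "import sys; from huggingface_hub import snapshot_download; snapshot_download(sys.argv[1])" "$MODEL" 2>&1 | tail -5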
echo ""

# --- Launch vllm ---
echo "Starting vllm server on port $PORT..."
echo "API will be available at http://0.0.0.0:$PORT/v1"
echo ""

exec vllm serve "$MODEL" \
    --port "$PORT" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser=qwen3 \
    --uvicorn-log-level warning
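
Because the script ends with `exec vllm serve`, nothing placed after that line runs, and the port may not answer until the model has finished loading. A caller in a second shell may want to wait for readiness before sending requests. The following is a minimal sketch, not part of the original script: `wait_until` is a hypothetical helper, and using `/v1/models` as the readiness probe is an assumption based on the OpenAI-compatible API the server exposes.

```shell
#!/bin/bash
# Hypothetical helper (not in provision-vllm.sh): retry a command until it
# succeeds or the timeout elapses.
#   $1 = timeout in seconds, $2 = delay between attempts, rest = command
wait_until() {
    local timeout_s="$1" interval_s="$2"
    shift 2
    local deadline=$(( SECONDS + timeout_s ))
    until "$@" >/dev/null 2>&1; do
        if (( SECONDS >= deadline )); then
            return 1    # gave up: the command never succeeded in time
        fi
        sleep "$interval_s"
    done
    return 0
}

# Example (assumes the server started by the script is coming up on this host;
# the 900 s timeout and 5 s interval are arbitrary illustrative values):
#   PORT="${PORT:-8000}"
#   if wait_until 900 5 curl -sf "http://localhost:${PORT}/v1/models"; then
#       echo "vllm is ready"
#   else
#       echo "vllm did not come up" >&2
#   fi
```

Keeping the probe command as an argument means the same helper works with any check that returns a non-zero status on failure.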

2026-03-18 23:13:04 -04:00
Add vllm provisioning script for RunPod GPU instances

Sets up vllm with Qwen 2.5 27B Instruct, prefix caching enabled, Hermes tool call parser for function calling support. Configurable via environment variables (MODEL, PORT, MAX_MODEL_LEN).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-19 00:06:26 -04:00
Fix poc-agent for vllm/Qwen 3.5: reasoning display, tool parser

- Always display reasoning tokens regardless of reasoning_effort setting; Qwen 3.5 thinks natively and the reasoning parser separates it into its own field
- Remove chat_template_kwargs that disabled thinking when reasoning_effort was "none"
- Add chat_template_kwargs field to ChatRequest for vllm compat
- Update provision script: qwen3_xml tool parser, qwen3 reasoning parser, 262K context, 95% GPU memory utilization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-19 21:49:58 -04:00
Consolidate poc-memory and poc-agent configs

poc-memory now reads from poc-agent's config.json5 as the primary config source. Memory-specific settings live in a "memory" section; API credentials are resolved from the shared model/backend config instead of being duplicated.

- Add "memory" section to ~/.config/poc-agent/config.json5
- poc-memory config.rs: try shared config first, fall back to legacy JSONL
- API fields (base_url, api_key, model) resolved via memory.agent_model -> models -> backend lookup
- Add json5 dependency for proper JSON5 parsing
- Update provisioning scripts: hermes -> qwen3_coder tool parser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
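
The script header expects "sufficient VRAM" and the script prints `nvidia-smi` output, but nothing acts on that output. A pre-flight check along the following lines could fail early instead of partway through model loading. This is a sketch under assumptions: the function names and the `MIN_FREE_MIB` threshold are hypothetical, and the parsing assumes the `csv,noheader` format requested in the script, where each line ends in a value such as `81050 MiB`.

```shell
#!/bin/bash
# Hypothetical helpers, not part of provision-vllm.sh.

# Extract the free-memory figure (MiB) from one CSV line produced by
#   nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
free_mib_from_csv() {
    local line="$1"
    local free="${line##*, }"   # keep only the last comma-separated field
    free="${free% MiB}"         # drop the unit suffix
    printf '%s\n' "$free"
}

# Succeed if the GPU described by the CSV line has at least min_mib free.
has_enough_vram() {
    local line="$1" min_mib="$2" free
    free="$(free_mib_from_csv "$line")"
    [[ "$free" =~ ^[0-9]+$ ]] && (( free >= min_mib ))
}

# Read CSV lines on stdin; fail if any GPU is below min_mib.
check_gpus() {
    local min_mib="$1" line rc=0
    while IFS= read -r line; do
        if ! has_enough_vram "$line" "$min_mib"; then
            echo "Insufficient free VRAM: $line (need >= ${min_mib} MiB)" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (MIN_FREE_MIB is an assumed knob; choose a value suited to the model):
#   nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader \
#       | check_gpus "${MIN_FREE_MIB:-60000}"
```

Under `set -euo pipefail`, as used by the script, a non-zero return from `check_gpus` at the end of that pipeline would stop provisioning before the download step.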