poc-memory now reads from poc-agent's config.json5 as the primary config source. Memory-specific settings live in a "memory" section; API credentials are resolved from the shared model/backend config instead of being duplicated.

- Add "memory" section to ~/.config/poc-agent/config.json5
- poc-memory config.rs: try shared config first, fall back to legacy JSONL
- API fields (base_url, api_key, model) resolved via memory.agent_model -> models -> backend lookup
- Add json5 dependency for proper JSON5 parsing
- Update provisioning scripts: hermes -> qwen3_coder tool parser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
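The memory.agent_model -> models -> backend lookup described above can be sketched as follows. This is an illustrative sketch only: the key names ("memory", "models", "backends", "agent_model") and all values are assumptions based on the commit message, not the actual schema, and the real implementation lives in poc-memory's config.rs (Rust); Python is used here purely for brevity.

```python
def resolve_api_fields(cfg: dict) -> dict:
    """Follow memory.agent_model -> models -> backend to collect API fields."""
    model_name = cfg["memory"]["agent_model"]      # which model the memory agent uses
    model = cfg["models"][model_name]              # model entry names its backend
    backend = cfg["backends"][model["backend"]]    # backend holds the credentials
    return {
        "base_url": backend["base_url"],
        "api_key": backend["api_key"],
        "model": model["model"],
    }

# Example shared config (hypothetical layout and values)
shared = {
    "memory": {"agent_model": "coder"},
    "models": {"coder": {"backend": "local_vllm", "model": "Qwen/Qwen3.5-27B"}},
    "backends": {"local_vllm": {"base_url": "http://localhost:8000/v1",
                                "api_key": "not-needed"}},
}
print(resolve_api_fields(shared)["base_url"])  # -> http://localhost:8000/v1
```

The point of the indirection is that poc-memory never stores its own copy of base_url or api_key; it only names a model, and the shared config supplies the rest.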
26 lines
835 B
Text
FROM nvidia/cuda:12.9.0-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PATH="/root/.local/bin:${PATH}"

RUN apt-get update -qq && \
    apt-get install -y -qq python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir vllm ninja huggingface_hub

# Pre-download model weights (optional; uncomment to bake the weights into
# the image, otherwise they are pulled at container start)
# RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3.5-27B')"

EXPOSE 8000

ENTRYPOINT ["vllm", "serve"]
CMD ["Qwen/Qwen3.5-27B", \
     "--port", "8000", \
     "--max-model-len", "262144", \
     "--gpu-memory-utilization", "0.95", \
     "--enable-prefix-caching", \
     "--enable-auto-tool-choice", \
     "--tool-call-parser", "qwen3_coder", \
     "--reasoning-parser", "qwen3", \
     "--uvicorn-log-level", "warning"]
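One way to run this image with GPU access and the exposed port is a docker-compose service, sketched below; the service name is an assumption, and the deploy.resources block uses Compose's standard NVIDIA GPU reservation syntax:

```yaml
services:
  vllm:
    build: .                  # builds the Dockerfile above
    ports:
      - "8000:8000"           # vLLM's OpenAI-compatible API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With this in place, the base_url that poc-memory resolves from the shared config would point at http://localhost:8000/v1 on the host.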