From 78fa4b639f322f2235796d0ec7fb8d5d44af091b Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Thu, 16 Apr 2026 00:49:04 -0400
Subject: [PATCH] training: document state files

Add State Files section to DESIGN.md documenting:

- /tmp/vllm_weight_handles.pt (IPC handles)
- trained-responses.json (prevent re-training)
- finetune-alternates marker file
- In-memory optimizer state (not persisted)

Co-Authored-By: Proof of Concept
---
 training/DESIGN.md | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/training/DESIGN.md b/training/DESIGN.md
index bf6a774..00ca499 100644
--- a/training/DESIGM.md
+++ b/training/DESIGN.md
@@ -204,9 +204,32 @@
 against live GPU weights block by block, memcpy only changed
 regions. For small behavioral updates, turns a 54GB write into a few
 hundred MB.
 
-- Every 10 minutes via cron on B200
+- Scheduled 10 minutes after training (batched)
 - Daily rsync to moria for long-term storage
-- Tool: `apollo-checkpoint sync --model-dir ` (Rust)
+- Tool: `apollo-checkpoint sync --model-dir `
+
+## State Files
+
+### B200 (training server)
+
+| File | Purpose |
+|------|---------|
+| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct an HF model with vLLM weight views. |
+| `/*.safetensors` | Model weights. Updated in place by checkpoint_sync. |
+
+### Moria (client)
+
+| File | Purpose |
+|------|---------|
+| `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
+| `~/.consciousness/cache/finetune-alternates` | Marker file. If it exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
+
+### In-memory (not persisted)
+
+| State | Location | Notes |
+|-------|----------|-------|
+| Apollo optimizer state | train_router._model | Created fresh on each /train call. ~10GB for rank-256. Not persisted between requests. |
+| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
 
 ## Hyperparameters
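For reviewers, a minimal sketch (outside the patch, ignored by `git am`) of the dedup logic the `trained-responses.json` entry implies. It assumes the file holds a JSON array of millisecond timestamps; the function names `load_trained` and `mark_trained` are hypothetical, not names from the codebase.

```python
import json
from pathlib import Path

CACHE = Path.home() / ".consciousness/cache/trained-responses.json"

def load_trained(cache: Path = CACHE) -> set[int]:
    """Return the set of response timestamps (ms) already sent to /train."""
    try:
        return set(json.loads(cache.read_text()))
    except FileNotFoundError:
        return set()

def mark_trained(ts_ms: int, cache: Path = CACHE) -> None:
    """Record a timestamp so the same response is never re-trained."""
    seen = load_trained(cache)
    seen.add(ts_ms)
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(sorted(seen)))
```

The client would skip any response whose timestamp is already in `load_trained()` before posting it to /train.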