training: document state files
Add State Files section to DESIGN.md documenting:

- /tmp/vllm_weight_handles.pt (IPC handles)
- trained-responses.json (prevent re-training)
- finetune-alternates marker file
- In-memory optimizer state (not persisted)

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 7e7e9a4b69
commit 78fa4b639f
1 changed file with 25 additions and 2 deletions
```diff
@@ -204,9 +204,32 @@ against live GPU weights block by block, memcpy only changed
 regions. For small behavioral updates, turns a 54GB write into
 a few hundred MB.
 
-- Every 10 minutes via cron on B200
+- Scheduled 10 minutes after training (batched)
 - Daily rsync to moria for long-term storage
-- Tool: `apollo-checkpoint sync --model-dir <path>` (Rust)
+- Tool: `apollo-checkpoint sync --model-dir <path>`
+
+## State Files
+
+### B200 (training server)
+
+| File | Purpose |
+|------|---------|
+| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct the HF model with vLLM weight views. |
+| `<model_dir>/*.safetensors` | Model weights. Updated in-place by checkpoint_sync. |
+
+### Moria (client)
+
+| File | Purpose |
+|------|---------|
+| `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
+| `~/.consciousness/cache/finetune-alternates` | Marker file. If it exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
+
+### In-memory (not persisted)
+
+| State | Location | Notes |
+|-------|----------|-------|
+| Apollo optimizer state | train_router._model | Created fresh each /train call. ~10GB for rank-256. Not persisted between requests. |
+| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
+
 ## Hyperparameters
```
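The hunk context describes checkpoint sync diffing the new checkpoint against live weights block by block and rewriting only changed regions. The real tool is `apollo-checkpoint` (Rust); the sketch below is a minimal, hypothetical Python illustration of the same delta-write idea (function name, block size, and the assumption that both files are the same length are mine, not from the codebase):

```python
BLOCK_SIZE = 1 << 20  # 1 MiB comparison granularity (illustrative choice)

def sync_checkpoint(new_path: str, live_path: str) -> int:
    """Rewrite only the blocks of live_path that differ from new_path.

    Assumes both files have the same length, as in an in-place weight
    update of an unchanged model architecture. Returns bytes written.
    """
    written = 0
    with open(new_path, "rb") as src, open(live_path, "r+b") as dst:
        offset = 0
        while True:
            new_block = src.read(BLOCK_SIZE)
            if not new_block:
                break
            dst.seek(offset)
            old_block = dst.read(len(new_block))
            if new_block != old_block:
                # Only changed regions touch the disk: a small behavioral
                # update rewrites a few blocks, not the whole file.
                dst.seek(offset)
                dst.write(new_block)
                written += len(new_block)
            offset += len(new_block)
    return written
```

For weights that barely changed, `written` stays near zero, which is the effect the diff describes: a 54GB full rewrite shrinks to a few hundred MB of touched blocks.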
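The dedup role of `trained-responses.json` can be sketched as follows. This is a hypothetical client-side illustration, not code from the repository: the helper names, the `ts_ms` field, and the flat-JSON-array file layout are assumptions; only the cache path and the skip-already-trained behavior come from the table above.

```python
import json
import os

CACHE = os.path.expanduser("~/.consciousness/cache/trained-responses.json")

def load_trained(path: str = CACHE) -> set:
    """Load the set of response timestamps (ms) already sent to /train."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()  # first run: nothing trained yet

def select_untrained(responses: list, trained: set) -> list:
    """Keep only responses whose timestamp has not been trained on."""
    return [r for r in responses if r["ts_ms"] not in trained]

def mark_trained(timestamps, path: str = CACHE) -> None:
    """Record newly trained timestamps so they are skipped next time."""
    trained = load_trained(path)
    trained.update(timestamps)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(sorted(trained), f)
```

A client would call `select_untrained` before posting to /train and `mark_trained` after a successful response, so a crash between the two at worst re-trains a response rather than silently dropping one.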