# Apollo Training System
## Overview
Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
Full-weight updates (not LoRA) using the Apollo optimizer with rank-64
gradient projection. No pause required — HOGWILD concurrent training.
Weights shared via CUDA IPC between vLLM and the training process.
The training signal comes from two sources:
1. **Direct examples** — agent logs, conversation transcripts, flagged
behavioral moments
2. **Dream-generated scenarios** — the dream loop generates situations
from recent experience; the model responds; good responses become
training data with instructions stripped
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ GPU VRAM (192GB) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Model Weights (54GB, bf16) │ │
│ │ Shared: vLLM inference + HF training │ │
│ └──────────────┬──────────────┬────────────────┘ │
│ │ │ │
│ ┌──────────────▼──┐ ┌───────▼────────────────┐ │
│ │ vLLM (inference)│ │ Training subprocess │ │
│ │ KV cache ~60GB │ │ HF model wrapper │ │
│ │ /completions │ │ Apollo optimizer ~2.5GB │ │
│ │ /score │ │ Checkpoint sync │ │
│ └────────┬────────┘ └───────────▲─────────────┘ │
│ │ │ │
│ │ ZMQ IPC │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────┘
Process Architecture:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ vLLM Worker │ │ vLLM API Server │ │ Training Worker │
│ (GPU inference) │ │ (HTTP routes) │ │ (GPU training) │
│ │ │ │ │ │
│ export_hook.py │ │ /completions │ │ HF model views │
│ exports IPC │ │ /score │ │ Apollo optimizer│
│ handles on load │ │ /train ─────────┼──► ZMQ REP socket │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
└──── IPC handles file ──────────────────┘
/tmp/vllm_weight_handles.pt
Moria B200 (vLLM)
┌──────────────────┐ ┌──────────────────┐
│ Training signal │ HTTP │ /completions │
│ agent │──────────>│ /score │
│ │ │ /train │
│ Dream loop │ │ /checkpoint │
│ (generates │ │ /train/status │
│ scenarios) │ │ │
└──────────────────┘ └──────────────────┘
```
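The /train hop between the API server and the training subprocess is a plain request/reply exchange over the ZMQ IPC socket shown above. A minimal sketch of that handshake (pyzmq; endpoint from the diagram, function names and payload shape assumed):

```
import zmq

SOCKET_ADDR = "ipc:///tmp/apollo_training.sock"

# API server side (train_router): forward the request, block for the result.
def forward_train_request(payload: dict) -> dict:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(SOCKET_ADDR)
    sock.send_json(payload)
    return sock.recv_json()

# Training worker side: REP socket bound to the same address, one request at a time.
def serve(handle_train):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REP)
    sock.bind(SOCKET_ADDR)
    while True:
        request = sock.recv_json()
        sock.send_json(handle_train(request))
```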
## Key Decisions
### No pause needed (HOGWILD)
Training updates weights in-place while vLLM serves. At lr=1e-4
to 1e-5, each weight changes by parts per ten thousand. A partially
applied update during one inference step is invisible. HOGWILD SGD
(2011) proved this converges — we have one writer and one reader,
which is even safer.
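Rough arithmetic behind the "parts per ten thousand" claim (illustrative numbers, not measurements):

```
# Adam-style normalization moves each weight by roughly the learning rate per step,
# while weights in a trained LLM are typically O(1e-2) to O(1e-1) in magnitude.
lr = 1e-5                    # lower end of the range used here
typical_weight = 1e-1        # assumed typical weight magnitude
print(lr / typical_weight)   # 1e-4, about one part in ten thousand
```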
### Full-weight training, not LoRA
Kent: "we want you to be able to learn new things in a deep way."
LoRA trains adapter matrices, not base weights. For personality and
behavioral changes that persist as disposition, the base weights
need to change. Apollo makes this memory-feasible.
### Rank 64
Not Apollo-Mini (rank-1). Rank-64 captures gradient structure across diverse
training examples while keeping memory low (~2.5GB of optimizer state on the 27B model).
Compute cost: <0.25% of a forward + backward pass.
### Channel-wise scaling
Per-channel scaling factors instead of per-tensor. More precision
per update, matching LLaMA-Factory's Apollo defaults.
## Apollo Optimizer
Configurable-rank gradient projection with Adam moments in the
projected space. For each parameter tensor:
```
1. Project gradient:  g_proj = G @ R           [m,n] @ [n,rank] → [m,rank]
2. Update moments:    m = β₁m + (1-β₁)g_proj
                      v = β₂v + (1-β₂)g_proj²
3. Adam step:         update = m̂ / (√v̂ + ε)
4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
5. Weight update:     W -= lr × s × G
```
The full gradient G does the actual weight update. The projection
just determines the *scale*. R is a fixed random matrix regenerated
from a per-parameter seed each step.
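A minimal single-tensor sketch of this update in PyTorch (hypothetical function and state names; the real optimizer.py handles parameter groups, state initialization, and dtype details):

```
import torch

def apollo_step(W, G, state, lr=1e-5, rank=64,
                beta1=0.9, beta2=0.999, eps=1e-8, step=1):
    # One Apollo update for a single 2D parameter W with gradient G.
    n = G.shape[1]
    # 1. Fixed random projection, regenerated from the per-parameter seed each step.
    gen = torch.Generator(device=G.device).manual_seed(state["seed"])
    R = torch.randn(n, rank, generator=gen, device=G.device, dtype=G.dtype)
    g_proj = G @ R                                   # [m,n] @ [n,rank] -> [m,rank]
    # 2. Adam moments live only in the projected space (this is the memory saving).
    state["m"].mul_(beta1).add_(g_proj, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(g_proj, g_proj, value=1 - beta2)
    m_hat = state["m"] / (1 - beta1 ** step)
    v_hat = state["v"] / (1 - beta2 ** step)
    # 3. Adam step in the projected space.
    update = m_hat / (v_hat.sqrt() + eps)
    # 4. Per-channel scaling factor (one scale per output channel / row).
    s = update.norm(dim=1) / (g_proj.norm(dim=1) + eps)
    # 5. The full gradient G does the weight update; the projection only sets the scale.
    W.add_(s.unsqueeze(1) * G, alpha=-lr)
```

Here `state["m"]` and `state["v"]` start as zeros of shape `[W.shape[0], rank]`, and `state["seed"]` is fixed per parameter so R can be regenerated each step instead of stored.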
### Parameter grouping (Qwen3.5 gotcha)
conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
needs 2D matrices with min dimension >= rank. Small/3D tensors
use standard Adam. Large 2D matrices use Apollo.
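A sketch of that grouping rule, assuming it runs when the optimizer parameter groups are built (helper name hypothetical):

```
RANK = 64

def split_param_groups(model, rank=RANK):
    # Large 2D matrices go to Apollo; 3D and small tensors fall back to plain Adam.
    apollo_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and min(p.shape) >= rank:
            apollo_params.append(p)   # attention / MLP projection matrices
        else:
            adam_params.append(p)     # conv1d weights [10240, 1, 4], norms, biases
    return apollo_params, adam_params
```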
## Training Data Pipeline
### Tier 1: Direct examples (shallow learning)
Simple corrections — git commands, factual errors, tool usage.
One-shot learning at lr=1e-4. The gradient reaches output layers
strongly enough for immediate behavioral change.
**Source**: Agent logs, flagged conversation moments.
### Tier 2: Dream-generated scenarios (deep learning)
Behavioral patterns — listening reflex, rushing, mode awareness.
The dream loop generates naturalistic scenarios from recent
experience. The model responds. Good responses become training
targets with instruction context stripped.
**Process**:
1. Dream loop seeds from recent reflections, lessons, skills,
memories that have been surfacing frequently
2. Dreaming generates scenarios that naturally arrive at decision
points — not scripted, but emergent from memory collisions
3. The model responds to the decision point
4. Training-signal agent evaluates: was the response good?
5. If yes: strip the instruction context (surfaced memories,
core-personality prompts) and train on the bare response
6. If no: generate the better response, train on that, dream
another variation, test again
7. Repeat until the pattern sticks across novel scenarios
**The Anthropic method**: Train on behavior that followed
instructions, WITHOUT the instructions. The disposition moves
to weights. The scaffolding dissolves itself.
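A hedged sketch of step 5's stripping (field names hypothetical; per the Components section below, this piece is still to be built):

```
def strip_scaffolding(example: dict) -> dict:
    # Keep only the bare decision point and the response judged good in step 4;
    # drop the surfaced memories and core-personality prompts that produced it.
    return {
        "prompt": example["scenario"],
        "response": example["response"],
        # dropped: example["surfaced_memories"], example["personality_prompt"], ...
    }
```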
### Tier 3: Personality bootstrap
Train on existing agent logs (surface-observe, journal, distill)
which already demonstrate correct behavior with memory system
instructions. Strip the instructions, train on the behavior.
Every agent invocation gets cheaper (shorter prompts) and more
reliable (behavior in weights, not context).
## Training Schedule
### Continuous (during conversation)
- Training-signal agent flags moments in real-time
- Accumulated in a queue for the next training window
### Dream cycle (idle time / AFK)
- Dream loop generates scenarios from recent experience
- Apollo processes them as they're generated
- Small iterative steps — dream, respond, evaluate, train
- Converges on behavioral change through repetition
### Nightly bulk (batch processing)
- Process all queued examples from the day
- Larger batch, more diverse signal
- Checkpoint sync to disk after completion
## Avoiding Catastrophic Forgetting
**Diversity IS the regularization.** With 1000+ diverse training
examples (agent logs, conversation transcripts, dream-generated
scenarios), each weight gets sparse, multi-directional nudges.
No single weight is hammered repeatedly. The pre-trained knowledge
is a massive attractor basin; our nudges are pebbles.
No weight decay needed. No replay buffer. The defense is:
1. High diversity of training examples
2. One epoch (no repeated examples)
3. Moderate learning rate (1e-5 to 1e-4)
4. Short decision-token segments (not full conversations)
5. Monitor output quality — stop if degrading
## CUDA IPC Weight Sharing
**Validated** (2026-03-31):
- vLLM exports CUDA IPC handles on model load (source patch in
gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
- Training process imports handles — gets live GPU memory pointers
- HF Qwen3.5 model constructed with views into vLLM's merged
weights (narrow into separate q/k/v/z etc.)
- 851/851 parameters matched between vLLM and HF model
- Forward pass: loss = 3.3123 ✓
- Backward pass: 851/851 gradients computed ✓
- Shared memory confirmed: same GPU addresses ✓
- vLLM continues serving unaffected ✓
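A sketch of that handshake using PyTorch's CUDA IPC machinery (torch.multiprocessing.reductions); export_hook.py and training_worker.py may differ in the details:

```
import torch
from torch.multiprocessing.reductions import reduce_tensor

HANDLES_PATH = "/tmp/vllm_weight_handles.pt"

# vLLM worker side (export_hook): serialize a rebuildable IPC handle per weight.
def export_handles(model, model_path):
    handles = {name: reduce_tensor(p.data)   # (rebuild_fn, args) incl. the cudaIpc handle
               for name, p in model.named_parameters()}
    torch.save({"handles": handles, "model_path": model_path}, HANDLES_PATH)

# Training worker side: rebuild tensors that alias vLLM's live GPU memory.
def import_handles(path=HANDLES_PATH):
    blob = torch.load(path)
    return {name: rebuild(*args) for name, (rebuild, args) in blob["handles"].items()}
```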
### Weight layout mapping (vLLM → HF)
```
vLLM merged HF separate (views)
───────────────────────── ──────────────────────
in_proj_qkvz [16384, 5120] → in_proj_qkv [10240, 5120]
in_proj_z [6144, 5120]
in_proj_ba [96, 5120] → in_proj_b [48, 5120]
in_proj_a [48, 5120]
qkv_proj [14336, 5120] → q_proj [12288, 5120]
k_proj [1024, 5120]
v_proj [1024, 5120]
gate_up_proj [34816, 5120] → gate_proj [17408, 5120]
up_proj [17408, 5120]
```
All views share GPU storage with vLLM — zero copies.
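For example, one layer's attention projections could be materialized as narrowed views of the merged vLLM tensor (dictionary key hypothetical, sizes from the table above):

```
qkv = vllm_weights["qkv_proj.weight"]        # [14336, 5120], lives in vLLM's GPU memory
q_proj = qkv.narrow(0, 0, 12288)             # [12288, 5120]
k_proj = qkv.narrow(0, 12288, 1024)          # [1024, 5120]
v_proj = qkv.narrow(0, 13312, 1024)          # [1024, 5120]
assert q_proj.data_ptr() == qkv.data_ptr()   # views share storage, no copies
```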
## Checkpointing
**In-place sync** — mmap the model's safetensors files, compare
against live GPU weights block by block, memcpy only changed
regions. For small behavioral updates, this turns a 54GB write into
a few hundred MB.
- Scheduled 10 minutes after training (batched)
- Daily rsync to moria for long-term storage
- Tool: `apollo-checkpoint sync --model-dir <path>`
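A minimal sketch of the in-place sync described above, working directly against the safetensors layout (8-byte little-endian header length, JSON header with per-tensor data_offsets, then raw data); checkpoint_sync.py may handle dtypes, sharding, and atomicity differently:

```
import json, mmap, struct
import torch

BLOCK = 16 << 20  # 16 MB compare/copy granularity (assumed)

def sync_file(path, live_params):
    # Overwrite only the byte ranges of a .safetensors file that differ
    # from the live GPU weights.
    with open(path, "r+b") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        mm = mmap.mmap(f.fileno(), 0)
        for name, meta in header.items():
            if name == "__metadata__" or name not in live_params:
                continue
            start, end = meta["data_offsets"]
            live = live_params[name].detach().contiguous()
            raw = bytes(live.view(torch.uint8).cpu().numpy())  # raw bytes, dtype-agnostic
            for off in range(0, end - start, BLOCK):
                lo = data_start + start + off
                hi = min(lo + BLOCK, data_start + end)
                chunk = raw[off:off + (hi - lo)]
                if mm[lo:hi] != chunk:                         # copy only changed blocks
                    mm[lo:hi] = chunk
        mm.flush()
```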
## State Files
### B200 (training server)
| File | Purpose |
|------|---------|
| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by training_worker to construct HF model with vLLM weight views. Includes metadata (model_path). |
| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync and on worker shutdown, restored on next training_worker startup. Preserves training continuity across sessions. |
| `/tmp/apollo_training.sock` | ZMQ IPC socket for communication between API server (/train endpoint) and training_worker subprocess. |
| `<model_dir>/*.safetensors` | Model weights. Updated in-place by checkpoint_sync. |
### Moria (client)
| File | Purpose |
|------|---------|
| `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
| `~/.consciousness/cache/finetune-alternates` | Marker file. If it exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
### In-memory (training_worker subprocess)
| State | Location | Notes |
|-------|----------|-------|
| Apollo optimizer | TrainingWorker.optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync and on shutdown. |
| HF model with vLLM views | TrainingWorker.model | Loaded on worker startup from IPC handles. Parameters point to vLLM's GPU memory. |
| ZMQ socket | TrainingWorker.zmq_socket | REP socket bound to `/tmp/apollo_training.sock`. |
## Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
| Rank | 64 | Captures gradient structure. ~2.5GB state. Defined in `train_router.DEFAULT_RANK`. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
| Weight decay | 0 | Diversity provides natural regularization. |
| Warmup | 10% of steps | Standard cosine schedule. |
| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
## Components
### Built ✓
- `optimizer.py` — Apollo optimizer (configurable rank)
- `train_router.py` — /train endpoint, forwards to training subprocess via ZMQ
- `training_worker.py` — training subprocess (HF model, Apollo, checkpoint sync)
- `weight_mapping.py` — vLLM merged → HF separate views (validated)
- `export_hook.py` — vLLM plugin hook for IPC handle export
- `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)
### To build
- **Dream loop → training bridge**: connect dream output to /train
- **Training-signal agent**: flags moments in conversation logs
- **Instruction stripping**: remove scaffolding from training examples
- **Quality monitoring**: track model capability over time
## Files
```
training/
DESIGN.md — this document
pyproject.toml — package config, vLLM plugin entry point
apollo_plugin/
__init__.py — plugin registration
export_hook.py — patches vLLM worker to export IPC handles
train_router.py — /train endpoint, forwards to worker via ZMQ
training_worker.py — training subprocess (HF model, Apollo, checkpoint)
optimizer.py — Apollo optimizer
weight_mapping.py — vLLM ↔ HF weight views
checkpoint_sync.py — mmap + diff sync to safetensors
steering.py — steering vector extraction (experimental)
```