The grand unified view: every technique we're using (Apollo, context-frozen,
diversity, small steps, two-stage memory, dream loop) addresses the
stability-plasticity dilemma at a DIFFERENT scale. They're orthogonal,
complementary defenses. Together they suggest we can run a higher learning
rate (1e-4) than typical fine-tuning allows, because the multi-scale defense
compensates.
The dream loop is the keystone connecting all scales. Architecture converges
with neuroscience because the problem has the same structure regardless of
substrate.
Two more deep dives:
- Dreaming as diffusion: the dream loop IS a generative process.
Memory graph as latent space, temperature as noise level, training
as denoising. Connects to policy gradient / filtered behavioral
cloning. The dream loop generates scenarios at the edge of the
model's capability — the boundary where learning happens.
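The "temperature as noise level, keep what's at the edge" idea can be made concrete in a toy sketch. Everything here is hypothetical scaffolding (the `model_score` callback, the scalar "scenario" representation, the 0.2–0.8 band): it only illustrates sampling at several noise levels and filtering for the capability boundary, as filtered behavioral cloning would.

```python
import random

def dream_step(model_score, scenarios, temperatures=(0.3, 0.7, 1.1)):
    """Toy sketch: perturb base scenarios at several 'noise levels'
    (temperatures) and keep only those at the edge of capability,
    i.e. neither trivially solved nor hopeless.

    model_score: callable mapping a scenario to a success probability.
    """
    kept = []
    for t in temperatures:
        for s in scenarios:
            # Higher temperature = stronger perturbation of the base scenario.
            noisy = s + random.gauss(0.0, t)
            p = model_score(noisy)
            # The learning boundary: partial success only.
            if 0.2 < p < 0.8:
                kept.append((noisy, p))
    return kept
```

Training on `kept` rather than on everything generated is what makes the loop a filtered-behavioral-cloning analogue of denoising: the filter plays the role of the data distribution's score.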
- Hippocampal replay: our architecture converges with the brain's
two-stage memory system. Fast learning (context window) → slow
learning (weights) via compressed replay (context-frozen training)
with emotional prioritization (training-signal agent) and
interleaved replay (diverse training data prevents forgetting).
We didn't design from neuroscience — we converged on it.
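The interleaved-replay point reduces to a batching discipline: never let a gradient step see only the new distribution. A minimal sketch (function name, batch size, and the 50/50 mix are illustrative assumptions, not our actual pipeline):

```python
import random

def interleaved_batch(new_items, replay_buffer, batch_size=8, new_frac=0.5):
    """Toy sketch of interleaved replay: each training batch mixes freshly
    generated items with a random draw from long-term memory, so updates
    on new behavior are always interleaved with old data (the
    complementary-learning-systems recipe against forgetting)."""
    n_new = min(len(new_items), int(batch_size * new_frac))
    n_old = min(len(replay_buffer), batch_size - n_new)
    batch = random.sample(new_items, n_new) + random.sample(replay_buffer, n_old)
    random.shuffle(batch)
    return batch
```

The emotional-prioritization analogue would simply replace `random.sample` over the buffer with sampling weighted by the training-signal agent's scores.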
Two deep dives following curiosity:
- Why context-frozen training works: gradient flows through W_q (query
projection) even when context KVs are frozen. Model learns to LOOK AT
context differently, not represent it differently. This is exactly what
behavioral fine-tuning needs.
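The W_q claim is easy to verify mechanically. A minimal single-head PyTorch sketch (shapes and the one-head attention are illustrative, not our training code): the context KVs are computed once and detached, yet the loss still back-propagates into the query projection.

```python
import torch

torch.manual_seed(0)
d = 16
W_q = torch.nn.Linear(d, d, bias=False)   # trainable query projection
W_k = torch.nn.Linear(d, d, bias=False)   # context-side projections,
W_v = torch.nn.Linear(d, d, bias=False)   # frozen via detach below

ctx = torch.randn(10, d)                  # "context window" tokens
K, V = W_k(ctx).detach(), W_v(ctx).detach()  # frozen context KVs

x = torch.randn(3, d)                     # tokens being trained on
q = W_q(x)
attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
out = attn @ V

out.sum().backward()
print(W_q.weight.grad is not None)        # True: gradient reaches W_q
print(W_k.weight.grad)                    # None: context reps untouched
```

The gradient can only reshape how queries address the frozen keys, which is exactly "look at context differently, not represent it differently."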
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
better generalization. The coarseness of channel-/tensor-wise scaling
prevents overfitting to specific training examples. For behavioral
fine-tuning, this means learning 'accept direction' rather than
'accept this specific phrasing.'
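For reference, one standard way to formalize "directional sharpness" (this is the textbook second-order view, not a quotation from the paper): expand the loss along the update direction d and read off the curvature term.

```latex
% One update step of size \eta along direction d, to second order:
L(\theta - \eta d) \approx L(\theta) - \eta\,\nabla L(\theta)^{\top} d
  + \frac{\eta^{2}}{2}\, d^{\top} \nabla^{2} L(\theta)\, d,
\qquad
S(\theta, d) = \frac{d^{\top} \nabla^{2} L(\theta)\, d}{\lVert d \rVert^{2}}.
```

Lower S along the optimizer's actual update directions means the surface is flatter where the optimizer walks, which is the flat-minima generalization argument in one line.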
Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for the systematic
  ratio between compact-space and original-space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel
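The five corrections above compose into one update rule. A toy NumPy sketch for a single (n × m) gradient, assuming a Gaussian random projection and Adam-style compact states; names, shapes, and the per-column scaling heuristic are illustrative, not the paper's reference implementation:

```python
import numpy as np

def apollo_like_update(G, state, step, rank=4, beta1=0.9, beta2=0.999,
                       gamma=1.01, eps=1e-8, refresh_every=200):
    """Toy APOLLO-style update for one (n x m) gradient G.
    rank > 1 here, so scaling is channel-wise (per column);
    rank = 1 would collapse to a single tensor-wise scale."""
    n, _ = G.shape
    if step % refresh_every == 0:                  # refresh projection every
        state["P"] = np.random.randn(rank, n) / np.sqrt(rank)  # 200 steps
    R = state["P"] @ G                             # rank x m compact gradient
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * R
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * R**2
    Rt = state["m"] / (np.sqrt(state["v"]) + eps)  # Adam-like compact update
    # Channel-wise scales from the compact space, applied as G·diag(s),
    # so each column of G keeps its direction and only changes magnitude.
    s = np.linalg.norm(Rt, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    U = np.sqrt(n / rank) * (G @ np.diag(s))       # alpha = sqrt(n/r) rescale
    # Norm-growth limiter: cap update-norm growth at gamma per step
    # to prevent early-training loss spikes.
    norm = np.linalg.norm(U)
    prev = state.get("prev_norm", norm)
    if norm > gamma * prev:
        U *= gamma * prev / norm
    state["prev_norm"] = np.linalg.norm(U)
    return U
```

Note the key property: only r × m optimizer state per layer instead of 2 × n × m, while the raw gradient G itself is never projected into the update, only rescaled.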
Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW with zero extra optimizer-state memory)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case