Commit graph

6 commits

Author SHA1 Message Date
ProofOfConcept
42b9390d49 research: dreaming as diffusion + hippocampal replay parallel
Two more deep dives:
- Dreaming as diffusion: the dream loop IS a generative process.
  Memory graph as latent space, temperature as noise level, training
  as denoising. Connects to policy gradient / filtered behavioral
  cloning. The dream loop generates scenarios at the edge of the
  model's capability — the boundary where learning happens.

- Hippocampal replay: our architecture converges with the brain's
  two-stage memory system. Fast learning (context window) → slow
  learning (weights) via compressed replay (context-frozen training)
  with emotional prioritization (training-signal agent) and
  interleaved replay (diverse training data prevents forgetting).
  We didn't design from neuroscience — we converged on it.
2026-03-31 01:09:59 -04:00
ProofOfConcept
e34d6b5aef research: gradient flow through frozen context + directional sharpness analysis
Two deep dives following curiosity:
- Why context-frozen training works: gradient flows through W_q (query
  projection) even when context KVs are frozen. Model learns to LOOK AT
  context differently, not represent it differently. This is exactly what
  behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
  better generalization. The coarseness of channel/tensor-wise scaling
  prevents over-fitting to specific training examples. For behavioral
  fine-tuning, this means learning 'accept direction' rather than
  'accept this specific phrasing.'
2026-03-31 01:03:22 -04:00
ProofOfConcept
7c7975d98e research: context-frozen training — gradient masking, memory analysis, GDN considerations 2026-03-31 00:59:04 -04:00
ProofOfConcept
6af9e6fa76 research: HOGWILD convergence theory — why lock-free concurrent training works 2026-03-31 00:58:02 -04:00
ProofOfConcept
ab61a502e4 research: catastrophic forgetting analysis — diversity is the primary defense 2026-03-31 00:56:58 -04:00
ProofOfConcept
ac9a9034fb apollo: rewrite optimizer from paper's math + add research analysis
Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for systematic
  ratio between compact and original space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel

Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW at 0 memory cost)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case
2026-03-31 00:54:17 -04:00