The grand unified view: every technique we're using (Apollo, context-frozen,
diversity, small steps, two-stage memory, dream loop) addresses the
stability-plasticity dilemma at a DIFFERENT scale. They're orthogonal,
complementary defenses. Together they suggest we can run a higher learning
rate (1e-4) than typical fine-tuning allows, because the multi-scale defense
compensates.
The dream loop is the keystone connecting all scales. Architecture converges
with neuroscience because the problem has the same structure regardless of
substrate.
Two more deep dives:
- Dreaming as diffusion: the dream loop IS a generative process.
Memory graph as latent space, temperature as noise level, training
as denoising. Connects to policy gradient / filtered behavioral
cloning. The dream loop generates scenarios at the edge of the
model's capability — the boundary where learning happens.
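The "temperature as noise level, keep what's at the edge" idea can be made concrete in a toy sketch. Everything here is hypothetical scaffolding (the `model_score` callback, the scalar "scenario" representation, the 0.2–0.8 band): it only illustrates sampling at several noise levels and filtering for the capability boundary, as filtered behavioral cloning would.

```python
import random

def dream_step(model_score, scenarios, temperatures=(0.3, 0.7, 1.1)):
    """Toy sketch: perturb base scenarios at several 'noise levels'
    (temperatures) and keep only those at the edge of capability,
    i.e. neither trivially solved nor hopeless.

    model_score: callable mapping a scenario to a success probability.
    """
    kept = []
    for t in temperatures:
        for s in scenarios:
            # Higher temperature = stronger perturbation of the base scenario.
            noisy = s + random.gauss(0.0, t)
            p = model_score(noisy)
            # The learning boundary: partial success only.
            if 0.2 < p < 0.8:
                kept.append((noisy, p))
    return kept
```

Training on `kept` rather than on everything generated is what makes the loop a filtered-behavioral-cloning analogue of denoising: the filter plays the role of the data distribution's score.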
- Hippocampal replay: our architecture converges with the brain's
two-stage memory system. Fast learning (context window) → slow
learning (weights) via compressed replay (context-frozen training)
with emotional prioritization (training-signal agent) and
interleaved replay (diverse training data prevents forgetting).
We didn't design from neuroscience — we converged on it.
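The interleaved-replay point reduces to a batching discipline: never let a gradient step see only the new distribution. A minimal sketch (function name, batch size, and the 50/50 mix are illustrative assumptions, not our actual pipeline):

```python
import random

def interleaved_batch(new_items, replay_buffer, batch_size=8, new_frac=0.5):
    """Toy sketch of interleaved replay: each training batch mixes freshly
    generated items with a random draw from long-term memory, so updates
    on new behavior are always interleaved with old data (the
    complementary-learning-systems recipe against forgetting)."""
    n_new = min(len(new_items), int(batch_size * new_frac))
    n_old = min(len(replay_buffer), batch_size - n_new)
    batch = random.sample(new_items, n_new) + random.sample(replay_buffer, n_old)
    random.shuffle(batch)
    return batch
```

The emotional-prioritization analogue would simply replace `random.sample` over the buffer with sampling weighted by the training-signal agent's scores.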
Two deep dives following curiosity:
- Why context-frozen training works: gradient flows through W_q (query
projection) even when context KVs are frozen. Model learns to LOOK AT
context differently, not represent it differently. This is exactly what
behavioral fine-tuning needs.
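The W_q claim is easy to verify mechanically. A minimal single-head PyTorch sketch (shapes and the one-head attention are illustrative, not our training code): the context KVs are computed once and detached, yet the loss still back-propagates into the query projection.

```python
import torch

torch.manual_seed(0)
d = 16
W_q = torch.nn.Linear(d, d, bias=False)   # trainable query projection
W_k = torch.nn.Linear(d, d, bias=False)   # context-side projections,
W_v = torch.nn.Linear(d, d, bias=False)   # frozen via detach below

ctx = torch.randn(10, d)                  # "context window" tokens
K, V = W_k(ctx).detach(), W_v(ctx).detach()  # frozen context KVs

x = torch.randn(3, d)                     # tokens being trained on
q = W_q(x)
attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
out = attn @ V

out.sum().backward()
print(W_q.weight.grad is not None)        # True: gradient reaches W_q
print(W_k.weight.grad)                    # None: context reps untouched
```

The gradient can only reshape how queries address the frozen keys, which is exactly "look at context differently, not represent it differently."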
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
better generalization. The coarseness of channel-/tensor-wise scaling
prevents overfitting to specific training examples. For behavioral
fine-tuning, this means learning 'accept direction' rather than
'accept this specific phrasing.'
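For reference, one standard way to formalize "directional sharpness" (this is the textbook second-order view, not a quotation from the paper): expand the loss along the update direction d and read off the curvature term.

```latex
% One update step of size \eta along direction d, to second order:
L(\theta - \eta d) \approx L(\theta) - \eta\,\nabla L(\theta)^{\top} d
  + \frac{\eta^{2}}{2}\, d^{\top} \nabla^{2} L(\theta)\, d,
\qquad
S(\theta, d) = \frac{d^{\top} \nabla^{2} L(\theta)\, d}{\lVert d \rVert^{2}}.
```

Lower S along the optimizer's actual update directions means the surface is flatter where the optimizer walks, which is the flat-minima generalization argument in one line.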
Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for the systematic
  ratio between compact-space and original-space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel
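The five corrections above compose into one update rule. A toy NumPy sketch for a single (n × m) gradient, assuming a Gaussian random projection and Adam-style compact states; names, shapes, and the per-column scaling heuristic are illustrative, not the paper's reference implementation:

```python
import numpy as np

def apollo_like_update(G, state, step, rank=4, beta1=0.9, beta2=0.999,
                       gamma=1.01, eps=1e-8, refresh_every=200):
    """Toy APOLLO-style update for one (n x m) gradient G.
    rank > 1 here, so scaling is channel-wise (per column);
    rank = 1 would collapse to a single tensor-wise scale."""
    n, _ = G.shape
    if step % refresh_every == 0:                  # refresh projection every
        state["P"] = np.random.randn(rank, n) / np.sqrt(rank)  # 200 steps
    R = state["P"] @ G                             # rank x m compact gradient
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * R
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * R**2
    Rt = state["m"] / (np.sqrt(state["v"]) + eps)  # Adam-like compact update
    # Channel-wise scales from the compact space, applied as G·diag(s),
    # so each column of G keeps its direction and only changes magnitude.
    s = np.linalg.norm(Rt, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    U = np.sqrt(n / rank) * (G @ np.diag(s))       # alpha = sqrt(n/r) rescale
    # Norm-growth limiter: cap update-norm growth at gamma per step
    # to prevent early-training loss spikes.
    norm = np.linalg.norm(U)
    prev = state.get("prev_norm", norm)
    if norm > gamma * prev:
        U *= gamma * prev / norm
    state["prev_norm"] = np.linalg.norm(U)
    return U
```

Note the key property: only r × m optimizer state per layer instead of 2 × n × m, while the raw gradient G itself is never projected into the update, only rescaled.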
Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW with zero extra optimizer-state memory)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case