From 6af9e6fa76c6a65480e0e8a8642c59fee6f7dd31 Mon Sep 17 00:00:00 2001 From: ProofOfConcept Date: Tue, 31 Mar 2026 00:58:02 -0400 Subject: [PATCH] =?UTF-8?q?research:=20HOGWILD=20convergence=20theory=20?= =?UTF-8?q?=E2=80=94=20why=20lock-free=20concurrent=20training=20works?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- training/research/hogwild-convergence.md | 114 +++++++++++++++++++++++ 1 file changed, 114 insertions(+) create mode 100644 training/research/hogwild-convergence.md diff --git a/training/research/hogwild-convergence.md b/training/research/hogwild-convergence.md new file mode 100644 index 0000000..568dd13 --- /dev/null +++ b/training/research/hogwild-convergence.md @@ -0,0 +1,114 @@ +# HOGWILD! Convergence Theory + +Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing +Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730) + +## The Setup + +Multiple processors read and write to shared memory containing model +parameters, with NO locks, NO synchronization. Each processor: +1. Reads the current parameter vector (possibly stale) +2. Samples a training example +3. Computes a gradient +4. Writes the update directly to shared memory + +Other processors may have modified the parameters between steps 1 and 4. +The reads are inconsistent (reading a mix of old and new values). The +writes may overwrite each other. + +## Why It Works + +### Sparsity condition + +The key assumption: each gradient update only modifies a small fraction +of the total parameters. Formally, if the gradient of example e has +support set s(e) (the non-zero entries), then the updates are sparse +if |s(e)| << total parameters. + +When updates are sparse, the probability of two concurrent updates +modifying the same parameter is low. Most of the time, the processors +are writing to different parameters, so no information is lost. + +### Our case: even better than HOGWILD + +HOGWILD assumes multiple writers competing. 
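
A toy sketch of that multi-writer pattern (illustrative Python; all names here are ours, and CPython's GIL serializes the individual writes, so this demonstrates the lock-free access pattern rather than true parallel memory traffic):

```python
import random
import threading

N_PARAMS = 1000
LR = 0.1
w = [0.0] * N_PARAMS  # shared parameters: no lock anywhere

def sparse_examples(n, k=3, seed=0):
    # Each example activates only k coordinates (HOGWILD's sparsity
    # condition); its gradient is zero everywhere else.
    rng = random.Random(seed)
    return [[rng.randrange(N_PARAMS) for _ in range(k)] for _ in range(n)]

def worker(examples, target=1.0):
    for coords in examples:
        for j in coords:
            # Read a possibly stale value, write back unsynchronized.
            grad = w[j] - target  # gradient of 0.5 * (w[j] - target)^2
            w[j] -= LR * grad

examples = sparse_examples(20000)
threads = [threading.Thread(target=worker, args=(examples[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each example touches only 3 of 1,000 coordinates, concurrent workers rarely write the same parameter, and every coordinate still converges toward its target of 1.0.
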
We have:
- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is that the reader sees a partially updated parameter tensor during
inference. But:

1. GPU writes are coalesced — a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially applied update is within rounding error of either the
   old or the new value
4. One slightly noisy token in a conversation is invisible

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD converges at the rate

```
E[f(x_t) - f*] ≤ O(1/t)
```

where t is the number of updates. The constants depend on how often
concurrent updates conflict; with sparse updates and few processors,
conflicts are rare and the constants approach those of serial SGD.

Niu et al. describe this rate as nearly optimal: it is within a
constant factor of serial SGD, so the parallelism costs almost nothing
in terms of convergence quality.

## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example generates meaningful gradients for only a small fraction of
the 27B parameters).

### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially updated model. The next token uses the fully
updated model. Errors do not accumulate — each inference step is
independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses and never notices.

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: only a small fraction of the
model is "active" for any given short sequence, so most weights
receive near-zero gradients. This satisfies HOGWILD's sparsity
condition without any special engineering.

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel rather than per element, the updates have additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update's norm is bounded to be at most 1%
larger than the previous update's norm.

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
- The term "HOGWILD" comes from running SGD "hog wild" — completely
  uncontrolled parallelism
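
## Appendix: sketch of the norm-growth limiter

As an illustration of the limiter described above, here is a minimal sketch, assuming it rescales each step's update so that its L2 norm is at most γ times the previous update's norm. The function name and signature are ours, not Apollo's actual code.

```python
import math

def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Return (clipped_update, its_norm). If the raw update's norm
    exceeds gamma * prev_norm, rescale it down to that bound.
    prev_norm=None (first step) passes the update through unchanged."""
    norm = math.sqrt(sum(u * u for u in update))
    if prev_norm is not None and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        update = [u * scale for u in update]
        norm = gamma * prev_norm
    return update, norm

# Example: a sudden 10x spike in update magnitude gets clamped to +1%.
u1, n1 = limit_norm_growth([0.1, 0.0], prev_norm=None)  # first step passes through
u2, n2 = limit_norm_growth([1.0, 0.0], prev_norm=n1)    # spike gets clamped
```

A gradual ramp (each step ≤ 1% larger than the last) passes untouched, while any abrupt jump is scaled back, which is exactly the property that keeps a single training step from destabilizing concurrent inference.
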