research: HOGWILD convergence theory — why lock-free concurrent training works
parent ab61a502e4
commit 6af9e6fa76
1 changed file with 114 additions and 0 deletions: training/research/hogwild-convergence.md (new file)
# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)

## The Setup

Multiple processors read and write to shared memory containing model
parameters, with NO locks, NO synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may have modified the parameters between steps 1 and 4.
The reads are inconsistent (reading a mix of old and new values). The
writes may overwrite each other.
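
The four steps above can be sketched in a few lines. This is an illustrative toy (a sparse least-squares objective with made-up data), not the paper's implementation: four threads hammer one shared NumPy array with no locks, and each update touches only the small support of its example.

```python
import threading
import numpy as np

# Minimal lock-free SGD sketch (illustrative, not the paper's code):
# four threads apply sparse updates to one shared parameter vector.
DIM = 1_000
w = np.zeros(DIM)                       # shared parameters, no lock
rng = np.random.default_rng(0)

# Each example touches only 5 of 1000 coordinates: its support s(e).
examples = [(rng.choice(DIM, size=5, replace=False), float(rng.normal()))
            for _ in range(2_000)]

def worker(chunk, lr=0.1):
    for idx, target in chunk:
        local = w[idx]                  # 1. read (possibly stale) values
        err = local.sum() - target      # 2-3. gradient of 0.5*(pred - y)^2
        w[idx] -= lr * err              # 4. lock-free write to shared memory

threads = [threading.Thread(target=worker, args=(examples[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with the threads racing on `w`, the training loss over the examples drops well below its starting value, because two updates rarely touch the same coordinates.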

## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (the indices of its non-zero entries), then the updates
are sparse if |s(e)| << total parameters.

When updates are sparse, the probability of two concurrent updates
modifying the same parameter is low. Most of the time, the processors
are writing to different parameters, so no information is lost.
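
The "low probability" claim is simple combinatorics. A small illustrative calculation (the function name and the d, k values are ours): the chance that two updates, each touching k random coordinates out of d, share a coordinate.

```python
from math import comb

# Probability that two independent size-k supports, drawn from d
# coordinates, share at least one index.
def collision_prob(d, k):
    return 1.0 - comb(d - k, k) / comb(d, k)

# Sparse regime (k << d): conflicts are rare.
sparse = collision_prob(10_000, 10)     # ≈ 0.01
# Dense regime: conflicts dominate.
dense = collision_prob(100, 10)         # ≈ 0.67
```

Roughly, the collision probability scales like k²/d, so keeping updates sparse relative to the parameter count is what makes lock-free writes nearly conflict-free.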

### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:

1. GPU writes are coalesced — a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or new value
4. One slightly noisy token in a conversation is invisible
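
Points 2 and 3 can be checked numerically. A small demo with illustrative magnitudes (the sizes and lr value are assumptions matching the estimate above): a torn read that mixes old and new entries deviates from both full vectors by at most one update's size.

```python
import numpy as np

# Torn-read magnitude check: an SGD step at lr = 1e-5 on O(1) gradients
# moves each weight so little that a read mixing old and new entries
# stays within one step of both the old and the new parameter vectors.
rng = np.random.default_rng(0)
old = rng.normal(size=1024)             # pre-update parameters
grad = rng.normal(size=1024)            # scaled gradient, O(1) entries
lr = 1e-5
new = old - lr * grad

torn = np.concatenate([old[:512], new[512:]])   # simulated mid-write read
max_dev_old = np.max(np.abs(torn - old))
max_dev_new = np.max(np.abs(torn - new))
# Both deviations are bounded by a single update: lr * max|grad| ≈ 3e-5.
```

So the reader never sees anything worse than a model one tiny step away from a consistent one.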

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD achieves the convergence rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.

The paper's "nearly optimal" claim means the convergence rate is within
a constant factor of serial SGD's — the parallelism costs almost nothing
in terms of convergence quality.
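
A toy illustration of why staleness barely hurts (this is a sketch on a 1-D quadratic, not the paper's proof): gradient descent on f(x) = 0.5·x² still converges geometrically when every step uses a parameter value that is several steps old, which is the kind of inconsistency lock-free reads introduce.

```python
# Gradient descent with a fixed read delay on f(x) = 0.5*x^2.
# delay=0 is ordinary serial GD; delay>0 mimics stale lock-free reads.
def stale_gd(delay, steps=300, lr=0.05, x0=1.0):
    xs = [x0]
    for _ in range(steps):
        stale = xs[max(0, len(xs) - 1 - delay)]   # value `delay` steps old
        xs.append(xs[-1] - lr * stale)            # grad of 0.5*x^2 is x
    return abs(xs[-1])
```

With a small step size, both the serial run and the stale run drive x to essentially zero; the delay only changes the constant, not whether it converges.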

## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example only generates meaningful gradients for a small fraction of
the 27B parameters).

### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model. The next token uses the fully
updated model. There is no accumulation of error — each inference step
is independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses, never notices.

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients. Most weights receive near-zero
gradients because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
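
If we wanted to verify this empirically, one hypothetical helper (the name and threshold are ours, not the codebase's) is to measure what fraction of a gradient tensor is effectively zero:

```python
import numpy as np

# Hypothetical measurement helper: the fraction of gradient entries
# whose magnitude is at or below `threshold`, i.e. how sparse an
# update really is.
def effective_sparsity(grad, threshold=1e-8):
    return float(np.mean(np.abs(grad) <= threshold))

# A short decision segment concentrates gradient in a small active slice:
g = np.zeros(10_000)
g[:50] = 0.3        # only 0.5% of entries carry real gradient
```

Values near 1.0 confirm the update qualifies as sparse in HOGWILD's sense.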

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ=1.01) also helps — it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update is bounded so that the new parameter
norm is at most 1% larger than the old one.
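
A sketch of one way such a limiter could work, under the assumption that γ caps the per-step parameter-norm growth ratio (this is our reading of the rule, not Apollo's actual code):

```python
import numpy as np

# Norm-growth limiter sketch (assumed semantics, not Apollo's code):
# rescale the proposed parameters so the new norm is at most
# gamma = 1.01 times the old norm.
def limit_norm_growth(w_old, w_new, gamma=1.01):
    old_norm = np.linalg.norm(w_old)
    new_norm = np.linalg.norm(w_new)
    if new_norm <= gamma * old_norm:
        return w_new                                 # within budget
    return w_new * (gamma * old_norm / new_norm)     # cap the growth
```

Updates that stay within the 1% budget pass through untouched; anything larger is shrunk to the boundary, so no single step can move the weights far enough to destabilize a concurrent inference read.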

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild" — completely
  uncontrolled parallelism