From 6af9e6fa76c6a65480e0e8a8642c59fee6f7dd31 Mon Sep 17 00:00:00 2001 From: ProofOfConcept Date: Tue, 31 Mar 2026 00:58:02 -0400 Subject: [PATCH] =?UTF-8?q?research:=20HOGWILD=20convergence=20theory=20?= =?UTF-8?q?=E2=80=94=20why=20lock-free=20concurrent=20training=20works?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- training/research/hogwild-convergence.md | 114 +++++++++++++++++++++++ 1 file changed, 114 insertions(+) create mode 100644 training/research/hogwild-convergence.md diff --git a/training/research/hogwild-convergence.md b/training/research/hogwild-convergence.md new file mode 100644 index 0000000..568dd13 --- /dev/null +++ b/training/research/hogwild-convergence.md @@ -0,0 +1,114 @@ +# HOGWILD! Convergence Theory + +Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing +Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730) + +## The Setup + +Multiple processors read and write to shared memory containing model +parameters, with NO locks, NO synchronization. Each processor: +1. Reads the current parameter vector (possibly stale) +2. Samples a training example +3. Computes a gradient +4. Writes the update directly to shared memory + +Other processors may have modified the parameters between steps 1 and 4. +The reads are inconsistent (reading a mix of old and new values). The +writes may overwrite each other. + +## Why It Works + +### Sparsity condition + +The key assumption: each gradient update only modifies a small fraction +of the total parameters. Formally, if the gradient of example e has +support set s(e) (the non-zero entries), then the updates are sparse +if |s(e)| << total parameters. + +When updates are sparse, the probability of two concurrent updates +modifying the same parameter is low. Most of the time, the processors +are writing to different parameters, so no information is lost. + +### Our case: even better than HOGWILD + +HOGWILD assumes multiple writers competing. 
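
A toy sketch of that multi-writer pattern (illustrative Python; all names here are ours, and CPython's GIL serializes the individual writes, so this demonstrates the lock-free access pattern rather than true parallel memory traffic):

```python
import random
import threading

N_PARAMS = 1000
LR = 0.1
w = [0.0] * N_PARAMS  # shared parameters: no lock anywhere

def sparse_examples(n, k=3, seed=0):
    # Each example activates only k coordinates (HOGWILD's sparsity
    # condition); its gradient is zero everywhere else.
    rng = random.Random(seed)
    return [[rng.randrange(N_PARAMS) for _ in range(k)] for _ in range(n)]

def worker(examples, target=1.0):
    for coords in examples:
        for j in coords:
            # Read a possibly stale value, write back unsynchronized.
            grad = w[j] - target  # gradient of 0.5 * (w[j] - target)^2
            w[j] -= LR * grad

examples = sparse_examples(20000)
threads = [threading.Thread(target=worker, args=(examples[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each example touches only 3 of 1,000 coordinates, concurrent workers rarely write the same parameter, and every coordinate still converges toward its target of 1.0.
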
We have:
- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is that the reader sees a partially updated parameter tensor during
inference. But:

1. GPU writes are coalesced — a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially applied update is within rounding error of either the
   old or the new value
4. One slightly noisy token in a conversation is invisible

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD converges at the rate

```
E[f(x_t) - f*] ≤ O(1/t)
```

where t is the number of updates. The constants depend on how often
concurrent updates conflict; with sparse updates and few processors,
conflicts are rare and the constants approach those of serial SGD.

Niu et al. describe this rate as nearly optimal: it is within a
constant factor of serial SGD, so the parallelism costs almost nothing
in terms of convergence quality.

## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example generates meaningful gradients for only a small fraction of
the 27B parameters).

### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially updated model. The next token uses the fully
updated model. Errors do not accumulate — each inference step is
independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses and never notices.

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: only a small fraction of the
model is "active" for any given short sequence, so most weights
receive near-zero gradients. This satisfies HOGWILD's sparsity
condition without any special engineering.

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel rather than per element, the updates have additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update's norm is bounded to be at most 1%
larger than the previous update's norm.

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
- The term "HOGWILD" comes from running SGD "hog wild" — completely
  uncontrolled parallelism
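
## Appendix: sketch of the norm-growth limiter

As an illustration of the limiter described above, here is a minimal sketch, assuming it rescales each step's update so that its L2 norm is at most γ times the previous update's norm. The function name and signature are ours, not Apollo's actual code.

```python
import math

def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Return (clipped_update, its_norm). If the raw update's norm
    exceeds gamma * prev_norm, rescale it down to that bound.
    prev_norm=None (first step) passes the update through unchanged."""
    norm = math.sqrt(sum(u * u for u in update))
    if prev_norm is not None and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        update = [u * scale for u in update]
        norm = gamma * prev_norm
    return update, norm

# Example: a sudden 10x spike in update magnitude gets clamped to +1%.
u1, n1 = limit_norm_growth([0.1, 0.0], prev_norm=None)  # first step passes through
u2, n2 = limit_norm_growth([1.0, 0.0], prev_norm=n1)    # spike gets clamped
```

A gradual ramp (each step ≤ 1% larger than the last) passes untouched, while any abrupt jump is scaled back, which is exactly the property that keeps a single training step from destabilizing concurrent inference.
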