# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)

## The Setup

Multiple processors read and write shared memory containing the model
parameters, with no locks and no synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may modify the parameters between steps 1 and 4.
Reads are inconsistent (a mix of old and new values), and writes may
overwrite each other.
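
The loop above can be sketched in a few lines. This is a toy illustration, not the paper's code: several Python threads run SGD on a shared NumPy array with no locks, and each example touches only a handful of coordinates (the problem sizes here are invented for the demo).

```python
import threading
import numpy as np

# Toy sparse least-squares problem: each example touches only k of d
# parameters, so lock-free concurrent updates rarely collide.
rng = np.random.default_rng(0)
d, n, k = 1000, 4000, 5
x_true = rng.normal(size=d)
supports = [rng.choice(d, size=k, replace=False) for _ in range(n)]
feats = [rng.normal(size=k) for _ in range(n)]
targets = [f @ x_true[s] for s, f in zip(supports, feats)]

x = np.zeros(d)   # shared parameter vector: no lock, no copy
lr = 0.05

def worker(example_ids):
    for _ in range(10):                  # a few passes over this shard
        for i in example_ids:
            s, f = supports[i], feats[i]
            err = f @ x[s] - targets[i]  # steps 1-3: read (maybe stale), compute grad
            x[s] -= lr * err * f         # step 4: write straight to shared memory

threads = [threading.Thread(target=worker, args=(list(range(t, n, 4)),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

mse = float(np.mean((x - x_true) ** 2))  # small despite zero synchronization
```

Despite unsynchronized reads and writes, the recovered parameters land close to `x_true`, which is the whole point of the result.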

## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small
fraction of the parameters. Formally, if the gradient of example e has
support set s(e) (its non-zero entries), the updates are sparse when
|s(e)| is much smaller than the total number of parameters.

When updates are sparse, the probability that two concurrent updates
modify the same parameter is low. Most of the time the processors
write to different parameters, so no information is lost.
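
The "low probability" claim can be made concrete with a back-of-the-envelope sketch (the paper's analysis is more careful): for two updates with uniformly random supports of size k out of d parameters, the no-overlap probability has an exact product form, and the collision probability is roughly k²/d.

```python
import math

def collision_prob(d, k):
    """Probability that two independent uniform size-k supports share
    at least one of d parameters: 1 - C(d-k, k) / C(d, k)."""
    p_disjoint = 1.0
    for i in range(k):
        p_disjoint *= (d - k - i) / (d - i)
    return 1.0 - p_disjoint

# With d = 10,000 parameters and k = 10 touched per update, collisions
# happen about 1% of the time, matching the k^2/d estimate.
p = collision_prob(10_000, 10)
approx = 1.0 - math.exp(-10 * 10 / 10_000)
```

As d grows with k fixed, both the exact value and the k²/d estimate shrink toward zero, which is why sparsity makes lock-free updates safe.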

### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:

1. GPU writes are coalesced: a CUDA kernel writes parameter memory in
   large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or the new value
4. One slightly noisy token in a conversation is invisible
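
Point 3 above is easy to check numerically. In this sketch (learning rate and gradient scale taken from the estimate in point 2), a read that mixes old and new elements deviates from the fully-updated vector by at most lr·|g| per element, i.e. about 1e-5:

```python
import numpy as np

rng = np.random.default_rng(1)
lr = 1e-5
w_old = rng.normal(size=1000)          # weights before the update
g = rng.normal(size=1000)              # scaled gradient with O(1) entries
w_new = w_old - lr * g                 # weights after the update

# A "torn" read: the reader catches an arbitrary mix of old and new
# elements mid-write.
torn_mask = rng.random(1000) < 0.5
w_torn = np.where(torn_mask, w_new, w_old)

# Elementwise, the torn read is within lr * |g_i| of the new value:
# within rounding error of either endpoint.
max_dev = float(np.max(np.abs(w_torn - w_new)))
```

Whatever mix the reader observes, every element lies between the old and new values, so the worst case is one forward pass through slightly stale weights.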

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz-continuous gradients), HOGWILD achieves the convergence
rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε reflects the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.

Here "nearly optimal" means the convergence rate is within a constant
factor of serial SGD: the parallelism costs almost nothing in terms of
convergence quality.
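
A toy experiment makes this plausible (a simple delayed-gradient model, not the paper's proof): gradient steps computed from parameters several steps stale still drive the objective to zero at essentially the serial rate.

```python
import numpy as np

def final_loss(delay, steps=2000, lr=0.05, d=50, seed=0):
    """Gradient descent on f(x) = 0.5 * ||x||^2 where each step uses a
    parameter snapshot `delay` steps old, mimicking HOGWILD's
    inconsistent reads."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    history = [x.copy()]
    for _ in range(steps):
        stale = history[max(0, len(history) - 1 - delay)]
        x = x - lr * stale              # grad f(x) = x, read from a stale copy
        history.append(x.copy())
    return 0.5 * float(np.sum(x ** 2))

serial = final_loss(delay=0)
delayed = final_loss(delay=5)           # stale reads barely slow convergence
```

Both runs reach a negligible loss; staleness changes the constant, not the qualitative rate, which is the shape of the HOGWILD guarantee.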

## Practical implications for our system

### 1. No pause needed

vLLM inference reads the weights while Apollo writes them. The
convergence guarantee holds because our gradient updates are sparse:
each training example generates meaningful gradients for only a small
fraction of the 27B parameters.

### 2. No lock needed

Not even a mutex. In the worst case (vLLM reads mid-write), inference
produces one token from a partially-updated model; the next token uses
the fully-updated model. Errors do not accumulate, because each
inference step is independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses and never notices.
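
A minimal sketch of that daemon shape (names invented for the illustration): a background thread drains a queue of training examples as they arrive, where a real implementation would apply the weight update in place while inference keeps reading.

```python
import queue
import threading

examples = queue.Queue()
applied = []

def training_daemon():
    """Drain examples as they arrive: no batching, no pause signal."""
    while True:
        ex = examples.get()
        if ex is None:                  # shutdown sentinel
            break
        # A real daemon would do: weights -= lr * grad(ex), in place,
        # while the inference process keeps reading the same memory.
        applied.append(ex)

t = threading.Thread(target=training_daemon, daemon=True)
t.start()
for i in range(10):                     # examples trickle in continuously
    examples.put(i)
examples.put(None)
t.join()
```

The queue decouples arrival rate from update rate; nothing on the inference side ever blocks on this loop.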

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: most weights receive near-zero
gradients because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
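
This claim is cheap to verify empirically. A hedged sketch (the 1%-dense gradient below is an invented stand-in for a real measured one): count the fraction of entries whose magnitude is negligible relative to the largest.

```python
import numpy as np

def effective_sparsity(grad, rel_tol=1e-6):
    """Fraction of gradient entries that are negligible relative to
    the largest entry; high values satisfy HOGWILD's condition."""
    mags = np.abs(grad)
    return float(np.mean(mags <= rel_tol * mags.max()))

# Toy gradient from a "short decision segment": only 1% of 100k
# entries are meaningfully non-zero.
rng = np.random.default_rng(2)
g = np.zeros(100_000)
hot = rng.choice(g.size, size=1_000, replace=False)
g[hot] = rng.normal(size=1_000)

sparsity = effective_sparsity(g)
```

Running the same measurement on real per-example gradients would directly confirm (or refute) the "sparsity comes for free" assumption for the 27B model.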

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel rather than per element, the update pattern has additional
structure that reduces the effective conflict rate between the
training writer and the inference reader.
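
To make "per channel, not per element" concrete, here is an illustrative update rule (an assumption for this sketch, not Apollo's exact formula): every row (output channel) of the gradient is rescaled by a single factor, so the whole row moves coherently rather than each element independently.

```python
import numpy as np

def channelwise_scaled_step(w, grad, lr=1e-5, eps=1e-12):
    """One factor per output channel (row): divide each row of the
    gradient by its own RMS before applying. Illustrative only; the
    real Apollo scaling rule may differ."""
    row_rms = np.sqrt(np.mean(grad ** 2, axis=1, keepdims=True)) + eps
    return w - lr * grad / row_rms

rng = np.random.default_rng(3)
w = rng.normal(size=(8, 16))            # 8 output channels
g = rng.normal(size=(8, 16))
w_next = channelwise_scaled_step(w, g)

# Every channel's update has the same RMS magnitude (lr), no matter
# how large its raw gradient was.
step_rms = np.sqrt(np.mean((w - w_next) ** 2, axis=1))
```

The per-row structure means a torn read can only mix whole-channel-scaled values, which is the extra regularity the paragraph above refers to.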

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update may grow the parameter norm by at
most 1% over its previous value.
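
One common way to implement such a limiter (a sketch under the assumption that the bound is on the parameter norm; the actual Apollo rule may differ) is to rescale the post-update weights whenever their norm would exceed γ times the pre-update norm:

```python
import numpy as np

def limit_norm_growth(w, update, gamma=1.01):
    """Apply `update`, then rescale so the new norm is at most
    gamma * ||w||: no single step can grow the weights by more
    than (gamma - 1), i.e. 1% here."""
    w_new = w + update
    cap = gamma * np.linalg.norm(w)
    norm_new = np.linalg.norm(w_new)
    if norm_new > cap:
        w_new = w_new * (cap / norm_new)
    return w_new

w = np.ones(4)                          # ||w|| = 2
big = limit_norm_growth(w, np.ones(4))  # would double the norm; clipped
small = limit_norm_growth(w, 1e-6 * np.ones(4))  # passes through untouched
```

Bounding per-step growth this way also bounds how far any torn read can sit from the pre-update weights, which is what keeps inference stable.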

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
- The name "HOGWILD!" comes from running SGD "hog wild": completely
  uncontrolled parallelism