# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)

## The Setup

Multiple processors read and write to shared memory containing model parameters, with NO locks and NO synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may have modified the parameters between steps 1 and 4. The reads are inconsistent (a mix of old and new values), and the writes may overwrite each other.

## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small fraction of the total parameters. Formally, if the gradient of example e has support set s(e) (its non-zero entries), the updates are sparse if |s(e)| << the total number of parameters.

When updates are sparse, the probability of two concurrent updates modifying the same parameter is low. Most of the time the processors are writing to different parameters, so no information is lost.

### Our case: even better than HOGWILD!

HOGWILD! assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD!'s multi-writer scenario. The only risk is the reader seeing a partially-updated parameter tensor during inference. But:

1. GPU writes are coalesced — a CUDA kernel writes to parameter memory in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the old or new value
4. One slightly noisy token in a conversation is invisible

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions (Lipschitz-continuous gradients), HOGWILD! achieves the convergence rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε measures the conflict rate. With sparse updates and few processors, ε ≈ 1 and the rate matches standard SGD. "Nearly optimal" here means the convergence rate is within a constant factor of serial SGD — the parallelism costs almost nothing in terms of convergence quality.

## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence guarantee holds because our gradient updates are sparse: each training example generates meaningful gradients for only a small fraction of the 27B parameters.

### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one token from a partially-updated model; the next token uses the fully-updated model. There is no accumulation of error — each inference step is independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can process training examples as they arrive, updating weights continuously. vLLM never pauses, never notices.

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens) naturally produces sparse gradients. Most weights see near-zero gradient because only a small fraction of the model is "active" for any given short sequence. This satisfies HOGWILD!'s sparsity condition without any special engineering.

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the concurrent setting. Because the scaling factors are computed per channel (not per element), the update pattern has additional structure that reduces the effective conflict rate between the training writer and the inference reader.
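To illustrate what "per channel, not per element" means, here is a minimal sketch of row-wise gradient scaling. The RMS-style rule and the function name are assumptions for illustration, not Apollo's exact update:

```python
import numpy as np

def channelwise_scale(grad: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each output channel (row) of a gradient by a single factor.

    Illustrative RMS-style rule, not Apollo's actual formula: every element
    in a row shares one scale, so the update inherits row-level structure
    instead of varying independently per element.
    """
    rms = np.sqrt((grad ** 2).mean(axis=1, keepdims=True)) + eps
    return grad / rms

g = np.random.default_rng(0).standard_normal((4, 8))
u = channelwise_scale(g)
# After scaling, each row has (approximately) unit RMS: one factor per
# channel rather than one per element.
assert np.allclose((u ** 2).mean(axis=1), 1.0, atol=1e-3)
```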
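The sparsity condition above admits a back-of-envelope check: if two updates each touch k parameters chosen uniformly from d, the probability they collide at all is roughly k²/d. A small sketch (d and k here are illustrative, not our model's actual numbers):

```python
from math import comb

def collision_prob(d: int, k: int) -> float:
    """Exact probability that two independent size-k uniform supports
    over d parameters share at least one index."""
    return 1.0 - comb(d - k, k) / comb(d, k)

d, k = 1_000_000, 100            # illustrative: 1M params, support size 100
p = collision_prob(d, k)
print(f"exact: {p:.4%}, approx k^2/d: {k * k / d:.4%}")
# The k^2/d approximation is tight whenever k^2 << d.
assert abs(p - k * k / d) / p < 0.05
```

Even with a 1% collision chance, a conflict only perturbs the shared coordinates by one lr-sized step, which is the regime HOGWILD!'s analysis covers.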
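The single-writer/single-reader pattern can be sketched with ordinary Python threads and a shared array. This is a toy stand-in for the real GPU memory path (sizes, names, and step counts are illustrative; it makes no claim about CUDA write semantics):

```python
import threading
import numpy as np

# Toy single-writer / single-reader setup: one thread applies sparse SGD
# updates to a shared parameter vector while another reads it, with no locks.
dim = 10_000
params = np.zeros(dim)
lr = 1e-5
done = threading.Event()
snapshots = []

def writer(steps=2_000, support=16, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(dim, size=support, replace=False)  # sparse support s(e)
        grad = rng.standard_normal(support)                 # O(1) gradient entries
        params[idx] -= lr * grad                            # unlocked in-place write
    done.set()

def reader():
    while not done.is_set():
        snapshots.append(float(np.linalg.norm(params)))     # unlocked read

w = threading.Thread(target=writer)
r = threading.Thread(target=reader)
w.start(); r.start()
w.join(); r.join()

# A read that lands mid-update mixes old and new values, but each
# coordinate differs by at most lr * O(1), so every snapshot is sane.
assert all(np.isfinite(s) for s in snapshots)
assert np.linalg.norm(params) > 0
```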
The norm-growth limiter (γ=1.01) also helps — it prevents any single training step from making a large change that could temporarily destabilize inference. Each update's norm is bounded to be at most 1% larger than the previous update's norm.

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild" — completely uncontrolled parallelism
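The norm-growth limiter described above can be sketched as a clamp on the ratio of consecutive update norms. This assumes γ bounds that ratio; the helper name is illustrative, not Apollo's actual API:

```python
import numpy as np

GAMMA = 1.01  # at most 1% norm growth per step

def limit_growth(update: np.ndarray, prev_norm: float) -> np.ndarray:
    """Rescale `update` so its norm is at most GAMMA * prev_norm.

    Illustrative sketch: a spike gets scaled down; normal-sized
    updates pass through untouched.
    """
    norm = np.linalg.norm(update)
    if prev_norm > 0 and norm > GAMMA * prev_norm:
        update = update * (GAMMA * prev_norm / norm)
    return update

spike = np.full(4, 10.0)                 # norm 20, far above GAMMA * 1.0
clipped = limit_growth(spike, prev_norm=1.0)
assert np.isclose(np.linalg.norm(clipped), GAMMA)

small = np.full(4, 0.1)                  # norm 0.2, already under the bound
assert np.array_equal(limit_growth(small, prev_norm=1.0), small)
```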