# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
## The Setup

Multiple processors read and write shared memory containing the model
parameters, with NO locks and NO synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may have modified the parameters between steps 1 and 4.
Reads are inconsistent (a mix of old and new values), and writes may
overwrite each other.
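The loop above can be simulated in a few lines. This is a minimal sketch, not the paper's implementation: it uses Python threads, numpy, and a synthetic sparse least-squares problem as a stand-in for the real objective. All names and sizes here are illustrative.

```python
import threading
import numpy as np

def hogwild_worker(x, examples, lr, steps, seed):
    """One processor: sample an example, read (possibly stale) entries of
    the shared vector x, and write the update back with no lock."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        a, b = examples[rng.integers(len(examples))]
        idx = np.nonzero(a)[0]               # support set s(e)
        err = a[idx] @ x[idx] - b            # read shared parameters
        x[idx] -= lr * err * a[idx]          # lock-free write

rng = np.random.default_rng(0)
n = 100
x_true = rng.normal(size=n)
# Sparse examples: each one touches only 5 of the 100 coordinates.
examples = []
for _ in range(2000):
    a = np.zeros(n)
    support = rng.choice(n, size=5, replace=False)
    a[support] = rng.normal(size=5)
    examples.append((a, a @ x_true))

x = np.zeros(n)                              # shared parameter vector, no lock
threads = [threading.Thread(target=hogwild_worker,
                            args=(x, examples, 0.02, 3000, seed))
           for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(x - x_true))            # small despite the races
```

Four workers race on `x` with stale reads and unsynchronized writes, yet the error still shrinks toward zero because each update touches only 5 of the 100 coordinates.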
## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (its non-zero entries), the updates are sparse when
|s(e)| << n, the total number of parameters.

When updates are sparse, the probability that two concurrent updates
modify the same parameter is low. Most of the time the processors are
writing to different parameters, so no information is lost.
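How rarely do sparse updates collide? Under the simplifying assumption that two supports are independent and uniformly random, the overlap probability is hypergeometric; the numbers below are illustrative, not taken from the paper.

```python
from math import comb

def collision_prob(n, k):
    """Probability that two independent, uniformly random size-k support
    sets over n parameters share at least one coordinate."""
    return 1 - comb(n - k, k) / comb(n, k)

# A million parameters, each update touching 100 of them:
print(collision_prob(10**6, 100))   # ≈ 0.0099, about a 1% chance
```

With dense updates the picture flips: at n = 10 and k = 5, the collision probability is already above 99%, which is why the sparsity condition matters.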
### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing for the same memory. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is that the reader sees a partially-updated parameter tensor during
inference. But:
1. GPU writes are coalesced: a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or new value
4. One slightly noisy token in a conversation is invisible
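Point 3 is easy to check numerically. A minimal sketch, with numpy standing in for a GPU parameter tensor and hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 1e-5
w_old = rng.normal(size=1024).astype(np.float32)
update = (lr * rng.normal(size=1024)).astype(np.float32)  # lr × O(1) gradient
w_new = w_old + update

# A torn read: the first half of the tensor has already been updated,
# the second half still holds the old values.
torn = np.concatenate([w_new[:512], w_old[512:]])

# The torn tensor deviates from BOTH endpoints by at most one update:
print(np.abs(torn - w_old).max(), np.abs(torn - w_new).max())
```

Both deviations are bounded by the largest single entry of `update`, on the order of 1e-5, so the torn state is effectively indistinguishable from either the old or the new weights.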
### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz-continuous gradients), HOGWILD achieves the convergence rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```
where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
that of standard serial SGD.

"Nearly optimal" here means the convergence rate is within a constant
factor of serial SGD's: the parallelism costs almost nothing in terms
of convergence quality.
## Practical implications for our system

### 1. No pause needed

vLLM inference reads the weights while Apollo writes them. The
convergence guarantee holds because our gradient updates are sparse:
each training example generates meaningful gradients for only a small
fraction of the 27B parameters.
### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model; the next token uses the fully
updated model. There is no accumulation of error, because each
inference step is independent.
### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses, never notices.
### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: most weights see near-zero
gradient because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
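Whether a given batch actually satisfies the condition can be measured directly. A hypothetical diagnostic (the helper name and tolerance are ours, not from the source):

```python
import numpy as np

def gradient_sparsity(grad, tol=1e-8):
    """Fraction of gradient entries that are effectively zero.
    (1 - gradient_sparsity) plays the role of |s(e)| / n."""
    grad = np.asarray(grad)
    return float(np.mean(np.abs(grad) < tol))

# A synthetic gradient where only 2% of 10,000 entries are active:
g = np.zeros(10_000)
active = np.random.default_rng(0).choice(10_000, size=200, replace=False)
g[active] = 1.0
print(gradient_sparsity(g))   # 0.98
```

Logging this number during training would confirm empirically that short decision segments keep |s(e)|/n small.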
## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update's norm is bounded to be at most 1%
larger than the previous one's.
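The γ = 1.01 bound can be sketched as a simple rescaling step. This is a hypothetical helper illustrating the stated behavior, not Apollo's actual implementation:

```python
import numpy as np

def limit_norm_growth(update, prev_update_norm, gamma=1.01):
    """Rescale `update` so its norm is at most gamma times the previous
    update's norm (sketch of a norm-growth limiter; assumes prev_update_norm
    tracks the norm of the last applied update)."""
    norm = float(np.linalg.norm(update))
    cap = gamma * prev_update_norm
    if prev_update_norm > 0 and norm > cap:
        update = update * (cap / norm)
        norm = cap
    return update, norm

# A sudden spike gets clipped back to within 1% of the previous norm:
prev_norm = 1.0
spike = np.full(4, 10.0)                    # norm 20, far above the cap
limited, new_norm = limit_norm_growth(spike, prev_norm)
print(new_norm)                             # 1.01
```

Because the cap is relative to the previous step, the update norm can still grow over many steps, but never by more than 1% at a time, which is exactly the stability property described above.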
## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
- The name "HOGWILD!" comes from running SGD "hog wild": completely
  uncontrolled parallelism