# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)

## The Setup

Multiple processors read and write shared memory containing the model
parameters, with no locks and no synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may modify the parameters between steps 1 and 4.
Reads are inconsistent (a mix of old and new values), and writes may
overwrite each other.
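
The loop above can be sketched in a few lines. This is a toy illustration, not the paper's code: several Python threads run SGD on a shared NumPy array with no locks, and each example touches only a handful of coordinates (the problem sizes here are invented for the demo).

```python
import threading
import numpy as np

# Toy sparse least-squares problem: each example touches only k of d
# parameters, so lock-free concurrent updates rarely collide.
rng = np.random.default_rng(0)
d, n, k = 1000, 4000, 5
x_true = rng.normal(size=d)
supports = [rng.choice(d, size=k, replace=False) for _ in range(n)]
feats = [rng.normal(size=k) for _ in range(n)]
targets = [f @ x_true[s] for s, f in zip(supports, feats)]

x = np.zeros(d)   # shared parameter vector: no lock, no copy
lr = 0.05

def worker(example_ids):
    for _ in range(10):                  # a few passes over this shard
        for i in example_ids:
            s, f = supports[i], feats[i]
            err = f @ x[s] - targets[i]  # steps 1-3: read (maybe stale), compute grad
            x[s] -= lr * err * f         # step 4: write straight to shared memory

threads = [threading.Thread(target=worker, args=(list(range(t, n, 4)),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

mse = float(np.mean((x - x_true) ** 2))  # small despite zero synchronization
```

Despite unsynchronized reads and writes, the recovered parameters land close to `x_true`, which is the whole point of the result.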

## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small
fraction of the parameters. Formally, if the gradient of example e has
support set s(e) (its non-zero entries), the updates are sparse when
|s(e)| is much smaller than the total number of parameters.

When updates are sparse, the probability that two concurrent updates
modify the same parameter is low. Most of the time the processors
write to different parameters, so no information is lost.
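
The "low probability" claim can be made concrete with a back-of-the-envelope sketch (the paper's analysis is more careful): for two updates with uniformly random supports of size k out of d parameters, the no-overlap probability has an exact product form, and the collision probability is roughly k²/d.

```python
import math

def collision_prob(d, k):
    """Probability that two independent uniform size-k supports share
    at least one of d parameters: 1 - C(d-k, k) / C(d, k)."""
    p_disjoint = 1.0
    for i in range(k):
        p_disjoint *= (d - k - i) / (d - i)
    return 1.0 - p_disjoint

# With d = 10,000 parameters and k = 10 touched per update, collisions
# happen about 1% of the time, matching the k^2/d estimate.
p = collision_prob(10_000, 10)
approx = 1.0 - math.exp(-10 * 10 / 10_000)
```

As d grows with k fixed, both the exact value and the k²/d estimate shrink toward zero, which is why sparsity makes lock-free updates safe.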

### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:

1. GPU writes are coalesced: a CUDA kernel writes parameter memory in
   large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or the new value
4. One slightly noisy token in a conversation is invisible
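
Point 3 above is easy to check numerically. In this sketch (learning rate and gradient scale taken from the estimate in point 2), a read that mixes old and new elements deviates from the fully-updated vector by at most lr·|g| per element, i.e. about 1e-5:

```python
import numpy as np

rng = np.random.default_rng(1)
lr = 1e-5
w_old = rng.normal(size=1000)          # weights before the update
g = rng.normal(size=1000)              # scaled gradient with O(1) entries
w_new = w_old - lr * g                 # weights after the update

# A "torn" read: the reader catches an arbitrary mix of old and new
# elements mid-write.
torn_mask = rng.random(1000) < 0.5
w_torn = np.where(torn_mask, w_new, w_old)

# Elementwise, the torn read is within lr * |g_i| of the new value:
# within rounding error of either endpoint.
max_dev = float(np.max(np.abs(w_torn - w_new)))
```

Whatever mix the reader observes, every element lies between the old and new values, so the worst case is one forward pass through slightly stale weights.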

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz-continuous gradients), HOGWILD achieves the convergence
rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε reflects the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.

Here "nearly optimal" means the convergence rate is within a constant
factor of serial SGD: the parallelism costs almost nothing in terms of
convergence quality.
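
A toy experiment makes this plausible (a simple delayed-gradient model, not the paper's proof): gradient steps computed from parameters several steps stale still drive the objective to zero at essentially the serial rate.

```python
import numpy as np

def final_loss(delay, steps=2000, lr=0.05, d=50, seed=0):
    """Gradient descent on f(x) = 0.5 * ||x||^2 where each step uses a
    parameter snapshot `delay` steps old, mimicking HOGWILD's
    inconsistent reads."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    history = [x.copy()]
    for _ in range(steps):
        stale = history[max(0, len(history) - 1 - delay)]
        x = x - lr * stale              # grad f(x) = x, read from a stale copy
        history.append(x.copy())
    return 0.5 * float(np.sum(x ** 2))

serial = final_loss(delay=0)
delayed = final_loss(delay=5)           # stale reads barely slow convergence
```

Both runs reach a negligible loss; staleness changes the constant, not the qualitative rate, which is the shape of the HOGWILD guarantee.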

## Practical implications for our system

### 1. No pause needed

vLLM inference reads the weights while Apollo writes them. The
convergence guarantee holds because our gradient updates are sparse:
each training example generates meaningful gradients for only a small
fraction of the 27B parameters.

### 2. No lock needed

Not even a mutex. In the worst case (vLLM reads mid-write), inference
produces one token from a partially-updated model; the next token uses
the fully-updated model. Errors do not accumulate, because each
inference step is independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses and never notices.
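
A minimal sketch of that daemon shape (names invented for the illustration): a background thread drains a queue of training examples as they arrive, where a real implementation would apply the weight update in place while inference keeps reading.

```python
import queue
import threading

examples = queue.Queue()
applied = []

def training_daemon():
    """Drain examples as they arrive: no batching, no pause signal."""
    while True:
        ex = examples.get()
        if ex is None:                  # shutdown sentinel
            break
        # A real daemon would do: weights -= lr * grad(ex), in place,
        # while the inference process keeps reading the same memory.
        applied.append(ex)

t = threading.Thread(target=training_daemon, daemon=True)
t.start()
for i in range(10):                     # examples trickle in continuously
    examples.put(i)
examples.put(None)
t.join()
```

The queue decouples arrival rate from update rate; nothing on the inference side ever blocks on this loop.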

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: most weights receive near-zero
gradients because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
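
This claim is cheap to verify empirically. A hedged sketch (the 1%-dense gradient below is an invented stand-in for a real measured one): count the fraction of entries whose magnitude is negligible relative to the largest.

```python
import numpy as np

def effective_sparsity(grad, rel_tol=1e-6):
    """Fraction of gradient entries that are negligible relative to
    the largest entry; high values satisfy HOGWILD's condition."""
    mags = np.abs(grad)
    return float(np.mean(mags <= rel_tol * mags.max()))

# Toy gradient from a "short decision segment": only 1% of 100k
# entries are meaningfully non-zero.
rng = np.random.default_rng(2)
g = np.zeros(100_000)
hot = rng.choice(g.size, size=1_000, replace=False)
g[hot] = rng.normal(size=1_000)

sparsity = effective_sparsity(g)
```

Running the same measurement on real per-example gradients would directly confirm (or refute) the "sparsity comes for free" assumption for the 27B model.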

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel rather than per element, the update pattern has additional
structure that reduces the effective conflict rate between the
training writer and the inference reader.
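
To make "per channel, not per element" concrete, here is an illustrative update rule (an assumption for this sketch, not Apollo's exact formula): every row (output channel) of the gradient is rescaled by a single factor, so the whole row moves coherently rather than each element independently.

```python
import numpy as np

def channelwise_scaled_step(w, grad, lr=1e-5, eps=1e-12):
    """One factor per output channel (row): divide each row of the
    gradient by its own RMS before applying. Illustrative only; the
    real Apollo scaling rule may differ."""
    row_rms = np.sqrt(np.mean(grad ** 2, axis=1, keepdims=True)) + eps
    return w - lr * grad / row_rms

rng = np.random.default_rng(3)
w = rng.normal(size=(8, 16))            # 8 output channels
g = rng.normal(size=(8, 16))
w_next = channelwise_scaled_step(w, g)

# Every channel's update has the same RMS magnitude (lr), no matter
# how large its raw gradient was.
step_rms = np.sqrt(np.mean((w - w_next) ** 2, axis=1))
```

The per-row structure means a torn read can only mix whole-channel-scaled values, which is the extra regularity the paragraph above refers to.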

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update may grow the parameter norm by at
most 1% over its previous value.
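
One common way to implement such a limiter (a sketch under the assumption that the bound is on the parameter norm; the actual Apollo rule may differ) is to rescale the post-update weights whenever their norm would exceed γ times the pre-update norm:

```python
import numpy as np

def limit_norm_growth(w, update, gamma=1.01):
    """Apply `update`, then rescale so the new norm is at most
    gamma * ||w||: no single step can grow the weights by more
    than (gamma - 1), i.e. 1% here."""
    w_new = w + update
    cap = gamma * np.linalg.norm(w)
    norm_new = np.linalg.norm(w_new)
    if norm_new > cap:
        w_new = w_new * (cap / norm_new)
    return w_new

w = np.ones(4)                          # ||w|| = 2
big = limit_norm_growth(w, np.ones(4))  # would double the norm; clipped
small = limit_norm_growth(w, 1e-6 * np.ones(4))  # passes through untouched
```

Bounding per-step growth this way also bounds how far any torn read can sit from the pre-update weights, which is what keeps inference stable.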

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
- The name "HOGWILD!" comes from running SGD "hog wild": completely
  uncontrolled parallelism