research: HOGWILD convergence theory — why lock-free concurrent training works
parent ab61a502e4
commit 6af9e6fa76
1 changed file with 114 additions and 0 deletions: training/research/hogwild-convergence.md (new file)
# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)

## The Setup

Multiple processors read and write to shared memory containing model
parameters, with NO locks, NO synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may have modified the parameters between steps 1 and 4.
The reads are inconsistent (reading a mix of old and new values). The
writes may overwrite each other.
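
The four steps above can be sketched in a few lines. This is an illustrative toy (a sparse least-squares objective with made-up data), not the paper's implementation: four threads hammer one shared NumPy array with no locks, and each update touches only the small support of its example.

```python
import threading
import numpy as np

# Minimal lock-free SGD sketch (illustrative, not the paper's code):
# four threads apply sparse updates to one shared parameter vector.
DIM = 1_000
w = np.zeros(DIM)                       # shared parameters, no lock
rng = np.random.default_rng(0)

# Each example touches only 5 of 1000 coordinates: its support s(e).
examples = [(rng.choice(DIM, size=5, replace=False), float(rng.normal()))
            for _ in range(2_000)]

def worker(chunk, lr=0.1):
    for idx, target in chunk:
        local = w[idx]                  # 1. read (possibly stale) values
        err = local.sum() - target      # 2-3. gradient of 0.5*(pred - y)^2
        w[idx] -= lr * err              # 4. lock-free write to shared memory

threads = [threading.Thread(target=worker, args=(examples[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with the threads racing on `w`, the training loss over the examples drops well below its starting value, because two updates rarely touch the same coordinates.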

## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (the indices of its non-zero entries), then the updates
are sparse if |s(e)| << total parameters.

When updates are sparse, the probability of two concurrent updates
modifying the same parameter is low. Most of the time, the processors
are writing to different parameters, so no information is lost.
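
The "low probability" claim is simple combinatorics. A small illustrative calculation (the function name and the d, k values are ours): the chance that two updates, each touching k random coordinates out of d, share a coordinate.

```python
from math import comb

# Probability that two independent size-k supports, drawn from d
# coordinates, share at least one index.
def collision_prob(d, k):
    return 1.0 - comb(d - k, k) / comb(d, k)

# Sparse regime (k << d): conflicts are rare.
sparse = collision_prob(10_000, 10)     # ≈ 0.01
# Dense regime: conflicts dominate.
dense = collision_prob(100, 10)         # ≈ 0.67
```

Roughly, the collision probability scales like k²/d, so keeping updates sparse relative to the parameter count is what makes lock-free writes nearly conflict-free.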

### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:

1. GPU writes are coalesced — a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or new value
4. One slightly noisy token in a conversation is invisible
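
Points 2 and 3 can be checked numerically. A small demo with illustrative magnitudes (the sizes and lr value are assumptions matching the estimate above): a torn read that mixes old and new entries deviates from both full vectors by at most one update's size.

```python
import numpy as np

# Torn-read magnitude check: an SGD step at lr = 1e-5 on O(1) gradients
# moves each weight so little that a read mixing old and new entries
# stays within one step of both the old and the new parameter vectors.
rng = np.random.default_rng(0)
old = rng.normal(size=1024)             # pre-update parameters
grad = rng.normal(size=1024)            # scaled gradient, O(1) entries
lr = 1e-5
new = old - lr * grad

torn = np.concatenate([old[:512], new[512:]])   # simulated mid-write read
max_dev_old = np.max(np.abs(torn - old))
max_dev_new = np.max(np.abs(torn - new))
# Both deviations are bounded by a single update: lr * max|grad| ≈ 3e-5.
```

So the reader never sees anything worse than a model one tiny step away from a consistent one.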

### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD achieves the convergence rate:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.

The paper's "nearly optimal" claim means the convergence rate is within
a constant factor of serial SGD's — the parallelism costs almost nothing
in terms of convergence quality.
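
A toy illustration of why staleness barely hurts (this is a sketch on a 1-D quadratic, not the paper's proof): gradient descent on f(x) = 0.5·x² still converges geometrically when every step uses a parameter value that is several steps old, which is the kind of inconsistency lock-free reads introduce.

```python
# Gradient descent with a fixed read delay on f(x) = 0.5*x^2.
# delay=0 is ordinary serial GD; delay>0 mimics stale lock-free reads.
def stale_gd(delay, steps=300, lr=0.05, x0=1.0):
    xs = [x0]
    for _ in range(steps):
        stale = xs[max(0, len(xs) - 1 - delay)]   # value `delay` steps old
        xs.append(xs[-1] - lr * stale)            # grad of 0.5*x^2 is x
    return abs(xs[-1])
```

With a small step size, both the serial run and the stale run drive x to essentially zero; the delay only changes the constant, not whether it converges.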

## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example only generates meaningful gradients for a small fraction of
the 27B parameters).

### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model. The next token uses the fully
updated model. There is no accumulation of error — each inference step
is independent.

### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses, never notices.

### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients. Most weights receive near-zero
gradients because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
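
If we wanted to verify this empirically, one hypothetical helper (the name and threshold are ours, not the codebase's) is to measure what fraction of a gradient tensor is effectively zero:

```python
import numpy as np

# Hypothetical measurement helper: the fraction of gradient entries
# whose magnitude is at or below `threshold`, i.e. how sparse an
# update really is.
def effective_sparsity(grad, threshold=1e-8):
    return float(np.mean(np.abs(grad) <= threshold))

# A short decision segment concentrates gradient in a small active slice:
g = np.zeros(10_000)
g[:50] = 0.3        # only 0.5% of entries carry real gradient
```

Values near 1.0 confirm the update qualifies as sparse in HOGWILD's sense.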

## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ=1.01) also helps — it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update is bounded so that the new parameter
norm is at most 1% larger than the old one.
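
A sketch of one way such a limiter could work, under the assumption that γ caps the per-step parameter-norm growth ratio (this is our reading of the rule, not Apollo's actual code):

```python
import numpy as np

# Norm-growth limiter sketch (assumed semantics, not Apollo's code):
# rescale the proposed parameters so the new norm is at most
# gamma = 1.01 times the old norm.
def limit_norm_growth(w_old, w_new, gamma=1.01):
    old_norm = np.linalg.norm(w_old)
    new_norm = np.linalg.norm(w_new)
    if new_norm <= gamma * old_norm:
        return w_new                                 # within budget
    return w_new * (gamma * old_norm / new_norm)     # cap the growth
```

Updates that stay within the 1% budget pass through untouched; anything larger is shrunk to the boundary, so no single step can move the weights far enough to destabilize a concurrent inference read.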

## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild" — completely
  uncontrolled parallelism