# HOGWILD! Convergence Theory

Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
## The Setup

Multiple processors read and write to shared memory containing model
parameters, with NO locks and NO synchronization. Each processor:

1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory

Other processors may have modified the parameters between steps 1 and 4.
The reads are inconsistent (a mix of old and new values), and the
writes may overwrite each other.
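The four steps above can be sketched as a toy lock-free loop. This is an illustrative single-machine simulation, not the paper's experimental setup: the "example" each worker samples just touches one coordinate of a quadratic objective, and all names are invented for the sketch.

```python
import random
import threading

# Toy HOGWILD!-style loop: 4 workers update a shared parameter vector
# with no locks. Objective: minimize sum_i (w[i] - target[i])^2.
DIM, STEPS, LR = 64, 2000, 0.1
target = [random.uniform(-1, 1) for _ in range(DIM)]
w = [0.0] * DIM  # shared parameters, written without synchronization

def worker():
    for _ in range(STEPS):
        i = random.randrange(DIM)            # step 2: sample an "example"
        snapshot = w[i]                      # step 1: read (possibly stale)
        grad = 2.0 * (snapshot - target[i])  # step 3: compute gradient
        w[i] -= LR * grad                    # step 4: unsynchronized write

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = sum((w[i] - target[i]) ** 2 for i in range(DIM))
print(f"final loss: {loss:.6f}")  # small: converges despite lock-free writes
```

Despite races between the read in step 1 and the write in step 4, each write is still a contraction toward the optimum, so the loss shrinks, which is the intuition the next section formalizes.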
## Why It Works

### Sparsity condition

The key assumption: each gradient update modifies only a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (its non-zero entries), the updates are sparse when
|s(e)| << n, where n is the total number of parameters.

When updates are sparse, the probability that two concurrent updates
modify the same parameter is low. Most of the time the processors
write to different parameters, so no information is lost.
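A back-of-envelope check of that "low probability" claim, with illustrative numbers (n and k are assumptions of this sketch, not figures from the paper): if two concurrent updates each touch k of n parameters uniformly at random, the collision probability is roughly k²/n.

```python
# P(two random size-k supports overlap) = 1 - C(n-k, k) / C(n, k),
# which for k << n is approximately k^2 / n.
from math import comb

n = 1_000_000   # total parameters (assumed for illustration)
k = 100         # support size |s(e)| of one update (assumed)

p_exact = 1 - comb(n - k, k) / comb(n, k)
p_approx = k * k / n

print(f"exact:  {p_exact:.6f}")   # ~0.00995
print(f"approx: {p_approx:.6f}")  # 0.010000
```

So at this sparsity level, roughly 99% of concurrent update pairs touch entirely disjoint parameters.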
### Our case: even better than HOGWILD

HOGWILD assumes multiple writers competing. We have:

- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)

This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:

1. GPU writes are coalesced: a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
   old or the new value
4. One slightly noisy token in a conversation is invisible
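Points 2 and 3 can be checked with arithmetic. The sketch below assumes the weights are stored in bfloat16 (the document does not state the dtype, so this is an assumption), and shows that a ~1e-5 update on an O(1) weight is far below bf16 resolution, so "half-applied" is numerically almost indistinguishable from "applied" or "not applied".

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 precision (truncate to top 16 bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

w_old = 0.73                  # an O(1) weight (illustrative value)
w_new = w_old - 1e-5 * 0.8    # one step: lr * scaled_gradient ≈ 1e-5 * O(1)

print(to_bf16(w_old) == to_bf16(w_new))  # True: same stored bf16 value
print(2 ** -8)  # bf16 spacing for weights in [0.5, 1): ≈ 0.0039 >> 1e-5
```

The per-step change is two orders of magnitude below the storage format's spacing, so a torn read lands within rounding error of both the old and new tensors.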
### Convergence guarantee

Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD achieves a convergence rate of:

```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```

where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.

Here "nearly optimal" means the convergence rate is within a constant
factor of serial SGD's: the parallelism costs almost nothing in
convergence quality.
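The "ε ≈ 1" claim can be sanity-checked with a toy experiment (illustrative only, not the paper's benchmark): run sparse SGD on a simple quadratic twice, once with fresh parameter reads and once with reads that are several steps stale, and compare the final loss.

```python
import random

random.seed(0)
DIM, T, LR, STALE = 32, 5000, 0.05, 4
x_star = [random.uniform(-1, 1) for _ in range(DIM)]

def run(staleness: int) -> float:
    """Sparse SGD on f(x) = 0.5*||x - x_star||^2 with delayed reads."""
    x = [0.0] * DIM
    history = [list(x)]  # past iterates, used to simulate stale reads
    for _ in range(T):
        read = history[max(0, len(history) - 1 - staleness)]
        i = random.randrange(DIM)          # sparse update: one coordinate
        x[i] -= LR * (read[i] - x_star[i])  # gradient from a stale snapshot
        history.append(list(x))
    return sum((x[i] - x_star[i]) ** 2 for i in range(DIM))

fresh, stale = run(0), run(STALE)
print(f"fresh reads: {fresh:.2e}, stale reads: {stale:.2e}")  # both tiny
```

With sparse updates, a few steps of staleness rarely touches the coordinate being updated, so the delayed run converges essentially as fast as the fresh one.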
## Practical implications for our system

### 1. No pause needed

vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse: each training
example generates meaningful gradients for only a small fraction of
the 27B parameters.
### 2. No lock needed

Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model; the next token uses the fully
updated model. There is no accumulation of error: each inference step
is independent.
### 3. Training can be continuous

Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses and never notices.
### 4. The sparsity comes for free

Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients: most weights receive near-zero
gradients because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
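The sparsity claim is also easy to verify empirically. A hypothetical check is sketched below; `grads` here is synthetic stand-in data, whereas in the real system it would come from the backward pass over one short segment, and the threshold is an assumption.

```python
import random

random.seed(1)
N = 100_000
# Synthetic gradients: ~1% "active" parameters, the rest near zero,
# mimicking a short-segment backward pass.
grads = [
    random.gauss(0, 1.0) if random.random() < 0.01 else random.gauss(0, 1e-8)
    for _ in range(N)
]

def support_fraction(grads, threshold=1e-6):
    """Fraction of parameters with |grad| above threshold: |s(e)| / n."""
    return sum(1 for g in grads if abs(g) > threshold) / len(grads)

frac = support_fraction(grads)
print(f"support fraction: {frac:.4f}")  # ≈ 0.01, i.e. sparse
```

Logging this fraction per training step would confirm the condition holds in practice rather than assuming it.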
## Connection to Apollo

Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.

The norm-growth limiter (γ = 1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update's norm is bounded to be at most 1%
larger than the previous one's.
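A minimal sketch of the norm-growth limiter as described above: if an update's norm would exceed 1.01× the previous update's norm, scale it down. The function name, the L2 norm choice, and the exact placement in the optimizer step are assumptions of this sketch, not Apollo's confirmed implementation.

```python
import math

GAMMA = 1.01  # maximum allowed growth ratio between consecutive updates

def limit_norm_growth(update, prev_norm):
    """Scale `update` so its L2 norm is at most GAMMA * prev_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    if prev_norm is not None and norm > GAMMA * prev_norm:
        scale = GAMMA * prev_norm / norm
        update = [u * scale for u in update]
        norm = GAMMA * prev_norm
    return update, norm

# A spiky update gets clipped to 1% growth over the previous step:
u1, n1 = limit_norm_growth([0.1, 0.1], prev_norm=None)
u2, n2 = limit_norm_growth([1.0, 1.0], prev_norm=n1)
print(n2 / n1)  # at most 1.01
```

Bounding step-to-step growth this way also bounds the size of any torn read the inference process can observe, which is why it matters for the lock-free setting.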
## References

- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
  Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild": completely
  uncontrolled parallelism