
# HOGWILD! Convergence Theory
Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
## The Setup
Multiple processors read and write to shared memory containing model
parameters, with NO locks, NO synchronization. Each processor:
1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory
Other processors may have modified the parameters between steps 1 and 4.
The reads are inconsistent (reading a mix of old and new values). The
writes may overwrite each other.
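The four steps above can be sketched as a toy lock-free SGD loop over a shared parameter vector. This is a minimal illustration, not the paper's implementation; the least-squares problem, dimensions, and learning rate are invented for the example:

```python
import threading
import numpy as np

# Toy shared parameter vector; in real HOGWILD! this lives in shared memory.
DIM = 8
params = np.zeros(DIM)

# Hypothetical training set: each "example" touches one coordinate, with
# target values the parameters should converge toward.
targets = np.arange(1.0, DIM + 1.0)

def worker(seed, steps=2000, lr=0.05):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(DIM)          # 2. sample a training example
        x = params[i]                  # 1. read current value (possibly stale)
        grad = x - targets[i]          # 3. gradient of 0.5 * (x - target)^2
        params[i] -= lr * grad         # 4. write update, no lock taken

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.allclose(params, targets, atol=1e-2))  # True despite the races
```

Even when a stale read causes a lost or overwritten update, each write still moves its coordinate toward the target, so the races slow convergence slightly rather than break it.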
## Why It Works
### Sparsity condition
The key assumption: each gradient update only modifies a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (the non-zero entries), then the updates are sparse
if |s(e)| << total parameters.
When updates are sparse, the probability of two concurrent updates
modifying the same parameter is low. Most of the time, the processors
are writing to different parameters, so no information is lost.
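The "low probability" claim can be made concrete with a back-of-the-envelope calculation: the chance that two uniformly random k-sparse supports over d parameters share any coordinate. The uniform-support assumption is ours, for illustration only:

```python
from math import comb

def collision_prob(d, k):
    """Probability that two independent, uniformly random k-sparse
    supports over d parameters share at least one coordinate."""
    return 1 - comb(d - k, k) / comb(d, k)

# Sparse regime: 100 active entries out of 1M parameters -> ~1% conflict rate.
print(collision_prob(1_000_000, 100))
# Dense regime: 100 active entries out of 1000 -> conflicts nearly certain.
print(collision_prob(1_000, 100))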
### Our case: even better than HOGWILD
HOGWILD assumes multiple writers competing. We have:
- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)
This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:
1. GPU writes are coalesced: a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
old or new value
4. One slightly noisy token in a conversation is invisible
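Point 3 can be checked directly: construct a "torn" read that mixes old and new weights and bound its distance to both. A toy sketch, with an assumed learning rate of 1e-5 and synthetic random gradients:

```python
import numpy as np

lr = 1e-5
w_old = np.ones(1000, dtype=np.float32)
grad = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
w_new = w_old - lr * grad

# A "torn" read: the first half already updated, the second half still old.
torn = np.concatenate([w_new[:500], w_old[500:]])

# Every element of the torn read is within one update step (~1e-5)
# of both the old and the new weights.
print(float(np.abs(torn - w_old).max()) < 1e-4)
print(float(np.abs(torn - w_new).max()) < 1e-4)
```

Both checks pass: the torn read is never farther from either consistent state than a single per-element update, which is the "within rounding error" claim above.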
### Convergence guarantee
Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD achieves convergence rate:
```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```
where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.
"Nearly optimal" means the convergence rate is within a constant
factor of serial SGD's: the parallelism costs almost nothing in terms
of convergence quality.
## Practical implications for our system
### 1. No pause needed
vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example only generates meaningful gradients for a small fraction of
the 27B parameters).
### 2. No lock needed
Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model. The next token uses the
fully-updated model. There is no accumulation of error: each inference
step is independent.
### 3. Training can be continuous
Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses, never notices.
### 4. The sparsity comes for free
Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients. Most weights are near-zero
gradient because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
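One quick way to check this claim empirically is to measure the fraction of effectively-zero entries in a gradient. The sketch below does so on a synthetic gradient with an invented 2% active slice; the function name, threshold, and dimensions are ours, not part of any existing API:

```python
import numpy as np

def sparsity_fraction(grad, threshold=1e-9):
    """Fraction of gradient entries that are effectively zero."""
    return float(np.mean(np.abs(grad) < threshold))

# Hypothetical gradient from a short decision segment: only a small
# "active" slice of the parameters receives meaningful gradient signal.
rng = np.random.default_rng(1)
grad = np.zeros(100_000)
active = rng.choice(grad.size, size=2_000, replace=False)
grad[active] = rng.standard_normal(active.size) * 1e-3

print(sparsity_fraction(grad))  # ~0.98: most entries carry no update
```

Running the same measurement on actual per-example gradients from the 27B model would be the real test; the synthetic version only shows what the metric looks like when the sparsity condition holds.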
## Connection to Apollo
Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.
The norm-growth limiter (γ=1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update is bounded to be at most 1% larger
than the previous one.
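A minimal sketch of such a limiter, reconstructed from the description above (not Apollo's actual code): any update whose norm exceeds γ times the previous update's norm is scaled back down to that limit.

```python
import numpy as np

GAMMA = 1.01  # allow at most 1% norm growth per step

def limit_norm_growth(update, prev_norm):
    """Scale the update down if its norm exceeds GAMMA * prev_norm.
    Returns the (possibly rescaled) update and its final norm."""
    norm = float(np.linalg.norm(update))
    if prev_norm is not None and norm > GAMMA * prev_norm:
        update = update * (GAMMA * prev_norm / norm)
        norm = GAMMA * prev_norm
    return update, norm

# First step passes through unclipped; a 10x spike on the next step
# is clipped to exactly 1% above the previous norm.
u1, n1 = limit_norm_growth(np.full(4, 0.5), prev_norm=None)
u2, n2 = limit_norm_growth(np.full(4, 5.0), prev_norm=n1)
print(n2 / n1)  # 1.01
```

For the concurrent-reader argument, the relevant property is the bound itself: no single write to shared weights can ever be more than 1% larger (in norm) than the step before it.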
## References
- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild": completely
  uncontrolled parallelism