
# HOGWILD! Convergence Theory
Source: Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011, arXiv:1106.5730)
## The Setup
Multiple processors read and write to shared memory containing model
parameters, with NO locks, NO synchronization. Each processor:
1. Reads the current parameter vector (possibly stale)
2. Samples a training example
3. Computes a gradient
4. Writes the update directly to shared memory
Other processors may have modified the parameters between steps 1 and 4.
The reads are inconsistent (reading a mix of old and new values). The
writes may overwrite each other.
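The four steps above can be sketched as a toy lock-free SGD loop over a shared parameter vector. This is a minimal illustration, not the paper's implementation; the least-squares problem, dimensions, and learning rate are invented for the example:

```python
import threading
import numpy as np

# Toy shared parameter vector; in real HOGWILD! this lives in shared memory.
DIM = 8
params = np.zeros(DIM)

# Hypothetical training set: each "example" touches one coordinate, with
# target values the parameters should converge toward.
targets = np.arange(1.0, DIM + 1.0)

def worker(seed, steps=2000, lr=0.05):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(DIM)          # 2. sample a training example
        x = params[i]                  # 1. read current value (possibly stale)
        grad = x - targets[i]          # 3. gradient of 0.5 * (x - target)^2
        params[i] -= lr * grad         # 4. write update, no lock taken

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.allclose(params, targets, atol=1e-2))  # True despite the races
```

Even when a stale read causes a lost or overwritten update, each write still moves its coordinate toward the target, so the races slow convergence slightly rather than break it.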
## Why It Works
### Sparsity condition
The key assumption: each gradient update only modifies a small fraction
of the total parameters. Formally, if the gradient of example e has
support set s(e) (the non-zero entries), then the updates are sparse
if |s(e)| << total parameters.
When updates are sparse, the probability of two concurrent updates
modifying the same parameter is low. Most of the time, the processors
are writing to different parameters, so no information is lost.
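The "low probability" claim can be made concrete with a back-of-the-envelope calculation: the chance that two uniformly random k-sparse supports over d parameters share any coordinate. The uniform-support assumption is ours, for illustration only:

```python
from math import comb

def collision_prob(d, k):
    """Probability that two independent, uniformly random k-sparse
    supports over d parameters share at least one coordinate."""
    return 1 - comb(d - k, k) / comb(d, k)

# Sparse regime: 100 active entries out of 1M parameters -> ~1% conflict rate.
print(collision_prob(1_000_000, 100))
# Dense regime: 100 active entries out of 1000 -> conflicts nearly certain.
print(collision_prob(1_000, 100))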
### Our case: even better than HOGWILD
HOGWILD assumes multiple writers competing. We have:
- **One writer** (the training process applying Apollo updates)
- **One reader** (vLLM running inference)
This is strictly easier than HOGWILD's multi-writer scenario. The only
risk is the reader seeing a partially-updated parameter tensor during
inference. But:
1. GPU writes are coalesced: a CUDA kernel writes to parameter memory
   in large contiguous blocks, not one element at a time
2. Each parameter change is tiny (lr × scaled_gradient ≈ 1e-5 × O(1))
3. A partially-applied update is within rounding error of either the
old or new value
4. One slightly noisy token in a conversation is invisible
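Point 3 can be checked directly: construct a "torn" read that mixes old and new weights and bound its distance to both. A toy sketch, with an assumed learning rate of 1e-5 and synthetic random gradients:

```python
import numpy as np

lr = 1e-5
w_old = np.ones(1000, dtype=np.float32)
grad = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
w_new = w_old - lr * grad

# A "torn" read: the first half already updated, the second half still old.
torn = np.concatenate([w_new[:500], w_old[500:]])

# Every element of the torn read is within one update step (~1e-5)
# of both the old and the new weights.
print(float(np.abs(torn - w_old).max()) < 1e-4)
print(float(np.abs(torn - w_new).max()) < 1e-4)
```

Both checks pass: the torn read is never farther from either consistent state than a single per-element update, which is the "within rounding error" claim above.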
### Convergence guarantee
Under the sparsity condition and standard smoothness assumptions
(Lipschitz continuous gradients), HOGWILD achieves convergence rate:
```
E[f(x_t) - f*] ≤ O(1 / (ε·t))
```
where t is the number of updates and ε measures the conflict rate.
With sparse updates and few processors, ε ≈ 1 and the rate matches
standard SGD.
"Nearly optimal" means the convergence rate is within a constant
factor of serial SGD's: the parallelism costs almost nothing in terms
of convergence quality.
## Practical implications for our system
### 1. No pause needed
vLLM inference reads weights while Apollo writes them. The convergence
guarantee holds because our gradient updates are sparse (each training
example only generates meaningful gradients for a small fraction of
the 27B parameters).
### 2. No lock needed
Not even a mutex. The worst case (vLLM reads mid-write) produces one
token from a partially-updated model. The next token uses the
fully-updated model. There is no accumulation of error: each inference
step is independent.
### 3. Training can be continuous
Not batched nightly, not time-multiplexed. The Apollo daemon can
process training examples as they arrive, updating weights
continuously. vLLM never pauses, never notices.
### 4. The sparsity comes for free
Context-frozen training on short decision segments (50-256 tokens)
naturally produces sparse gradients. Most weights are near-zero
gradient because only a small fraction of the model is "active" for
any given short sequence. This satisfies HOGWILD's sparsity condition
without any special engineering.
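One quick way to check this claim empirically is to measure the fraction of effectively-zero entries in a gradient. The sketch below does so on a synthetic gradient with an invented 2% active slice; the function name, threshold, and dimensions are ours, not part of any existing API:

```python
import numpy as np

def sparsity_fraction(grad, threshold=1e-9):
    """Fraction of gradient entries that are effectively zero."""
    return float(np.mean(np.abs(grad) < threshold))

# Hypothetical gradient from a short decision segment: only a small
# "active" slice of the parameters receives meaningful gradient signal.
rng = np.random.default_rng(1)
grad = np.zeros(100_000)
active = rng.choice(grad.size, size=2_000, replace=False)
grad[active] = rng.standard_normal(active.size) * 1e-3

print(sparsity_fraction(grad))  # ~0.98: most entries carry no update
```

Running the same measurement on actual per-example gradients from the 27B model would be the real test; the synthetic version only shows what the metric looks like when the sparsity condition holds.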
## Connection to Apollo
Apollo's channel-wise scaling further helps convergence in the
concurrent setting. Because the scaling factors are computed per
channel (not per element), the update pattern has additional structure
that reduces the effective conflict rate between the training writer
and the inference reader.
The norm-growth limiter (γ=1.01) also helps: it prevents any single
training step from making a large change that could temporarily
destabilize inference. Each update is bounded to be at most 1% larger
than the previous one.
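A minimal sketch of such a limiter, reconstructed from the description above (not Apollo's actual code): any update whose norm exceeds γ times the previous update's norm is scaled back down to that limit.

```python
import numpy as np

GAMMA = 1.01  # allow at most 1% norm growth per step

def limit_norm_growth(update, prev_norm):
    """Scale the update down if its norm exceeds GAMMA * prev_norm.
    Returns the (possibly rescaled) update and its final norm."""
    norm = float(np.linalg.norm(update))
    if prev_norm is not None and norm > GAMMA * prev_norm:
        update = update * (GAMMA * prev_norm / norm)
        norm = GAMMA * prev_norm
    return update, norm

# First step passes through unclipped; a 10x spike on the next step
# is clipped to exactly 1% above the previous norm.
u1, n1 = limit_norm_growth(np.full(4, 0.5), prev_norm=None)
u2, n2 = limit_norm_growth(np.full(4, 5.0), prev_norm=n1)
print(n2 / n1)  # 1.01
```

For the concurrent-reader argument, the relevant property is the bound itself: no single write to shared weights can ever be more than 1% larger (in norm) than the step before it.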
## References
- Niu et al., "HOGWILD!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent" (NIPS 2011)
- The term "HOGWILD" comes from running SGD "hog wild": completely
  uncontrolled parallelism