training: use rank 64, define as single constant
- DEFAULT_RANK = 64 in train_router.py
- All references use the constant, not magic numbers
- ~2.5GB optimizer state instead of ~10GB

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 039473d31f
commit 68a2df2185
3 changed files with 16 additions and 16 deletions
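
Review note: the commit message's "~2.5GB instead of ~10GB" is the 4x reduction you would expect if optimizer state grows linearly with rank. A minimal sketch of that arithmetic follows. It assumes each 2D weight of shape (m, n) keeps two moment tensors of shape (rank, max(m, n)) in float32. That layout is an assumption for illustration and is not read from `optimizer.py`, and the layer shapes are hypothetical, not Qwen3.5-27B's real ones.

```python
def apollo_state_bytes(shapes, rank, bytes_per_elem=4):
    """Estimate optimizer-state bytes under the assumed layout:
    two moment tensors of shape (rank, max(m, n)) per 2D weight."""
    total = 0
    for m, n in shapes:
        total += 2 * rank * max(m, n) * bytes_per_elem
    return total


# Hypothetical layer shapes, for illustration only.
shapes = [(4096, 4096), (4096, 11008), (11008, 4096)]

small = apollo_state_bytes(shapes, rank=64)
large = apollo_state_bytes(shapes, rank=256)

# Under this layout the state is linear in rank, so 256 -> 64 is a 4x cut.
print(large / small)
```

The absolute figures in the commit depend on the real parameter shapes, so this only checks the ratio and does not reproduce the 2.5GB value.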
@@ -3,7 +3,7 @@
 ## Overview
 
 Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
-Full-weight updates (not LoRA) using Apollo optimizer with rank-256
+Full-weight updates (not LoRA) using Apollo optimizer with rank-64
 gradient projection. No pause required — HOGWILD concurrent training.
 Weights shared via CUDA IPC between vLLM and the training process.
 
@@ -63,10 +63,9 @@ LoRA trains adapter matrices, not base weights. For personality and
 behavioral changes that persist as disposition, the base weights
 need to change. Apollo makes this memory-feasible.
 
-### Rank 256
-Not Mini (rank-1). With 100+ diverse training examples, the
-gradient's effective dimensionality can reach hundreds. Rank-256
-captures the structure. Memory cost: ~10GB (negligible on B200).
+### Rank 64
+Not Mini (rank-1). Rank-64 captures gradient structure across diverse
+training examples while keeping memory low (~2.5GB on 27B model).
 Compute cost: <0.25% of forward+backward.
 
 ### Channel-wise scaling
@@ -94,7 +93,7 @@ from a per-parameter seed each step.
 ### Parameter grouping (Qwen3.5 gotcha)
 conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
 needs 2D matrices with min dimension >= rank. Small/3D tensors
-use standard Adam. Large 2D matrices use Apollo with rank-256.
+use standard Adam. Large 2D matrices use Apollo.
 
 ## Training Data Pipeline
 
@@ -229,7 +228,7 @@ a few hundred MB.
 
 | State | Location | Notes |
 |-------|----------|-------|
-| Apollo optimizer | train_router._optimizer | ~10GB for rank-256. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
+| Apollo optimizer | train_router._optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
 | HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
 
 ## Hyperparameters
@@ -237,7 +236,7 @@ a few hundred MB.
 | Parameter | Value | Rationale |
 |-----------|-------|-----------|
 | Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
-| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
+| Rank | 64 | Captures gradient structure. ~2.5GB state. Defined in `train_router.DEFAULT_RANK`. |
 | Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
 | Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
 | Batch size | 1 | Single examples, immediate updates. |
@@ -248,7 +247,7 @@ a few hundred MB.
 ## Components
 
 ### Built ✓
-- `optimizer.py` — Apollo optimizer (configurable rank, default 256)
+- `optimizer.py` — Apollo optimizer (configurable rank)
 - `train_router.py` — /train endpoint, runs in vLLM process
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `export_hook.py` — vLLM plugin hook for IPC handle export
@@ -8,9 +8,9 @@ Channel-wise or tensor-wise scaling is sufficient. Apollo approximates
 these scaling factors using a low-rank auxiliary optimizer state based on
 pure random projection.
 
-Default rank=256 (full Apollo). ~10GB state for 27B model, <0.25%
-compute overhead vs forward+backward. Captures gradient structure
-across 100+ behavioral training examples per batch.
+Default rank=64. ~2.5GB state for 27B model, <0.25% compute overhead
+vs forward+backward. Sufficient for behavioral training with diverse
+examples.
 
 Key implementation details from the paper:
 - Gradient scale factor α = √(n/r) compensates for projection ratio
@@ -34,7 +34,7 @@ class Apollo(Optimizer):
     Args:
         params: model parameters
         lr: learning rate (default: 1e-4)
-        rank: projection rank (default: 256)
+        rank: projection rank (default: 64)
         betas: Adam momentum coefficients (default: (0.9, 0.999))
         eps: numerical stability term (default: 1e-8)
         weight_decay: decoupled weight decay (default: 0.01)
@@ -46,7 +46,7 @@ class Apollo(Optimizer):
             Set to None to disable.
     """
 
-    def __init__(self, params, lr=1e-4, rank=256, betas=(0.9, 0.999),
+    def __init__(self, params, lr=1e-4, rank=64, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0.01, warmup_steps=0,
                  scale=None, proj_refresh=200, norm_growth_limit=1.01):
         defaults = dict(lr=lr, rank=rank, betas=betas, eps=eps,
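
Review note: the optimizer docstring context above gives the gradient scale factor as α = √(n/r). If that factor is applied as written, lowering the rank from 256 to 64 doubles it. The sketch below works through the arithmetic. Taking `n` to be the size of the projected dimension is an assumption, and 4096 is a hypothetical value chosen so the results are exact.

```python
import math


def grad_scale(n: int, rank: int) -> float:
    """Scale factor alpha = sqrt(n / r) as quoted in the docstring."""
    return math.sqrt(n / rank)


n = 4096  # hypothetical projected dimension, for illustration only

alpha_64 = grad_scale(n, 64)    # sqrt(64)
alpha_256 = grad_scale(n, 256)  # sqrt(16)

print(alpha_64, alpha_256, alpha_64 / alpha_256)
```

Whether the learning-rate range in the design doc needs revisiting after this change depends on how the optimizer combines α with its other scaling terms, which this diff does not show.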
@@ -42,6 +42,7 @@ _initialized: bool = False
 _optimizer: Any = None # Persisted Apollo optimizer
 
 OPTIMIZER_STATE_PATH = "/tmp/apollo_optimizer_state.pt"
+DEFAULT_RANK = 64
 
 
 def _load_training_model() -> nn.Module:
@@ -150,7 +151,7 @@ def _get_or_create_optimizer(model: nn.Module, config: dict[str, Any]):
     apollo_params, standard_params = [], []
     for p in model.parameters():
         if p.requires_grad:
-            if p.ndim >= 2 and min(p.shape) >= 256:
+            if p.ndim >= 2 and min(p.shape) >= DEFAULT_RANK:
                 apollo_params.append(p)
             else:
                 standard_params.append(p)
@@ -168,7 +169,7 @@ def _get_or_create_optimizer(model: nn.Module, config: dict[str, Any]):
     _optimizer = Apollo(
         groups,
         lr=config.get('lr', 1e-5),
-        rank=config.get('rank', 256),
+        rank=config.get('rank', DEFAULT_RANK),
         betas=tuple(config.get('betas', (0.9, 0.999))),
         eps=config.get('eps', 1e-8),
         weight_decay=config.get('weight_decay', 0.01),
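
Review note: in the two `train_router.py` hunks above, the grouping threshold compares against `DEFAULT_RANK`, while the optimizer's rank comes from `config.get('rank', DEFAULT_RANK)`. The two agree unless a caller overrides `rank`. One possible variant, sketched below on plain shape tuples so it runs without a model, applies the threshold with the rank actually in use. The helper names and all parameter names except the conv1d shape are hypothetical; the conv1d shape is the one quoted in the design doc.

```python
DEFAULT_RANK = 64  # value introduced by this commit


def uses_apollo(shape, rank=DEFAULT_RANK):
    """Same test as the diff (ndim >= 2 and min dim >= threshold),
    but with the threshold tied to the rank being used."""
    return len(shape) >= 2 and min(shape) >= rank


def split_by_shape(named_shapes, rank=DEFAULT_RANK):
    """Return (apollo_names, standard_names) for a name -> shape mapping."""
    apollo, standard = [], []
    for name, shape in named_shapes.items():
        (apollo if uses_apollo(shape, rank) else standard).append(name)
    return apollo, standard


# Hypothetical parameter names, for illustration only.
named_shapes = {
    "conv1d.weight": (10240, 1, 4),  # 3D, min dim 1: standard Adam
    "proj.weight": (5120, 5120),
    "small.weight": (5120, 128),
    "norm.weight": (5120,),          # 1D: standard Adam
}

default_split = split_by_shape(named_shapes)
override_split = split_by_shape(named_shapes, rank=256)
print(default_split)
print(override_split)
```

With a rank override of 256, `small.weight` moves to the standard group, which a fixed `DEFAULT_RANK` threshold would not do.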