training: use rank 64, define as single constant

- DEFAULT_RANK = 64 in train_router.py
- All references use the constant, not magic numbers
- ~2.5GB optimizer state instead of ~10GB

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 00:53:48 -04:00
parent 039473d31f
commit 68a2df2185
3 changed files with 16 additions and 16 deletions

@@ -3,7 +3,7 @@
## Overview
Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
-Full-weight updates (not LoRA) using Apollo optimizer with rank-256
+Full-weight updates (not LoRA) using Apollo optimizer with rank-64
gradient projection. No pause required — HOGWILD concurrent training.
Weights shared via CUDA IPC between vLLM and the training process.
@@ -63,10 +63,9 @@ LoRA trains adapter matrices, not base weights. For personality and
behavioral changes that persist as disposition, the base weights
need to change. Apollo makes this memory-feasible.
-### Rank 256
-Not Mini (rank-1). With 100+ diverse training examples, the
-gradient's effective dimensionality can reach hundreds. Rank-256
-captures the structure. Memory cost: ~10GB (negligible on B200).
+### Rank 64
+Not Mini (rank-1). Rank-64 captures gradient structure across diverse
+training examples while keeping memory low (~2.5GB on 27B model).
Compute cost: <0.25% of forward+backward.
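
The memory figures above follow from the optimizer state scaling linearly with rank. A minimal sketch of that arithmetic, assuming two moment buffers per 2D weight, each of shape `(rank, max(m, n))`; the function name, the fp32 element size, and the example shapes are hypothetical and not taken from `optimizer.py`:

```python
DEFAULT_RANK = 64  # mirrors the constant this commit adds to train_router.py


def apollo_state_bytes(shapes, rank=DEFAULT_RANK, bytes_per_elem=4):
    """Rough low-rank optimizer state size for a list of 2D weight shapes.

    Assumption: each matrix keeps two moment buffers (first and second
    moment) in the projected space, where the smaller dimension has been
    projected down to `rank`, so each buffer holds rank * max(m, n) elements.
    """
    total = 0
    for m, n in shapes:
        total += 2 * rank * max(m, n) * bytes_per_elem
    return total


# Hypothetical layer shapes, for illustration only.
example_shapes = [(5120, 5120), (5120, 17408), (17408, 5120)]

bytes_rank_64 = apollo_state_bytes(example_shapes, rank=64)
bytes_rank_256 = apollo_state_bytes(example_shapes, rank=256)

# State size is linear in rank, so rank 256 costs four times rank 64.
ratio = bytes_rank_256 / bytes_rank_64
```

Under this assumption, dropping from rank 256 to rank 64 divides the state size by four, which is consistent with the ~10GB to ~2.5GB change described in the commit message.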
### Channel-wise scaling
@@ -94,7 +93,7 @@ from a per-parameter seed each step.
### Parameter grouping (Qwen3.5 gotcha)
conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
needs 2D matrices with min dimension >= rank. Small/3D tensors
-use standard Adam. Large 2D matrices use Apollo with rank-256.
+use standard Adam. Large 2D matrices use Apollo.
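
The grouping rule above can be sketched as a simple shape check. This is an illustration only, not the code in `train_router.py`: the function name `split_param_groups` and the parameter names and shapes in the example (other than the conv1d shape quoted above) are hypothetical.

```python
DEFAULT_RANK = 64  # mirrors the constant this commit adds to train_router.py


def split_param_groups(named_shapes, rank=DEFAULT_RANK):
    """Split parameter names into (apollo, adam) lists by tensor shape.

    A parameter goes to the low-rank group only if it is a 2D matrix whose
    smaller dimension is at least `rank`; everything else (1D, 3D, or small
    2D tensors) falls back to standard Adam.
    """
    apollo, adam = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and min(shape) >= rank:
            apollo.append(name)
        else:
            adam.append(name)
    return apollo, adam


# Hypothetical parameter names; only the conv1d shape comes from the text.
example = {
    "conv1d.weight": (10240, 1, 4),   # 3D tensor, cannot be projected
    "down_proj.weight": (5120, 17408),  # large 2D matrix
    "norm.weight": (5120,),           # 1D tensor
    "small_gate.weight": (32, 5120),  # 2D, but min dimension < rank
}

apollo_names, adam_names = split_param_groups(example)
```

Because the threshold reads `DEFAULT_RANK` rather than a literal, changing the rank in one place also moves the cutoff for which matrices qualify for projection.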
## Training Data Pipeline
@@ -229,7 +228,7 @@ a few hundred MB.
| State | Location | Notes |
|-------|----------|-------|
-| Apollo optimizer | train_router._optimizer | ~10GB for rank-256. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
+| Apollo optimizer | train_router._optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
## Hyperparameters
@@ -237,7 +236,7 @@ a few hundred MB.
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
-| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
+| Rank | 64 | Captures gradient structure. ~2.5GB state. Defined in `train_router.DEFAULT_RANK`. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
@@ -248,7 +247,7 @@ a few hundred MB.
## Components
### Built ✓
-- `optimizer.py` — Apollo optimizer (configurable rank, default 256)
+- `optimizer.py` — Apollo optimizer (configurable rank)
- `train_router.py` — /train endpoint, runs in vLLM process
- `weight_mapping.py` — vLLM merged → HF separate views (validated)
- `export_hook.py` — vLLM plugin hook for IPC handle export