# Latent Reasoning Integration Plan for Qwen 3.5 27B

**Status:** Research complete, ready for implementation

**Date:** 2026-04-12

**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer
## Executive Summary

Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).

---
## 1. The Landscape

### Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |
### Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

---

## 2. Detailed Technique Analysis
### 2.1 Random Prefix Perturbation (dl1683)

**Mechanism:** Prepend 2 random embedding-scale tokens before the input. This breaks attention-sink patterns and shifts the model into an "exploratory computation mode."

**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans

**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al. 2024). Under greedy decoding, degenerate patterns lock in; perturbation breaks this.

**Integration:** Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.
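At inference time the mechanism is just a concatenation in embedding space. A minimal sketch, assuming the model is run on `inputs_embeds` rather than token IDs (the helper name and the `torch.randn`-scaled-by-RMS initialization are our assumptions, following the description above):

```python
import torch

def add_random_prefix(text_embeds, embed_rms, n_prefix=2):
    """Prepend n_prefix random embedding-scale vectors (hypothetical helper).

    text_embeds: (batch, seq, dim) input embeddings.
    embed_rms: scalar RMS of the embedding matrix, so the noise sits at
    embedding scale rather than unit scale.
    """
    batch, _, dim = text_embeds.shape
    prefix = torch.randn(
        batch, n_prefix, dim,
        device=text_embeds.device, dtype=text_embeds.dtype,
    ) * embed_rms
    return torch.cat([prefix, text_embeds], dim=1)
```

The original text embeddings are untouched; only two extra positions are added in front, which is what makes this a zero-training intervention.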
### 2.2 Pause Tokens (Google, Oct 2023)

**Mechanism:** Add learnable pause tokens to the embedding space. The model processes extra hidden vectors before committing to output.

**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%

**Critical requirement:** MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.
### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

**Mechanism:** Feed the last hidden state back as the next input embedding directly (no decoding to tokens). Enables breadth-first-search-style reasoning.

**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously, avoiding premature commitment to a single path.

**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace the first k reasoning steps with k×c continuous thoughts
3. c is a hyperparameter controlling latent thought expansion

**Integration:** Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.
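The feedback loop can be sketched in a few lines. Here `model` stands for any callable returning last hidden states of shape (batch, seq, dim) - an assumption for illustration, not the COCONUT reference implementation:

```python
import torch

def continuous_thoughts(model, inputs_embeds, n_thoughts=2):
    """COCONUT-style sketch: feed the final hidden state back as the next
    input embedding instead of decoding it to a token."""
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds)          # (batch, seq, dim)
        thought = hidden[:, -1:, :]            # last position's hidden state
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
    return inputs_embeds                       # seq length grown by n_thoughts
```

Because each appended "thought" is a raw hidden state rather than a committed token, it can carry superposed information about several candidate next steps.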
### 2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)

**Mechanism:** Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.

**Results:**
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost

**Integration:** DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.
### 2.5 ActAdd / Activation Engineering (Turner et al., 2023)

**Mechanism:** Compute a steering vector by contrasting intermediate activations on a prompt pair, then add it during the forward pass.

**Results:** SOTA on sentiment shift and detoxification.

**Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.

**Integration:** Prototype behaviors with steering vectors, then train them permanently into the weights. Steering vector as specification → APOLLO training as compilation.
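The contrast-and-add recipe is small enough to sketch directly. The two-function split and the mean over prompt positions are our reading of the method, not Turner et al.'s exact code:

```python
import torch

def steering_vector(acts_pos, acts_neg):
    """Contrast activations from a prompt pair at one layer.

    acts_pos / acts_neg: (n_positions, dim) intermediate activations for the
    positive and negative prompts; the difference of means is the vector.
    """
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def apply_steering(hidden, vec, scale=1.0):
    """Add the steering vector to every position's hidden state
    during the forward pass (e.g. via a forward hook)."""
    return hidden + scale * vec
```

This is also the prototyping loop the Integration note describes: once a `(vec, scale, layer)` triple works at inference, it becomes a training target to bake into the weights.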
### 2.6 Planning Tokens (ICLR 2024)

**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

**Integration:** Add to the embedding matrix, train end-to-end with APOLLO.
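A minimal sketch of the prepend-per-step mechanic (the step segmentation into a list of tensors and the helper name are illustrative assumptions):

```python
import torch

def insert_planning_tokens(step_embeds, plan_token):
    """Prepend one learnable planning-token embedding before each reasoning
    step, then re-concatenate the sequence.

    step_embeds: list of (n_tokens, dim) tensors, one per reasoning step.
    plan_token: (dim,) learnable embedding (an nn.Parameter in training).
    """
    pieces = []
    for step in step_embeds:
        pieces.append(torch.cat([plan_token.unsqueeze(0), step], dim=0))
    return torch.cat(pieces, dim=0)
```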
---

## 3. Our Setup

**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context

**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom
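The headroom claim is simple arithmetic worth writing down (activation, gradient, and optimizer-state costs are workload-dependent and deliberately excluded here):

```python
# bf16 stores 2 bytes per parameter, so weights alone for a 27B model:
params = 27e9
weights_gb = params * 2 / 1e9      # 54.0 GB
headroom_gb = 192 - weights_gb     # 138.0 GB left on a B200 for activations,
                                   # gradients, optimizer state, and KV cache
```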
**Optimizer:** APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights

**Stack:** Crane (Candle-based, 21K lines)

**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure

**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.

---
## 4. Recommended Implementation Order

### Tier 1: Immediate (High ROI, Low Risk)

**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Add pause tokens to the embedding space, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training (expand adds the batch dim)
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: only compute loss on the first 8 tokens of reasoning
loss = F.cross_entropy(
    logits[:, :8].reshape(-1, logits.size(-1)),
    targets[:, :8].reshape(-1),
)
```
**2. Random Prefix Validation**
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning

### Tier 2: After Baseline (Medium Effort)

**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space
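The staging above can be made concrete as a schedule table, following the k → k×c replacement rule from §2.3 (the dict layout and the default `c` are illustrative assumptions):

```python
def coconut_schedule(max_steps, c=2):
    """Stage k replaces the first k reasoning steps with k*c continuous
    thoughts; stage 0 is plain CoT fine-tuning."""
    return [
        {"stage": k, "replaced_steps": k, "continuous_thoughts": k * c}
        for k in range(max_steps + 1)
    ]
```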
**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO

### Tier 3: Experimental

**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation

**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
---

## 5. Implementation Details

### Computing Embedding RMS

```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```
### Pause Token Implementation in Crane

```rust
// In the model forward pass. `pause_tokens` must already carry the batch
// dimension: (batch, n_pause, hidden_dim).
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```
### UPFT Loss Modification

```python
import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the prefix tokens
def upft_loss(logits, targets, prefix_len=8):
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, logits.size(-1)),
        targets[:, :prefix_len].reshape(-1),
    )
```
---

## 6. Evaluation Plan

### Benchmarks

| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

### Comparison Matrix

| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |
---

## 7. Open Questions

1. **Does random perturbation scale to 27B?** Tested on 4B - the effect may differ
2. **Optimal token count for 27B?** 2 was optimal for 4B, might change
3. **DeltaNet interaction?** 75% linear attention is untested territory
4. **Composition effects?** Prefix + steering + pause tokens together?
5. **GDN as continuous thought channel?** Novel research direction
---

## 8. Risk Assessment

| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |
---

## 9. Timeline Estimate

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |
---

## 10. References

### Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024

### Our Existing Work
- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure

---

*Research complete 2026-04-12. Ready for implementation.*