consciousness/docs/latent-reasoning-integration-plan.md

# Latent Reasoning Integration Plan for Qwen 3.5 27B
**Status:** Research complete, ready for implementation
**Date:** 2026-04-12
**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer
## Executive Summary
Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).
---
## 1. The Landscape
### Approaches That Require Pretraining (Not Applicable)
| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |
### Approaches Compatible with Finetuning (Our Focus)
| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |
---
## 2. Detailed Technique Analysis
### 2.1 Random Prefix Perturbation (dl1683)
**Mechanism:** Prepend 2 random embedding-scale tokens before input. Breaks attention sink patterns, shifts model into "exploratory computation mode."
**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues degenerate 14-word outputs, expanding them into 650+ word plans
**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al. 2023). Under greedy decoding, degenerate attention patterns lock in; perturbation breaks them.
**Integration:** Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.
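The inference-time test can be sketched as follows. Names and shapes are illustrative, not taken from the dl1683 repo; the key idea is sampling the prefix at the embedding matrix's RMS scale so it lands in-distribution:

```python
import torch

def make_random_prefix(embed_weight: torch.Tensor, n_tokens: int = 2) -> torch.Tensor:
    """Sample n random vectors at the embedding matrix's RMS scale."""
    embed_rms = embed_weight.float().square().mean().sqrt()
    return torch.randn(n_tokens, embed_weight.shape[1]) * embed_rms

# stand-in for model.get_input_embeddings().weight
embed_weight = torch.randn(1000, 64)
prefix = make_random_prefix(embed_weight, n_tokens=2)

# prepend before the prompt embeddings, then decode as usual
text_embeds = embed_weight[torch.tensor([[1, 2, 3]])]                 # (batch=1, seq=3, dim)
inputs_embeds = torch.cat([prefix.unsqueeze(0), text_embeds], dim=1)  # (1, 5, dim)
```

Since the perturbation only touches the input embeddings, this needs no model changes: any stack that accepts `inputs_embeds` (or Crane's equivalent) can run it.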
### 2.2 Pause Tokens (Google, Oct 2023)
**Mechanism:** Add learnable pause tokens to embedding space. Model processes extra hidden vectors before committing to output.
**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%
**Critical requirement:** MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.
**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.
### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)
**Mechanism:** Feed last hidden state back as next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.
**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to single path.
**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace first k reasoning steps with k×c continuous thoughts
3. c is hyperparameter controlling latent thought expansion
**Integration:** Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.
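The feedback loop at the heart of COCONUT can be sketched as below. `model_step` is a hypothetical single-step function (embedding in, last-position hidden state out) standing in for one forward pass; a real implementation would also thread the KV/GDN cache through:

```python
import torch

def continuous_thoughts(model_step, h0: torch.Tensor, n_thoughts: int) -> list:
    """Roll out n latent thoughts: each last hidden state is fed back in
    as the next input embedding, with no decode-to-token in between."""
    thoughts, h = [], h0
    for _ in range(n_thoughts):
        h = model_step(h)  # hidden state re-enters as an input embedding
        thoughts.append(h)
    return thoughts

# toy stand-in: a linear map plays the role of one transformer step
dim = 8
step = torch.nn.Linear(dim, dim)
out = continuous_thoughts(lambda h: step(h), torch.zeros(1, dim), n_thoughts=3)
```

During staged training, these latent steps replace the first k textual reasoning steps while the loss stays on the remaining text tokens.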
### 2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)
**Mechanism:** Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.
**Results:**
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost
**Integration:** DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.
### 2.5 ActAdd / Activation Engineering (Turner et al., 2023)
**Mechanism:** Compute steering vector by contrasting intermediate activations on prompt pairs. Add during forward pass.
**Results:** SOTA on sentiment shift and detoxification.
**Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.
**Integration:** Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.
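A minimal sketch of the contrastive extraction, assuming we already have per-position activations for a prompt pair at the target layer. The mean-over-positions reduction is one simple choice; ActAdd proper injects the per-position difference at a chosen injection point:

```python
import torch

def steering_vector(hidden_pos: torch.Tensor, hidden_neg: torch.Tensor) -> torch.Tensor:
    """Contrast activations from a prompt pair at one layer.

    hidden_*: (seq, dim) activations for the positive/negative prompt,
    truncated to the same length. Returns a single (dim,) direction.
    """
    return (hidden_pos - hidden_neg).mean(dim=0)

# usage sketch: during the forward pass, add `alpha * v` to the layer-48
# residual stream; alpha is found by sweeping and checking for degradation
hidden_pos = torch.randn(5, 16)
hidden_neg = torch.randn(5, 16)
v = steering_vector(hidden_pos, hidden_neg)
```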
### 2.6 Planning Tokens (ICLR 2024)
**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.
**Integration:** Add to embedding matrix, train end-to-end with APOLLO.
---
## 3. Our Setup
**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context
**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom
**Optimizer:** APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights
**Stack:** Crane (Candle-based, 21K lines)
**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure
**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.
---
## 4. Recommended Implementation Order
### Tier 1: Immediate (High ROI, Low Risk)
**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction
```python
# Add 4 learnable pause tokens, initialized at the embedding matrix's RMS scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend them to reasoning inputs during training (batch = batch size)
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: compute loss only on the first 8 tokens of the reasoning span
loss = loss_fn(logits[:, :8], targets[:, :8])
```
**2. Random Prefix Validation**
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning
### Tier 2: After Baseline (Medium Effort)
**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space
**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO
### Tier 3: Experimental
**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation
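One way to prototype multi-layer steering before touching Crane is forward hooks in a PyTorch reference model. Layer indices, module handles, and scales below are assumptions to be tuned, and the demo uses plain `Linear` layers as stand-ins for transformer blocks:

```python
import torch

def install_steering(layers, vectors, scales):
    """Inject a per-layer steering vector into each layer's output,
    with a per-layer scale to control degradation."""
    handles = []
    for layer, v, s in zip(layers, vectors, scales):
        def hook(module, inputs, output, v=v, s=s):
            # HF-style blocks may return a tuple (hidden_states, ...)
            h = output[0] if isinstance(output, tuple) else output
            h = h + s * v
            return (h, *output[1:]) if isinstance(output, tuple) else h
        handles.append(layer.register_forward_hook(hook))
    return handles  # call .remove() on each to uninstall

# toy demonstration on stand-in blocks
dim = 4
blocks = [torch.nn.Linear(dim, dim, bias=False) for _ in range(3)]
vecs = [torch.ones(dim), -torch.ones(dim), torch.zeros(dim)]
handles = install_steering(blocks, vecs, scales=[0.5, 0.5, 0.5])
y = blocks[0](torch.zeros(1, dim))  # zero input, so only the injected 0.5 * ones remains
```

For the real model the `layers` list would be the modules at indices 40, 48, and 56, and each vector would be extracted separately per layer rather than shared.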
**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
---
## 5. Implementation Details
### Computing Embedding RMS
```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```
### Pause Token Implementation in Crane
```rust
// In the model forward pass: prepend learned pause-token embeddings
// pause_tokens: (batch, n_pause, hidden); concatenated along the sequence dim
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```
### UPFT Loss Modification
```python
import torch.nn.functional as F

# Standard finetuning: loss over all tokens.
# UPFT: loss only over the first `prefix_len` reasoning tokens.
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.shape[-1]
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```
---
## 6. Evaluation Plan
### Benchmarks
| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |
### Comparison Matrix
| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |
---
## 7. Open Questions
1. **Does random perturbation scale to 27B?** Tested on 4B - effect may differ
2. **Optimal token count for 27B?** 2 optimal for 4B, might change
3. **DeltaNet interaction?** 75% linear attention is untested territory
4. **Composition effects?** Prefix + steering + pause tokens together?
5. **GDN as continuous thought channel?** Novel research direction
---
## 8. Risk Assessment
| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |
---
## 9. Timeline Estimate
| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |
---
## 10. References
### Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024
### Our Existing Work
- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure
---
*Research complete 2026-04-12. Ready for implementation.*