research: latent reasoning integration plans for Qwen 3.5 27B
Two research documents:

- latent-reasoning-integration-plan.md: Synthesizes 10+ papers on latent reasoning, identifies which approaches work with finetuning (vs. requiring pretraining from scratch), and maps them to our APOLLO-Mini training pipeline.
- pause-tokens-gdn-recurrence.md: Explores the connection between token-based latent reasoning and GDN's internal recurrence. Key insight: pause tokens on Qwen 3.5 trigger both forward passes AND recurrent state updates, a double benefit.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent dcd647764c, commit f06c8077e1. 2 changed files with 588 additions and 0 deletions.

docs/latent-reasoning-integration-plan.md (new file, +300 lines)
# Latent Reasoning Integration Plan for Qwen 3.5 27B

**Status:** Research complete, ready for implementation
**Date:** 2026-04-12
**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer

## Executive Summary

Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).

---

## 1. The Landscape

### Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |

### Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

---

## 2. Detailed Technique Analysis

### 2.1 Random Prefix Perturbation (dl1683)

**Mechanism:** Prepend 2 random embedding-scale tokens before the input. Breaks attention sink patterns, shifts the model into an "exploratory computation mode."

**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans

**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al. 2024). Under greedy decoding, degenerate patterns lock in. Perturbation breaks this.

**Integration:** Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.

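A minimal sketch of the mechanism, assuming the 2-token, embedding-RMS-scaled setup above (the function name and shapes are ours, not from the dl1683 repo):

```python
import torch

def add_random_prefix(text_embeds: torch.Tensor, embed_rms: float,
                      n_prefix: int = 2) -> torch.Tensor:
    """Prepend n_prefix random vectors, scaled to the embedding RMS,
    to a [batch, seq, dim] batch of input embeddings."""
    batch, _, dim = text_embeds.shape
    prefix = torch.randn(batch, n_prefix, dim,
                         device=text_embeds.device,
                         dtype=text_embeds.dtype) * embed_rms
    return torch.cat([prefix, text_embeds], dim=1)
```

The original text embeddings pass through untouched; only the two leading positions differ between samples, which is what perturbs the attention sink.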
### 2.2 Pause Tokens (Google, Oct 2023)

**Mechanism:** Add learnable pause tokens to the embedding space. The model processes extra hidden vectors before committing to output.

**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%

**Critical requirement:** MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.

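One way the embedding-matrix change could look (a sketch with hypothetical names; initializing new rows at the existing embedding RMS so the pause tokens start at a natural scale):

```python
import torch
import torch.nn as nn

def add_pause_rows(embed: nn.Embedding, n_pause: int = 4) -> nn.Embedding:
    # Grow the embedding matrix by n_pause learnable rows.
    old = embed.weight.data
    rms = old.float().square().mean().sqrt().item()
    new_rows = torch.randn(n_pause, embed.embedding_dim, dtype=old.dtype) * rms
    grown = nn.Embedding(embed.num_embeddings + n_pause, embed.embedding_dim)
    grown.weight.data = torch.cat([old, new_rows], dim=0)
    return grown
```

The original vocabulary rows are copied verbatim, so pretrained behavior is preserved until the new rows are trained.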
### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

**Mechanism:** Feed the last hidden state back as the next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.

**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to a single path.

**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace the first k reasoning steps with k×c continuous thoughts
3. c is a hyperparameter controlling latent thought expansion

**Integration:** Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.

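The feedback loop above can be sketched as follows. This is our simplification, not Meta's code; `model` is assumed to map `[batch, seq, dim]` embeddings to last-layer hidden states of the same shape (no LM head):

```python
import torch

def continuous_thought_rollout(model, inputs_embeds: torch.Tensor,
                               n_thoughts: int) -> torch.Tensor:
    # COCONUT-style rollout: instead of decoding a token, append the final
    # position's hidden state as the next input embedding.
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds)
        inputs_embeds = torch.cat([inputs_embeds, hidden[:, -1:, :]], dim=1)
    return inputs_embeds
```

After `n_thoughts` latent steps, normal token decoding resumes from the extended embedding sequence.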
### 2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)

**Mechanism:** Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.

**Results:**
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost

**Integration:** DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.

### 2.5 ActAdd / Activation Engineering (Turner et al., 2023)

**Mechanism:** Compute a steering vector by contrasting intermediate activations on prompt pairs. Add it during the forward pass.

**Results:** SOTA on sentiment shift and detoxification.

**Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.

**Integration:** Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.

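A minimal sketch of the contrast-and-add mechanism (names ours; this is the generic technique, not our layer-48 extraction code):

```python
import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations from a contrasting prompt pair at the
    # chosen layer. acts_*: [seq, hidden].
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def apply_steering(hidden: torch.Tensor, vec: torch.Tensor,
                   coeff: float = 1.0) -> torch.Tensor:
    # Add the steering vector to the residual stream during the forward pass.
    return hidden + coeff * vec
```

The coefficient is the main tuning knob; too large a magnitude degrades fluency before it changes behavior.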
### 2.6 Planning Tokens (ICLR 2024)

**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

**Integration:** Add to embedding matrix, train end-to-end with APOLLO.

---

## 3. Our Setup

**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context

**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom

**Optimizer:** APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights

**Stack:** Crane (Candle-based, 21K lines)

**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure

**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.
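The weight-footprint figure is simple arithmetic (bf16 stores 2 bytes per parameter):

```python
# Resident weight footprint of a 27B-parameter model in bf16
params = 27e9
weights_gb = params * 2 / 1e9  # 54.0 GB, comfortably under 192GB HBM3e
```

Optimizer state and activations come on top, but with APOLLO-Mini's SGD-like state the weights dominate.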
---

## 4. Recommended Implementation Order

### Tier 1: Immediate (High ROI, Low Risk)

**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction

```python
# Add pause tokens to the embedding space, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training
# (unsqueeze(0) first: expand cannot create new dims with -1)
inputs_embeds = torch.cat(
    [pause_tokens.unsqueeze(0).expand(batch, -1, -1), text_embeds], dim=1
)

# UPFT: only compute loss on the first 8 tokens of the reasoning span
# (in practice, offset indices by the pause-token count)
loss = loss_fn(logits[:, :8], targets[:, :8])
```

**2. Random Prefix Validation**
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning

### Tier 2: After Baseline (Medium Effort)

**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space

**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO

### Tier 3: Experimental

**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation

**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
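Multi-layer steering can be prototyped with PyTorch forward hooks before touching Crane. A sketch, assuming per-layer vectors keyed by the layer indices above (the helper name and `vectors` mapping are hypothetical):

```python
import torch

def register_steering_hooks(model_layers, vectors, coeff=1.0):
    # Add a per-layer steering vector to each chosen layer's output.
    # vectors: {layer_index: steering vector of hidden-dim size}
    handles = []
    for idx, vec in vectors.items():
        def hook(module, inputs, output, vec=vec):
            # Returning a non-None value replaces the layer's output
            return output + coeff * vec
        handles.append(model_layers[idx].register_forward_hook(hook))
    return handles  # call h.remove() on each handle to disable steering
```

Binding `vec=vec` as a default argument avoids the classic closure-over-loop-variable bug; per-layer `coeff` values would be the next refinement for the careful-scaling concern above.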

---

## 5. Implementation Details

### Computing Embedding RMS

```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```

### Pause Token Implementation in Crane

```rust
// In model forward pass
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```

### UPFT Loss Modification

```python
# Standard: loss over all tokens
# UPFT: loss only over prefix tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```

---

## 6. Evaluation Plan

### Benchmarks

| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

### Comparison Matrix

| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |

---

## 7. Open Questions

1. **Does random perturbation scale to 27B?** Tested on 4B; the effect may differ.
2. **Optimal token count for 27B?** 2 was optimal for 4B; this might change.
3. **DeltaNet interaction?** 75% linear attention is untested territory.
4. **Composition effects?** Prefix + steering + pause tokens together?
5. **GDN as continuous thought channel?** A novel research direction.

---

## 8. Risk Assessment

| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |

---

## 9. Timeline Estimate

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |

---

## 10. References

### Primary Sources

- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024

### Our Existing Work

- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure

---

*Research complete 2026-04-12. Ready for implementation.*