# Latent Reasoning Integration Plan for Qwen 3.5 27B

**Status:** Research complete, ready for implementation

**Date:** 2026-04-12

**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer

## Executive Summary

Recent research shows multiple approaches to improving LLM reasoning through latent-space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full-finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).

---

## 1. The Landscape

### Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued-pretraining overhead |

### Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

---

## 2. Detailed Technique Analysis

### 2.1 Random Prefix Perturbation (dl1683)

**Mechanism:** Prepend 2 random embedding-scale tokens before the input. This breaks attention-sink patterns and shifts the model into an "exploratory computation mode."

**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans

**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al., 2023). Under greedy decoding, degenerate patterns lock in; perturbation breaks this.

**Integration:** Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.

### 2.2 Pause Tokens (Google, Oct 2023)

**Mechanism:** Add learnable pause tokens to the embedding space. The model processes extra hidden vectors before committing to output.

**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%

**Critical requirement:** The model MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix and finetune with them prepended to reasoning prompts. A simple architectural change.

### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

**Mechanism:** Feed the last hidden state back as the next input embedding directly, with no decoding to tokens. This enables breadth-first-search-style reasoning.

**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously, avoiding premature commitment to a single path.

**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace the first k reasoning steps with k×c continuous thoughts
3. c is a hyperparameter controlling latent-thought expansion

**Integration:** Most promising for Qwen 3.5 - a curriculum approach from CoT → latent reasoning.

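To make the feedback loop concrete, here is a toy sketch where `step` stands in for one transformer forward pass returning the last hidden state (an illustrative assumption, not the Crane API):

```python
import numpy as np

def latent_rollout(h0: np.ndarray, step, n_thoughts: int) -> np.ndarray:
    """Iterate continuous thoughts: each step's last hidden state is fed
    back as the next input embedding, with no decoding to tokens."""
    h = h0
    for _ in range(n_thoughts):
        h = step(h)  # stands in for one forward pass over the latent input
    return h

# Toy stand-in for the forward pass: a fixed linear map
W = np.eye(4) * 0.5
final = latent_rollout(np.ones(4), lambda h: W @ h, n_thoughts=3)
```

Because nothing is decoded between iterations, the full hidden vector (not a single sampled token) carries forward, which is what lets the latent state superpose alternative next steps.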
### 2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)

**Mechanism:** Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - diverse solutions to the same problem share their initial reasoning steps.

**Results:**
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost

**Integration:** DIRECTLY APPLICABLE. Train only on reasoning-prefix tokens. Massive efficiency gain with APOLLO-Mini.

### 2.5 ActAdd / Activation Engineering (Turner et al., 2023)

**Mechanism:** Compute a steering vector by contrasting intermediate activations on a pair of prompts, then add it during the forward pass.

**Results:** SOTA on sentiment shift and detoxification.

**Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.

**Integration:** Prototype behaviors with steering vectors, then train them permanently into the weights. Steering vector as specification → APOLLO training as compilation.

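The contrast computation is simple enough to sketch. A NumPy illustration, under the assumption that `acts_pos`/`acts_neg` are a layer's activations of shape `[seq, dim]` captured on the two contrasting prompts:

```python
import numpy as np

def steering_vector(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """ActAdd-style contrast: difference of mean-pooled layer activations
    on a positive/negative prompt pair."""
    return acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def apply_steering(hidden: np.ndarray, vec: np.ndarray, coeff: float = 1.0) -> np.ndarray:
    """Add the (scaled) steering vector to every position's hidden state."""
    return hidden + coeff * vec
```

The coefficient plays the same role as the magnitude-57 scaling in our layer-48 listening vector: it trades steering strength against degradation.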
### 2.6 Planning Tokens (ICLR 2024)

**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

**Integration:** Add to the embedding matrix and train end-to-end with APOLLO.

---

## 3. Our Setup

**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context

**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom

**Optimizer:** APOLLO-Mini
- Full-parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights

**Stack:** Crane (Candle-based, 21K lines)

**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure

**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous-thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.

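A quick back-of-envelope check of the headroom claim (the 1/1024 optimizer-state figure is taken from the APOLLO-Mini claim above, not measured):

```python
params = 27e9                            # Qwen 3.5 27B
weights_gb = params * 2 / 1e9            # bf16 weights: 2 bytes/param -> 54 GB
adamw_state_gb = params * 8 / 1e9        # AdamW: two fp32 moments, 8 bytes/param
apollo_state_gb = adamw_state_gb / 1024  # APOLLO-Mini: ~1/1024th of AdamW state
```

Even adding bf16 gradients (another ~54 GB), roughly half of the 192 GB card remains for activations, which is what makes full-parameter finetuning feasible here.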
---
## 4. Recommended Implementation Order

### Tier 1: Immediate (High ROI, Low Risk)

**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction

```python
# Add learnable pause tokens, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training (expand across the batch;
# unsqueeze first so the 2D parameter broadcasts to [batch, 4, embed_dim])
inputs_embeds = torch.cat(
    [pause_tokens.unsqueeze(0).expand(batch, -1, -1), text_embeds], dim=1
)

# UPFT: only compute loss on the first 8 tokens of the reasoning target
loss = loss_fn(logits[:, :8], targets[:, :8])
```

**2. Random Prefix Validation**
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning

### Tier 2: After Baseline (Medium Effort)

**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space

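A tiny bookkeeping helper for the staging (names are hypothetical; `c` is the latent-expansion hyperparameter from §2.3):

```python
def coconut_schedule(stage: int, n_steps: int, c: int = 1) -> tuple[int, int]:
    """Stage s replaces the first s reasoning steps of an n_steps CoT trace
    with s*c continuous thoughts; returns (n_latent_thoughts, n_text_steps)."""
    k = min(stage, n_steps)
    return k * c, n_steps - k
```

For a 5-step trace with `c=2`, stage 2 trains on 4 latent thoughts followed by the remaining 3 text steps; by stage 5 the whole trace has moved into latent space.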
**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO

### Tier 3: Experimental

**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation

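One way to sketch this: a per-layer steering table, with each vector rescaled relative to the residual-stream norm so a single coefficient has comparable strength at every layer. The table contents and the norm-matching heuristic are assumptions for illustration, not a settled design:

```python
import numpy as np

def steer(hidden: np.ndarray, layer: int,
          table: dict[int, tuple[np.ndarray, float]]) -> np.ndarray:
    """Add a per-layer steering vector (table: layer -> (vector, coeff)),
    rescaled to the mean residual-stream norm to avoid degradation."""
    if layer not in table:
        return hidden
    vec, coeff = table[layer]
    scale = np.linalg.norm(hidden, axis=-1, keepdims=True).mean() / (np.linalg.norm(vec) + 1e-8)
    return hidden + coeff * scale * vec
```

With layers 40, 48, and 56 in the table, the same hook applies a different vector at each of the three attention-layer depths.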
**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings

---

## 5. Implementation Details

### Computing Embedding RMS

```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```

### Pause Token Implementation in Crane

```rust
// In the model forward pass: embed the text tokens, then prepend the
// learnable pause-token embeddings along the sequence dimension.
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```

### UPFT Loss Modification

```python
import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the first prefix_len tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```

---

## 6. Evaluation Plan

### Benchmarks

| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

### Comparison Matrix

| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |

---

## 7. Open Questions

1. **Does random perturbation scale to 27B?** Tested on 4B - the effect may differ
2. **Optimal token count for 27B?** 2 was optimal for 4B; this might change
3. **DeltaNet interaction?** 75% linear attention is untested territory
4. **Composition effects?** Prefix + steering + pause tokens together?
5. **GDN as continuous-thought channel?** A novel research direction

---

## 8. Risk Assessment

| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full-token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |

---

## 9. Timeline Estimate

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |

---

## 10. References

### Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024

### Our Existing Work
- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure

---

*Research complete 2026-04-12. Ready for implementation.*