research: latent reasoning integration plans for Qwen 3.5 27B
Two research documents:

- latent-reasoning-integration-plan.md: Synthesizes 10+ papers on latent reasoning, identifies which approaches work with finetuning (vs. requiring pretraining from scratch), and maps them to our APOLLO-Mini training pipeline.
- pause-tokens-gdn-recurrence.md: Explores the connection between token-based latent reasoning and GDN's internal recurrence. Key insight: pause tokens on Qwen 3.5 trigger both forward passes AND recurrent state updates, a double benefit.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent dcd647764c, commit f06c8077e1. 2 changed files with 588 additions and 0 deletions.

docs/latent-reasoning-integration-plan.md (new file, +300 lines)
# Latent Reasoning Integration Plan for Qwen 3.5 27B

**Status:** Research complete, ready for implementation
**Date:** 2026-04-12
**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer

## Executive Summary

Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).

---

## 1. The Landscape

### Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |

### Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

---

## 2. Detailed Technique Analysis

### 2.1 Random Prefix Perturbation (dl1683)

**Mechanism:** Prepend 2 random embedding-scale tokens before the input. Breaks attention sink patterns, shifts the model into an "exploratory computation mode."

**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans

**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al. 2024). Under greedy decoding, degenerate patterns lock in. Perturbation breaks this.

**Integration:** Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.

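A minimal sketch of the mechanism, assuming the 2-token, embedding-RMS-scaled setup above (the function name and shapes are ours, not from the dl1683 repo):

```python
import torch

def add_random_prefix(text_embeds: torch.Tensor, embed_rms: float,
                      n_prefix: int = 2) -> torch.Tensor:
    """Prepend n_prefix random vectors, scaled to the embedding RMS,
    to a [batch, seq, dim] batch of input embeddings."""
    batch, _, dim = text_embeds.shape
    prefix = torch.randn(batch, n_prefix, dim,
                         device=text_embeds.device,
                         dtype=text_embeds.dtype) * embed_rms
    return torch.cat([prefix, text_embeds], dim=1)
```

The original text embeddings pass through untouched; only the two leading positions differ between samples, which is what perturbs the attention sink.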
### 2.2 Pause Tokens (Google, Oct 2023)

**Mechanism:** Add learnable pause tokens to the embedding space. The model processes extra hidden vectors before committing to output.

**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%

**Critical requirement:** MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.

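One way the embedding-matrix change could look (a sketch with hypothetical names; initializing new rows at the existing embedding RMS so the pause tokens start at a natural scale):

```python
import torch
import torch.nn as nn

def add_pause_rows(embed: nn.Embedding, n_pause: int = 4) -> nn.Embedding:
    # Grow the embedding matrix by n_pause learnable rows.
    old = embed.weight.data
    rms = old.float().square().mean().sqrt().item()
    new_rows = torch.randn(n_pause, embed.embedding_dim, dtype=old.dtype) * rms
    grown = nn.Embedding(embed.num_embeddings + n_pause, embed.embedding_dim)
    grown.weight.data = torch.cat([old, new_rows], dim=0)
    return grown
```

The original vocabulary rows are copied verbatim, so pretrained behavior is preserved until the new rows are trained.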
### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

**Mechanism:** Feed the last hidden state back as the next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.

**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to a single path.

**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace the first k reasoning steps with k×c continuous thoughts
3. c is a hyperparameter controlling latent thought expansion

**Integration:** Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.

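The feedback loop above can be sketched as follows. This is our simplification, not Meta's code; `model` is assumed to map `[batch, seq, dim]` embeddings to last-layer hidden states of the same shape (no LM head):

```python
import torch

def continuous_thought_rollout(model, inputs_embeds: torch.Tensor,
                               n_thoughts: int) -> torch.Tensor:
    # COCONUT-style rollout: instead of decoding a token, append the final
    # position's hidden state as the next input embedding.
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds)
        inputs_embeds = torch.cat([inputs_embeds, hidden[:, -1:, :]], dim=1)
    return inputs_embeds
```

After `n_thoughts` latent steps, normal token decoding resumes from the extended embedding sequence.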
### 2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)

**Mechanism:** Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.

**Results:**
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost

**Integration:** DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.

### 2.5 ActAdd / Activation Engineering (Turner et al., 2023)

**Mechanism:** Compute a steering vector by contrasting intermediate activations on prompt pairs. Add it during the forward pass.

**Results:** SOTA on sentiment shift and detoxification.

**Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.

**Integration:** Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.

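A minimal sketch of the contrast-and-add mechanism (names ours; this is the generic technique, not our layer-48 extraction code):

```python
import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations from a contrasting prompt pair at the
    # chosen layer. acts_*: [seq, hidden].
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def apply_steering(hidden: torch.Tensor, vec: torch.Tensor,
                   coeff: float = 1.0) -> torch.Tensor:
    # Add the steering vector to the residual stream during the forward pass.
    return hidden + coeff * vec
```

The coefficient is the main tuning knob; too large a magnitude degrades fluency before it changes behavior.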
### 2.6 Planning Tokens (ICLR 2024)

**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

**Integration:** Add to embedding matrix, train end-to-end with APOLLO.

---

## 3. Our Setup

**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context

**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom

**Optimizer:** APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights

**Stack:** Crane (Candle-based, 21K lines)

**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure

**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.
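The weight-footprint figure is simple arithmetic (bf16 stores 2 bytes per parameter):

```python
# Resident weight footprint of a 27B-parameter model in bf16
params = 27e9
weights_gb = params * 2 / 1e9  # 54.0 GB, comfortably under 192GB HBM3e
```

Optimizer state and activations come on top, but with APOLLO-Mini's SGD-like state the weights dominate.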
---

## 4. Recommended Implementation Order

### Tier 1: Immediate (High ROI, Low Risk)

**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction

```python
# Add pause tokens to the embedding space, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training
# (unsqueeze(0) first: expand cannot create new dims with -1)
inputs_embeds = torch.cat(
    [pause_tokens.unsqueeze(0).expand(batch, -1, -1), text_embeds], dim=1
)

# UPFT: only compute loss on the first 8 tokens of the reasoning span
# (in practice, offset indices by the pause-token count)
loss = loss_fn(logits[:, :8], targets[:, :8])
```

**2. Random Prefix Validation**
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning

### Tier 2: After Baseline (Medium Effort)

**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space

**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO

### Tier 3: Experimental

**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation

**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
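Multi-layer steering can be prototyped with PyTorch forward hooks before touching Crane. A sketch, assuming per-layer vectors keyed by the layer indices above (the helper name and `vectors` mapping are hypothetical):

```python
import torch

def register_steering_hooks(model_layers, vectors, coeff=1.0):
    # Add a per-layer steering vector to each chosen layer's output.
    # vectors: {layer_index: steering vector of hidden-dim size}
    handles = []
    for idx, vec in vectors.items():
        def hook(module, inputs, output, vec=vec):
            # Returning a non-None value replaces the layer's output
            return output + coeff * vec
        handles.append(model_layers[idx].register_forward_hook(hook))
    return handles  # call h.remove() on each handle to disable steering
```

Binding `vec=vec` as a default argument avoids the classic closure-over-loop-variable bug; per-layer `coeff` values would be the next refinement for the careful-scaling concern above.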

---

## 5. Implementation Details

### Computing Embedding RMS

```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```

### Pause Token Implementation in Crane

```rust
// In model forward pass
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```

### UPFT Loss Modification

```python
# Standard: loss over all tokens
# UPFT: loss only over prefix tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```

---

## 6. Evaluation Plan

### Benchmarks

| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

### Comparison Matrix

| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |

---

## 7. Open Questions

1. **Does random perturbation scale to 27B?** Tested on 4B; the effect may differ.
2. **Optimal token count for 27B?** 2 was optimal for 4B; this might change.
3. **DeltaNet interaction?** 75% linear attention is untested territory.
4. **Composition effects?** Prefix + steering + pause tokens together?
5. **GDN as continuous thought channel?** A novel research direction.

---

## 8. Risk Assessment

| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |

---

## 9. Timeline Estimate

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |

---

## 10. References

### Primary Sources

- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024

### Our Existing Work

- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure

---

*Research complete 2026-04-12. Ready for implementation.*