Two research documents:
- latent-reasoning-integration-plan.md: synthesizes 10+ papers on latent reasoning, identifies which approaches work with finetuning (vs. requiring pretraining from scratch), and maps them to our APOLLO-Mini training pipeline.
- pause-tokens-gdn-recurrence.md: explores the connection between token-based latent reasoning and GDN's internal recurrence. Key insight: pause tokens on Qwen 3.5 trigger both forward passes and recurrent state updates, a double benefit.
Latent Reasoning Integration Plan for Qwen 3.5 27B
Status: Research complete, ready for implementation
Date: 2026-04-12
Hardware: B200 (192GB HBM3e), APOLLO-Mini optimizer
Executive Summary
Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).
1. The Landscape
Approaches That Require Pretraining (Not Applicable)
| Technique | Why Not |
|---|---|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |
Approaches Compatible with Finetuning (Our Focus)
| Technique | Overhead | Training Required | Proven On |
|---|---|---|---|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |
2. Detailed Technique Analysis
2.1 Random Prefix Perturbation (dl1683)
Mechanism: Prepend 2 random embedding-scale tokens before input. Breaks attention sink patterns, shifts model into "exploratory computation mode."
Results:
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans
Why it works: the first few tokens accumulate disproportionate attention (Xiao et al., 2023). Under greedy decoding, degenerate patterns lock in; the perturbation breaks this.
Integration: Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.
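Since this needs no training, it can be validated immediately at inference. A minimal sketch (the helper name is ours, not from the dl1683 repo): sample Gaussian noise scaled to the embedding matrix's RMS and prepend it to the prompt embeddings.

```python
import torch

def make_random_prefix(embed_weight: torch.Tensor, n_tokens: int = 2) -> torch.Tensor:
    """Sample random prefix 'tokens' scaled to the embedding RMS.

    Matching the RMS keeps the noise in-distribution for the first
    transformer layer while still breaking the attention-sink pattern.
    """
    embed_rms = embed_weight.float().square().mean().sqrt()
    return torch.randn(n_tokens, embed_weight.shape[-1]) * embed_rms

# Usage: prepend to the prompt embeddings before the forward pass, e.g.
# inputs_embeds = torch.cat(
#     [prefix.unsqueeze(0).expand(batch, -1, -1), text_embeds], dim=1)
```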
2.2 Pause Tokens (Google, Oct 2023)
Mechanism: Add learnable pause tokens to embedding space. Model processes extra hidden vectors before committing to output.
Results (1B model):
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%
Critical requirement: the model MUST be both pretrained AND finetuned with pause tokens. Inference-time delays alone, without training, do not help.
Integration: Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.
2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)
Mechanism: Feed last hidden state back as next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.
Why it matters: Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to single path.
Training approach:
- Initial stage: train on regular CoT examples
- Subsequent stages: replace first k reasoning steps with k×c continuous thoughts
- c is hyperparameter controlling latent thought expansion
Integration: Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.
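The feedback loop can be sketched as follows. This assumes a callable that maps input embeddings to last-layer hidden states; `coconut_rollout` is an illustrative name, not Meta's code.

```python
import torch

def coconut_rollout(model, inputs_embeds: torch.Tensor, n_thoughts: int) -> torch.Tensor:
    """Append n continuous thoughts: each step feeds the last position's
    hidden state back in as the next input embedding, skipping decoding.

    model: maps (batch, seq, dim) input embeddings -> (batch, seq, dim)
           last-layer hidden states.
    """
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds)
        thought = hidden[:, -1:, :]  # last position's hidden state only
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
    return inputs_embeds
```

During curriculum stage k, the first k reasoning steps in the training data would be replaced by k×c such latent thoughts.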
2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)
Mechanism: Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.
Results:
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost
Integration: DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.
2.5 ActAdd / Activation Engineering (Turner et al., 2023)
Mechanism: Compute steering vector by contrasting intermediate activations on prompt pairs. Add during forward pass.
Results: SOTA on sentiment shift and detoxification.
Our existing work: "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.
Integration: Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.
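The pattern in PyTorch terms (function names are ours; a Crane port would modify the layer-48 output the same way): extract a vector by contrasting activations on a prompt pair, then add it during the forward pass via a hook.

```python
import torch
import torch.nn as nn

def extract_steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """ActAdd-style vector: activation difference for a contrast prompt
    pair at one layer. acts_*: (seq, dim) hidden states; we contrast
    the final position."""
    return acts_pos[-1] - acts_neg[-1]

def add_steering_hook(layer: nn.Module, steer: torch.Tensor, coeff: float):
    """Register a forward hook that adds coeff * steer to the layer output."""
    def hook(module, inputs, output):
        return output + coeff * steer
    return layer.register_forward_hook(hook)
```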
2.6 Planning Tokens (ICLR 2024)
Mechanism: Learnable token embeddings added before each reasoning step. <0.001% additional parameters.
Integration: Add to embedding matrix, train end-to-end with APOLLO.
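A hedged sketch of the per-step insertion (class and interface are illustrative, not the ICLR 2024 reference implementation): one learnable embedding row, prepended to each reasoning step's embeddings.

```python
import torch
import torch.nn as nn

class PlanningToken(nn.Module):
    """A learnable planning embedding inserted before each reasoning step.

    A single (1, embed_dim) parameter is a negligible fraction of a 27B
    model, consistent with the paper's <0.001% overhead claim.
    """
    def __init__(self, embed_dim: int, embed_rms: float = 0.02):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(1, embed_dim) * embed_rms)

    def forward(self, step_embeds: list) -> torch.Tensor:
        """step_embeds: list of (step_len, dim) tensors, one per reasoning
        step; returns the full sequence with the token before each step."""
        return torch.cat(
            [torch.cat([self.embed, s], dim=0) for s in step_embeds], dim=0)
```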
3. Our Setup
Model: Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context
Hardware: B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom
Optimizer: APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights
Stack: Crane (Candle-based, 21K lines)
Existing work:
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure
Unique advantage: Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.
4. Recommended Implementation Order
Tier 1: Immediate (High ROI, Low Risk)
1. Pause Tokens + UPFT Combination
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction
```python
import torch
import torch.nn as nn

# Learnable pause tokens, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training
# (expand prepends the batch dimension to the (4, embed_dim) parameter)
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: only compute loss on the first 8 tokens of the reasoning prefix
loss = loss_fn(logits[:, :8], targets[:, :8])
```
2. Random Prefix Validation
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning
Tier 2: After Baseline (Medium Effort)
3. COCONUT Curriculum
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space
4. Steering Vector Integration
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO
Tier 3: Experimental
5. Multi-layer Steering
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation
6. DeltaNet-Specific Optimization
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
5. Implementation Details
Computing Embedding RMS
```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```
Pause Token Implementation in Crane
```rust
// In model forward pass
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```
UPFT Loss Modification
```python
import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the first prefix_len tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```
6. Evaluation Plan
Benchmarks
| Benchmark | What It Tests | Baseline Needed |
|---|---|---|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |
Comparison Matrix
| Configuration | Training Time | Expected Gain |
|---|---|---|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |
7. Open Questions
- Does random perturbation scale to 27B? Tested on 4B - effect may differ
- Optimal token count for 27B? 2 optimal for 4B, might change
- DeltaNet interaction? 75% linear attention is untested territory
- Composition effects? Prefix + steering + pause tokens together?
- GDN as continuous thought channel? Novel research direction
8. Risk Assessment
| Risk | Mitigation |
|---|---|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |
9. Timeline Estimate
| Phase | Duration | Deliverable |
|---|---|---|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |
10. References
Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024
Our Existing Work
- steering-vector-empirical - listening vector extraction
- skills-apollo-optimizer-qwen35-gotcha - APOLLO parameter grouping
- qwen-3-5-27b-architecture-findings - model architecture details
- training-pipeline-fused-inference-training-mar27 - training infrastructure
Research complete 2026-04-12. Ready for implementation.