# Latent Reasoning Integration Plan for Qwen 3.5 27B

**Status:** Research complete, ready for implementation
**Date:** 2026-04-12
**Hardware:** B200 (192GB HBM3e), APOLLO-Mini optimizer

## Executive Summary

Recent research offers multiple approaches to improving LLM reasoning through latent-space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).

---

## 1. The Landscape

### Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
|-----------|---------|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |

### Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
|-----------|----------|-------------------|-----------|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

---

## 2. Detailed Technique Analysis

### 2.1 Random Prefix Perturbation (dl1683)

**Mechanism:** Prepend 2 random embedding-scale tokens before the input. This breaks attention-sink patterns and shifts the model into an "exploratory computation mode."

**Results:**
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues degenerate 14-word outputs into 650+ word plans

**Why it works:** The first few tokens accumulate disproportionate attention (Xiao et al., 2023). Under greedy decoding, degenerate patterns lock in. Perturbation breaks this.

**Integration:** Zero training required. Test at inference first (see the sketch in Section 5), then consider training WITH random prefixes to internalize the exploration behavior.

### 2.2 Pause Tokens (Google, Oct 2023)

**Mechanism:** Add learnable pause tokens to the embedding space. The model processes extra hidden vectors before committing to output.

**Results (1B model):**
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%

**Critical requirement:** The model MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

**Integration:** Add 2-4 learnable tokens to Qwen's embedding matrix and finetune with them prepended to reasoning prompts. A simple architectural change.

### 2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

**Mechanism:** Feed the last hidden state back as the next input embedding directly (no decoding to tokens). Enables breadth-first-search-style reasoning.

**Why it matters:** Continuous thoughts can encode multiple alternative next steps simultaneously, avoiding premature commitment to a single path.

**Training approach:**
1. Initial stage: train on regular CoT examples
2. Subsequent stages: replace the first k reasoning steps with k×c continuous thoughts
3. c is a hyperparameter controlling latent thought expansion

**Integration:** The most promising option for Qwen 3.5 - a curriculum from explicit CoT to latent reasoning.
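To make the feedback loop concrete, here is a schematic inference-time sketch. It assumes an HF-style decoder that accepts `inputs_embeds` and returns hidden states; the function name is ours, and real COCONUT additionally brackets latent thoughts with special begin/end-of-thought tokens and relies on the staged training above:

```python
import torch

@torch.no_grad()
def append_continuous_thoughts(model, inputs_embeds, n_thoughts=4):
    """Run n_thoughts latent steps: each step takes the final-layer hidden
    state at the last position and feeds it back as the next input
    embedding, with no decoding to a discrete token in between."""
    for _ in range(n_thoughts):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # (batch, 1, hidden_dim)
        inputs_embeds = torch.cat([inputs_embeds, last_hidden], dim=1)
    return inputs_embeds  # prompt + latent thoughts; decode normally from here
```

Each latent step here re-runs the whole prefix for clarity; with KV caching (or, on the GDN layers, the recurrent state noted in Section 3), each step would only need to process one position.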
Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions. **Results:** - Matches Rejection Sampling Fine-Tuning performance - 75% reduction in training time - 99% reduction in sampling cost **Integration:** DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini. ### 2.5 ActAdd / Activation Engineering (Turner et al., 2023) **Mechanism:** Compute steering vector by contrasting intermediate activations on prompt pairs. Add during forward pass. **Results:** SOTA on sentiment shift and detoxification. **Our existing work:** "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61. **Integration:** Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation. ### 2.6 Planning Tokens (ICLR 2024) **Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters. **Integration:** Add to embedding matrix, train end-to-end with APOLLO. --- ## 3. Our Setup **Model:** Qwen 3.5 27B - 64 layers, 5120 hidden dim - 75% DeltaNet (linear attention) / 25% standard attention - Native 262K context **Hardware:** B200 (192GB HBM3e) - 27B in bf16: ~54GB - Massive headroom **Optimizer:** APOLLO-Mini - Full parameter finetuning - SGD-like memory (1/1024th of AdamW) - Parameter grouping for 3D conv1d weights **Stack:** Crane (Candle-based, 21K lines) **Existing work:** - Steering vector extraction (listening: layer 48, cosine 0.61) - Memory scoring infrastructure **Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged. --- ## 4. Recommended Implementation Order ### Tier 1: Immediate (High ROI, Low Risk) **1. Pause Tokens + UPFT Combination** - Add 2-4 learnable tokens to embedding space - Train only on 8-token reasoning prefixes - Both work with existing architecture - 75% training time reduction ```python # Add pause tokens to embedding matrix pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms) # Prepend to reasoning inputs during training inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1) # UPFT: only compute loss on first 8 tokens of reasoning loss = loss_fn(logits[:, :8], targets[:, :8]) ``` **2. Random Prefix Validation** - Compute Qwen 3.5 27B embedding RMS - Test 2-token random prefix at inference - Establish baseline before finetuning ### Tier 2: After Baseline (Medium Effort) **3. COCONUT Curriculum** - Stage 1: Fine-tune on CoT examples normally - Stage 2: Replace first reasoning step with continuous thought - Stage 3: Replace first 2 steps - Gradually move reasoning into latent space **4. Steering Vector Integration** - Extract reasoning-specific directions (not just "listening") - Test combinations: prefix + layer-48 steering - Bake successful vectors into weights via APOLLO ### Tier 3: Experimental **5. Multi-layer Steering** - Our layers of interest: 40, 48, 56 (covering the attention layers) - Different vectors per layer - Careful scaling to avoid degradation **6. DeltaNet-Specific Optimization** - The 75% DeltaNet architecture may respond differently - GDN recurrent state as "continuous thought" channel - This is unexplored territory - potential for novel findings --- ## 5. 
### 2.6 Planning Tokens (ICLR 2024)

**Mechanism:** Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

**Integration:** Add to the embedding matrix and train end-to-end with APOLLO.

---

## 3. Our Setup

**Model:** Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context

**Hardware:** B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom

**Optimizer:** APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights

**Stack:** Crane (Candle-based, 21K lines)

**Existing work:**
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure

**Unique advantage:** Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.

---

## 4. Recommended Implementation Order

### Tier 1: Immediate (High ROI, Low Risk)

**1. Pause Tokens + UPFT Combination**
- Add 2-4 learnable tokens to the embedding space
- Train only on 8-token reasoning prefixes
- Both work with the existing architecture
- 75% training time reduction

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Add pause tokens to the embedding matrix, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: only compute loss on the first 8 tokens of the reasoning
loss = F.cross_entropy(
    logits[:, :8].reshape(-1, vocab_size),
    targets[:, :8].reshape(-1),
)
```

**2. Random Prefix Validation**
- Compute the Qwen 3.5 27B embedding RMS
- Test a 2-token random prefix at inference
- Establish a baseline before finetuning

### Tier 2: After Baseline (Medium Effort)

**3. COCONUT Curriculum**
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace the first reasoning step with a continuous thought
- Stage 3: Replace the first 2 steps
- Gradually move reasoning into latent space

**4. Steering Vector Integration**
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into the weights via APOLLO

### Tier 3: Experimental

**5. Multi-layer Steering**
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation

**6. DeltaNet-Specific Optimization**
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as a "continuous thought" channel
- This is unexplored territory - potential for novel findings

---

## 5. Implementation Details

### Computing Embedding RMS

```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```

### Pause Token Implementation in Crane

```rust
// In the model forward pass
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```

### UPFT Loss Modification

```python
import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the prefix tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```
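### Random Prefix Construction (Sketch)

For the Tier 1 validation run, a minimal dl1683-style sketch, assuming an HF-style model whose `generate` accepts `inputs_embeds`; the wrapper name and generation settings are ours:

```python
import torch

@torch.no_grad()
def generate_with_random_prefix(model, tokenizer, prompt, n_prefix=2, seed=0):
    """Prepend n_prefix random embedding-scale vectors before the prompt."""
    torch.manual_seed(seed)
    embed = model.get_input_embeddings()
    embed_rms = embed.weight.float().square().mean().sqrt().item()
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    text_embeds = embed(ids)
    # Scale the random vectors to the typical magnitude of real token embeddings
    prefix = embed_rms * torch.randn(
        1, n_prefix, text_embeds.size(-1),
        device=text_embeds.device, dtype=text_embeds.dtype,
    )
    inputs_embeds = torch.cat([prefix, text_embeds], dim=1)
    # Greedy decoding, where the attention-sink lock-in effect is strongest
    return model.generate(inputs_embeds=inputs_embeds, max_new_tokens=512, do_sample=False)
```

Running a few seeds against the no-prefix baseline is the cheapest way to check whether the 4B-scale effect survives at 27B (Open Question 1).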
---

## 6. Evaluation Plan

### Benchmarks

| Benchmark | What It Tests | Baseline Needed |
|-----------|---------------|-----------------|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

### Comparison Matrix

| Configuration | Training Time | Expected Gain |
|---------------|---------------|---------------|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |

---

## 7. Open Questions

1. **Does random perturbation scale to 27B?** Tested on 4B - the effect may differ
2. **Optimal token count for 27B?** 2 was optimal for 4B; this may change at 27B
3. **DeltaNet interaction?** 75% linear attention is untested territory
4. **Composition effects?** Do prefix, steering, and pause tokens combine?
5. **GDN as continuous thought channel?** A novel research direction

---

## 8. Risk Assessment

| Risk | Mitigation |
|------|------------|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |

---

## 9. Timeline Estimate

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |

---

## 10. References

### Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Goyal et al. (Google), "Think before you speak: Training Language Models With Pause Tokens" (Oct 2023)
- COCONUT: Hao et al. (Meta), "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: "Guiding Language Model Reasoning with Planning Tokens" (ICLR 2024)

### Our Existing Work
- `steering-vector-empirical` - listening vector extraction
- `skills-apollo-optimizer-qwen35-gotcha` - APOLLO parameter grouping
- `qwen-3-5-27b-architecture-findings` - model architecture details
- `training-pipeline-fused-inference-training-mar27` - training infrastructure

---

*Research complete 2026-04-12. Ready for implementation.*