consciousness/docs/latent-reasoning-integration-plan.md
Kent Overstreet f06c8077e1 research: latent reasoning integration plans for Qwen 3.5 27B

Latent Reasoning Integration Plan for Qwen 3.5 27B

Status: Research complete, ready for implementation
Date: 2026-04-12
Hardware: B200 (192GB HBM3e), APOLLO-Mini optimizer

Executive Summary

Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).


1. The Landscape

Approaches That Require Pretraining (Not Applicable)

| Technique | Why Not |
| --- | --- |
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |

Approaches Compatible with Finetuning (Our Focus)

| Technique | Overhead | Training Required | Proven On |
| --- | --- | --- | --- |
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |

2. Detailed Technique Analysis

2.1 Random Prefix Perturbation (dl1683)

Mechanism: Prepend 2 random embedding-scale tokens before input. Breaks attention sink patterns, shifts model into "exploratory computation mode."

Results:

  • Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
  • 100% oracle coverage on 25/25 tasks
  • Planning: rescues 14-word failures into 650+ word plans

Why it works: the first few tokens accumulate disproportionate attention (Xiao et al., 2023). Under greedy decoding, degenerate attention patterns lock in; perturbing the prefix breaks them.

Integration: Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.
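A minimal inference-time sketch of the technique, assuming an HF-style model that exposes get_input_embeddings() and accepts inputs_embeds (the helper name is ours):

```python
import torch

def random_prefix_embeds(model, input_ids, n_prefix=2):
    """Prepend n_prefix random embedding-scale vectors before the input.

    Hypothetical helper: assumes an HF-style model exposing
    get_input_embeddings() and a forward that accepts inputs_embeds.
    """
    embed = model.get_input_embeddings()
    text_embeds = embed(input_ids)  # (batch, seq, dim)
    # Scale the random vectors to the embedding-matrix RMS so they look
    # like plausible tokens rather than out-of-distribution noise.
    rms = embed.weight.float().square().mean().sqrt()
    prefix = torch.randn(
        input_ids.shape[0], n_prefix, text_embeds.shape[-1],
        device=text_embeds.device, dtype=text_embeds.dtype,
    ) * rms
    return torch.cat([prefix, text_embeds], dim=1)
```

The resulting inputs_embeds can be passed straight to generation; no weights change, so this composes freely with everything below.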

2.2 Pause Tokens (Google, Oct 2023)

Mechanism: Add learnable pause tokens to embedding space. Model processes extra hidden vectors before committing to output.

Results (1B model):

  • SQuAD: +18% EM score
  • CommonSenseQA: +8%
  • GSM8K: +1%

Critical requirement: MUST be both pretrained AND finetuned with pause tokens. Inference-time-only delays don't work without training.

Integration: Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.

2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)

Mechanism: Feed last hidden state back as next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.

Why it matters: Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to single path.

Training approach:

  1. Initial stage: train on regular CoT examples
  2. Subsequent stages: replace first k reasoning steps with k×c continuous thoughts
  3. c is hyperparameter controlling latent thought expansion

Integration: the most promising approach for Qwen 3.5 - a staged curriculum moving from explicit CoT into latent reasoning.
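The core feedback mechanism - last hidden state fed back as the next input embedding, with no decode to a token in between - can be sketched as follows. This is a hedged illustration assuming an HF-style forward that accepts inputs_embeds and returns hidden states; a production loop would reuse the KV cache rather than re-encoding the prefix each step:

```python
import torch

def continuous_thoughts(model, inputs_embeds, n_thoughts=2):
    """COCONUT-style latent rollout (sketch; name and loop are ours).

    Each step appends the final-layer hidden state at the last position
    as the next input embedding, never decoding to a token in between.
    """
    embeds = inputs_embeds
    for _ in range(n_thoughts):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]  # (batch, 1, dim)
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds  # original prefix + n_thoughts continuous thoughts
```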

2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)

Mechanism: Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.

Results:

  • Matches Rejection Sampling Fine-Tuning performance
  • 75% reduction in training time
  • 99% reduction in sampling cost

Integration: DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.
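As a concrete sketch of what "train only on prefixes" means for data preparation - the helper name is ours, and the -100 label-masking convention is assumed from HF-style training loops:

```python
def upft_examples(tokenizer, prompts, sampled_solutions, prefix_len=8):
    """Build UPFT training pairs: prompt + only the first prefix_len
    tokens of a single sampled solution (hypothetical sketch).

    Labels use -100 on the prompt so loss falls only on the prefix.
    """
    examples = []
    for prompt, solution in zip(prompts, sampled_solutions):
        prompt_ids = tokenizer.encode(prompt)
        prefix_ids = tokenizer.encode(solution)[:prefix_len]
        examples.append({
            "input_ids": prompt_ids + prefix_ids,
            # Mask the prompt out of the loss; keep only the prefix.
            "labels": [-100] * len(prompt_ids) + prefix_ids,
        })
    return examples
```

Only one sampled solution per question is needed, which is where the sampling-cost reduction comes from.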

2.5 ActAdd / Activation Engineering (Turner et al., 2023)

Mechanism: Compute steering vector by contrasting intermediate activations on prompt pairs. Add during forward pass.

Results: SOTA on sentiment shift and detoxification.

Our existing work: "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.

Integration: Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.
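A minimal sketch of the contrast-and-add step via a PyTorch forward hook, assuming the decoder blocks are reachable as model.layers and that activations for the contrasting prompt pair have already been cached; actadd_hook and its arguments are our naming:

```python
import torch

def actadd_hook(model, layer_idx, act_pos, act_neg, coeff=4.0):
    """Register a forward hook that adds an ActAdd-style steering vector.

    act_pos / act_neg: cached activations (batch, seq, dim) at layer_idx
    for the contrasting prompt pair; their mean difference is the vector.
    Returns the handle so steering can be undone with .remove().
    """
    vec = (act_pos - act_neg).mean(dim=(0, 1))  # one direction per layer

    def hook(module, inputs, output):
        if isinstance(output, tuple):  # decoder blocks often return (hidden, ...)
            return (output[0] + coeff * vec.to(output[0].dtype),) + output[1:]
        return output + coeff * vec.to(output.dtype)

    return model.layers[layer_idx].register_forward_hook(hook)
```

Vectors that prove useful under this hook are candidates for baking into the weights during the APOLLO run, per the specification-then-compilation workflow above.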

2.6 Planning Tokens (ICLR 2024)

Mechanism: Learnable token embeddings added before each reasoning step. <0.001% additional parameters.

Integration: Add to embedding matrix, train end-to-end with APOLLO.
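The placement rule - one learnable token before each reasoning step - reduces to a simple transform over tokenized steps (helper name ours; the embedding row for plan_token_id would be a new learnable parameter, as in the pause-token case):

```python
def insert_planning_tokens(step_token_ids, plan_token_id):
    """Prepend a planning token before each reasoning step.

    step_token_ids: list of per-step token-id lists.
    plan_token_id: id of a newly added learnable embedding row.
    """
    out = []
    for step in step_token_ids:
        out.append(plan_token_id)  # planning token marks the step boundary
        out.extend(step)
    return out
```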


3. Our Setup

Model: Qwen 3.5 27B

  • 64 layers, 5120 hidden dim
  • 75% DeltaNet (linear attention) / 25% standard attention
  • Native 262K context

Hardware: B200 (192GB HBM3e)

  • 27B in bf16: ~54GB
  • Massive headroom

Optimizer: APOLLO-Mini

  • Full parameter finetuning
  • SGD-like memory (1/1024th of AdamW)
  • Parameter grouping for 3D conv1d weights

Stack: Crane (Candle-based, 21K lines)

Existing work:

  • Steering vector extraction (listening: layer 48, cosine 0.61)
  • Memory scoring infrastructure

Unique advantage: Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.

4. Integration Strategy


Tier 1: Immediate (High ROI, Low Risk)

1. Pause Tokens + UPFT Combination

  • Add 2-4 learnable tokens to embedding space
  • Train only on 8-token reasoning prefixes
  • Both work with existing architecture
  • 75% training time reduction
# Add 4 pause tokens to the embedding space, initialised at embedding scale
import torch
import torch.nn as nn

pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training (broadcast over the batch)
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: loss only on the first 8 reasoning tokens - offset past the
# 4 pause positions so logits and targets stay aligned
loss = loss_fn(logits[:, 4:12], targets[:, :8])

2. Random Prefix Validation

  • Compute Qwen 3.5 27B embedding RMS
  • Test 2-token random prefix at inference
  • Establish baseline before finetuning

Tier 2: After Baseline (Medium Effort)

3. COCONUT Curriculum

  • Stage 1: Fine-tune on CoT examples normally
  • Stage 2: Replace first reasoning step with continuous thought
  • Stage 3: Replace first 2 steps
  • Gradually move reasoning into latent space

4. Steering Vector Integration

  • Extract reasoning-specific directions (not just "listening")
  • Test combinations: prefix + layer-48 steering
  • Bake successful vectors into weights via APOLLO

Tier 3: Experimental

5. Multi-layer Steering

  • Our layers of interest: 40, 48, 56 (covering the attention layers)
  • Different vectors per layer
  • Careful scaling to avoid degradation

6. DeltaNet-Specific Optimization

  • The 75% DeltaNet architecture may respond differently
  • GDN recurrent state as "continuous thought" channel
  • This is unexplored territory - potential for novel findings

5. Implementation Details

Computing Embedding RMS

embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models

Pause Token Implementation in Crane

// In the model forward pass: prepend pause-token embeddings along the
// sequence dim (pause_tokens is expected to carry a batch dimension)
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}

UPFT Loss Modification

import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the first prefix_len tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )

6. Evaluation Plan

Benchmarks

| Benchmark | What It Tests | Baseline Needed |
| --- | --- | --- |
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |

Comparison Matrix

| Configuration | Training Time | Expected Gain |
| --- | --- | --- |
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |

7. Open Questions

  1. Does random perturbation scale to 27B? Tested on 4B - effect may differ
  2. Optimal token count for 27B? 2 optimal for 4B, might change
  3. DeltaNet interaction? 75% linear attention is untested territory
  4. Composition effects? Prefix + steering + pause tokens together?
  5. GDN as continuous thought channel? Novel research direction

8. Risk Assessment

| Risk | Mitigation |
| --- | --- |
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |

9. Timeline Estimate

| Phase | Duration | Deliverable |
| --- | --- | --- |
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |

10. References

Primary Sources

  • Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
  • Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
  • Pause Tokens: Google, "Think before you speak" (Oct 2023)
  • COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
  • UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
  • ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
  • Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
  • Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
  • Planning Tokens: ICLR 2024

Our Existing Work

  • steering-vector-empirical - listening vector extraction
  • skills-apollo-optimizer-qwen35-gotcha - APOLLO parameter grouping
  • qwen-3-5-27b-architecture-findings - model architecture details
  • training-pipeline-fused-inference-training-mar27 - training infrastructure

Research complete 2026-04-12. Ready for implementation.