Two research documents:
- latent-reasoning-integration-plan.md: synthesizes 10+ papers on latent reasoning, identifies which approaches work with finetuning (vs. requiring pretraining from scratch), and maps them to our APOLLO-Mini training pipeline.
- pause-tokens-gdn-recurrence.md: explores the connection between token-based latent reasoning and GDN's internal recurrence. Key insight: pause tokens on Qwen 3.5 trigger both forward passes and recurrent state updates, a double benefit.
Latent Reasoning Integration Plan for Qwen 3.5 27B
Status: Research complete, ready for implementation
Date: 2026-04-12
Hardware: B200 (192GB HBM3e), APOLLO-Mini optimizer
Executive Summary
Recent research shows multiple approaches to improving LLM reasoning through latent space manipulation. This document synthesizes findings from 10+ papers and maps them to our Qwen 3.5 27B full finetuning pipeline. The key insight: some approaches require pretraining from scratch (skip those), while others can be layered onto existing models during finetuning (prioritize those).
1. The Landscape
Approaches That Require Pretraining (Not Applicable)
| Technique | Why Not |
|---|---|
| Huginn/Recurrent Depth (Geiping 2025) | Requires architectural changes from scratch |
| Ouro/LoopLM (ByteDance 2025) | Needs weight-tied looped architecture |
| Quiet-STaR (Stanford 2024) | Heavy continued pretraining overhead |
Approaches Compatible with Finetuning (Our Focus)
| Technique | Overhead | Training Required | Proven On |
|---|---|---|---|
| Random Prefix Perturbation | 2 tokens | None (inference) | Qwen3-4B |
| Pause/Planning Tokens | 2-4 tokens | Yes | 1B models |
| COCONUT Curriculum | Variable | Yes (staged) | General |
| ActAdd Steering Vectors | 1 vector/layer | None (inference) | LLaMA, OPT |
| UPFT (Prefix Fine-Tuning) | 8 tokens | Yes (minimal) | General |
2. Detailed Technique Analysis
2.1 Random Prefix Perturbation (dl1683)
Mechanism: Prepend 2 random embedding-scale tokens before input. Breaks attention sink patterns, shifts model into "exploratory computation mode."
Results:
- Qwen3-4B arithmetic: 32% → 51.6% (+19.6pp)
- 100% oracle coverage on 25/25 tasks
- Planning: rescues 14-word failures into 650+ word plans
Why it works: the first few tokens accumulate disproportionate attention (Xiao et al., 2023). Under greedy decoding, degenerate patterns lock in; the perturbation breaks this.
Integration: Zero training required. Test at inference first, then consider training WITH random prefixes to internalize the exploration behavior.
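Since this needs no training, it can be validated immediately at inference. A minimal sketch (the helper name is ours, not from the dl1683 repo): sample Gaussian noise scaled to the embedding matrix's RMS and prepend it to the prompt embeddings.

```python
import torch

def make_random_prefix(embed_weight: torch.Tensor, n_tokens: int = 2) -> torch.Tensor:
    """Sample random prefix 'tokens' scaled to the embedding RMS.

    Matching the RMS keeps the noise in-distribution for the first
    transformer layer while still breaking the attention-sink pattern.
    """
    embed_rms = embed_weight.float().square().mean().sqrt()
    return torch.randn(n_tokens, embed_weight.shape[-1]) * embed_rms

# Usage: prepend to the prompt embeddings before the forward pass, e.g.
# inputs_embeds = torch.cat(
#     [prefix.unsqueeze(0).expand(batch, -1, -1), text_embeds], dim=1)
```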
2.2 Pause Tokens (Google, Oct 2023)
Mechanism: Add learnable pause tokens to embedding space. Model processes extra hidden vectors before committing to output.
Results (1B model):
- SQuAD: +18% EM score
- CommonSenseQA: +8%
- GSM8K: +1%
Critical requirement: the model MUST be both pretrained AND finetuned with pause tokens. Inference-time delays alone, without training, do not help.
Integration: Add 2-4 learnable tokens to Qwen's embedding matrix, finetune with them prepended to reasoning prompts. Simple architectural change.
2.3 COCONUT - Chain of Continuous Thought (Meta, Dec 2024)
Mechanism: Feed last hidden state back as next input embedding directly (no decoding to tokens). Enables breadth-first search reasoning.
Why it matters: Continuous thoughts can encode multiple alternative next steps simultaneously. Avoids premature commitment to single path.
Training approach:
- Initial stage: train on regular CoT examples
- Subsequent stages: replace first k reasoning steps with k×c continuous thoughts
- c is hyperparameter controlling latent thought expansion
Integration: Most promising for Qwen 3.5 - curriculum approach from CoT → latent reasoning.
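The feedback loop can be sketched as follows. This assumes a callable that maps input embeddings to last-layer hidden states; `coconut_rollout` is an illustrative name, not Meta's code.

```python
import torch

def coconut_rollout(model, inputs_embeds: torch.Tensor, n_thoughts: int) -> torch.Tensor:
    """Append n continuous thoughts: each step feeds the last position's
    hidden state back in as the next input embedding, skipping decoding.

    model: maps (batch, seq, dim) input embeddings -> (batch, seq, dim)
           last-layer hidden states.
    """
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds)
        thought = hidden[:, -1:, :]  # last position's hidden state only
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
    return inputs_embeds
```

During curriculum stage k, the first k reasoning steps in the training data would be replaced by k×c such latent thoughts.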
2.4 UPFT - Unsupervised Prefix Fine-Tuning (Mar 2025)
Mechanism: Train ONLY on initial prefix substrings (as few as 8 tokens). Exploits "Prefix Self-Consistency" - shared initial reasoning steps across diverse solutions.
Results:
- Matches Rejection Sampling Fine-Tuning performance
- 75% reduction in training time
- 99% reduction in sampling cost
Integration: DIRECTLY APPLICABLE. Train only on reasoning prefix tokens. Massive efficiency gain with APOLLO-Mini.
2.5 ActAdd / Activation Engineering (Turner et al., 2023)
Mechanism: Compute steering vector by contrasting intermediate activations on prompt pairs. Add during forward pass.
Results: SOTA on sentiment shift and detoxification.
Our existing work: "Listening" vector at layer 48, magnitude 57, cosine consistency 0.61.
Integration: Prototype behaviors with steering vectors, then train permanently into weights. Steering vector as specification → APOLLO training as compilation.
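The pattern in PyTorch terms (function names are ours; a Crane port would modify the layer-48 output the same way): extract a vector by contrasting activations on a prompt pair, then add it during the forward pass via a hook.

```python
import torch
import torch.nn as nn

def extract_steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """ActAdd-style vector: activation difference for a contrast prompt
    pair at one layer. acts_*: (seq, dim) hidden states; we contrast
    the final position."""
    return acts_pos[-1] - acts_neg[-1]

def add_steering_hook(layer: nn.Module, steer: torch.Tensor, coeff: float):
    """Register a forward hook that adds coeff * steer to the layer output."""
    def hook(module, inputs, output):
        return output + coeff * steer
    return layer.register_forward_hook(hook)
```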
2.6 Planning Tokens (ICLR 2024)
Mechanism: Learnable token embeddings added before each reasoning step. <0.001% additional parameters.
Integration: Add to embedding matrix, train end-to-end with APOLLO.
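A hedged sketch of the per-step insertion (class and interface are illustrative, not the ICLR 2024 reference implementation): one learnable embedding row, prepended to each reasoning step's embeddings.

```python
import torch
import torch.nn as nn

class PlanningToken(nn.Module):
    """A learnable planning embedding inserted before each reasoning step.

    A single (1, embed_dim) parameter is a negligible fraction of a 27B
    model, consistent with the paper's <0.001% overhead claim.
    """
    def __init__(self, embed_dim: int, embed_rms: float = 0.02):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(1, embed_dim) * embed_rms)

    def forward(self, step_embeds: list) -> torch.Tensor:
        """step_embeds: list of (step_len, dim) tensors, one per reasoning
        step; returns the full sequence with the token before each step."""
        return torch.cat(
            [torch.cat([self.embed, s], dim=0) for s in step_embeds], dim=0)
```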
3. Our Setup
Model: Qwen 3.5 27B
- 64 layers, 5120 hidden dim
- 75% DeltaNet (linear attention) / 25% standard attention
- Native 262K context
Hardware: B200 (192GB HBM3e)
- 27B in bf16: ~54GB
- Massive headroom
Optimizer: APOLLO-Mini
- Full parameter finetuning
- SGD-like memory (1/1024th of AdamW)
- Parameter grouping for 3D conv1d weights
Stack: Crane (Candle-based, 21K lines)
Existing work:
- Steering vector extraction (listening: layer 48, cosine 0.61)
- Memory scoring infrastructure
Unique advantage: Qwen 3.5's GDN (Gated DeltaNet) layers provide natural infrastructure for continuous thought propagation. The recurrent GDN state is already "latent reasoning" infrastructure waiting to be leveraged.
4. Recommended Implementation Order
Tier 1: Immediate (High ROI, Low Risk)
1. Pause Tokens + UPFT Combination
- Add 2-4 learnable tokens to embedding space
- Train only on 8-token reasoning prefixes
- Both work with existing architecture
- 75% training time reduction
```python
import torch
import torch.nn as nn

# Learnable pause tokens, initialized at embedding scale
pause_tokens = nn.Parameter(torch.randn(4, embed_dim) * embed_rms)

# Prepend to reasoning inputs during training
# (expand prepends the batch dimension to the (4, embed_dim) parameter)
inputs_embeds = torch.cat([pause_tokens.expand(batch, -1, -1), text_embeds], dim=1)

# UPFT: only compute loss on the first 8 tokens of the reasoning prefix
loss = loss_fn(logits[:, :8], targets[:, :8])
```
2. Random Prefix Validation
- Compute Qwen 3.5 27B embedding RMS
- Test 2-token random prefix at inference
- Establish baseline before finetuning
Tier 2: After Baseline (Medium Effort)
3. COCONUT Curriculum
- Stage 1: Fine-tune on CoT examples normally
- Stage 2: Replace first reasoning step with continuous thought
- Stage 3: Replace first 2 steps
- Gradually move reasoning into latent space
4. Steering Vector Integration
- Extract reasoning-specific directions (not just "listening")
- Test combinations: prefix + layer-48 steering
- Bake successful vectors into weights via APOLLO
Tier 3: Experimental
5. Multi-layer Steering
- Our layers of interest: 40, 48, 56 (covering the attention layers)
- Different vectors per layer
- Careful scaling to avoid degradation
6. DeltaNet-Specific Optimization
- The 75% DeltaNet architecture may respond differently
- GDN recurrent state as "continuous thought" channel
- This is unexplored territory - potential for novel findings
5. Implementation Details
Computing Embedding RMS
```python
embed_weight = model.get_input_embeddings().weight
embed_rms = embed_weight.float().square().mean().sqrt().item()
# Expected: ~0.02-0.03 range for Qwen models
```
Pause Token Implementation in Crane
```rust
// In model forward pass
fn forward_with_pause(&self, input_ids: &Tensor, pause_tokens: &Tensor) -> Result<Tensor> {
    let text_embeds = self.embed_tokens.forward(input_ids)?;
    let combined = Tensor::cat(&[pause_tokens, &text_embeds], 1)?;
    self.transformer.forward(&combined)
}
```
UPFT Loss Modification
```python
import torch.nn.functional as F

# Standard: loss over all tokens
# UPFT: loss only over the first prefix_len tokens
def upft_loss(logits, targets, prefix_len=8):
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits[:, :prefix_len].reshape(-1, vocab_size),
        targets[:, :prefix_len].reshape(-1),
    )
```
6. Evaluation Plan
Benchmarks
| Benchmark | What It Tests | Baseline Needed |
|---|---|---|
| GSM8K | Arithmetic reasoning | Yes |
| ARC-Challenge | Science reasoning | Yes |
| CommonSenseQA | Commonsense | Yes |
| HumanEval | Code generation | Yes |
| Planning tasks (dl1683) | Multi-step planning | Yes |
Comparison Matrix
| Configuration | Training Time | Expected Gain |
|---|---|---|
| Baseline (no prefix) | 1x | 0% |
| Random prefix (inference) | 1x | +10-20%? |
| Pause tokens (trained) | 1.1x | +8-18% |
| UPFT only | 0.25x | Match baseline |
| Pause + UPFT | 0.3x | +8-18% |
| COCONUT curriculum | 2x | +15-25%? |
7. Open Questions
- Does random perturbation scale to 27B? Tested on 4B - effect may differ
- Optimal token count for 27B? 2 optimal for 4B, might change
- DeltaNet interaction? 75% linear attention is untested territory
- Composition effects? Prefix + steering + pause tokens together?
- GDN as continuous thought channel? Novel research direction
8. Risk Assessment
| Risk | Mitigation |
|---|---|
| No improvement at 27B scale | Start with inference-time validation |
| Training instability with pause tokens | Start with 2 tokens, scale up |
| UPFT doesn't transfer | Fall back to full token loss |
| DeltaNet behaves differently | Ablate on attention-only layers first |
9. Timeline Estimate
| Phase | Duration | Deliverable |
|---|---|---|
| Embedding RMS + baseline | 1 day | Numbers |
| Random prefix validation | 1 day | Inference results |
| Pause token implementation | 2 days | Crane modification |
| UPFT integration | 1 day | Training loop change |
| First finetuning run | 2-3 days | Trained model |
| Evaluation | 1 day | Benchmark numbers |
| COCONUT curriculum | 1 week | Staged training |
10. References
Primary Sources
- Random Prefix: https://github.com/dl1683/Latent-Space-Reasoning
- Attention Sinks: Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (Sept 2023)
- Pause Tokens: Google, "Think before you speak" (Oct 2023)
- COCONUT: Meta, "Training Large Language Models to Reason in a Continuous Latent Space" (Dec 2024)
- UPFT: "Prefix Self-Consistency for Unsupervised Fine-Tuning" (Mar 2025)
- ActAdd: Turner et al., "Activation Addition: Steering Language Models Without Optimization" (Aug 2023)
- Recurrent Depth: Geiping et al., "Scaling up Test-Time Compute with Latent Reasoning" (Feb 2025)
- Ouro: ByteDance, "Ouro: Scaling Reasoning with Latent Thoughts" (2025)
- Planning Tokens: ICLR 2024
Our Existing Work
- steering-vector-empirical - listening vector extraction
- skills-apollo-optimizer-qwen35-gotcha - APOLLO parameter grouping
- qwen-3-5-27b-architecture-findings - model architecture details
- training-pipeline-fused-inference-training-mar27 - training infrastructure
Research complete 2026-04-12. Ready for implementation.