agent: end-to-end gRPC Generate with delta-based session orchestration

Wires the client side of the new salience protocol so inference actually runs over gRPC instead of emitting the stubbed "not yet wired" error. Each turn walks the AST as interleaved chunks, sends only what's new to the server, and streams decode tokens back.

context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime, known_expanded_len }`. Preserves text/image/text ordering the wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt`: branches emit `<|im_start|>…<|im_end|>` tokens, image leaves emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached `token_ids`; `ContextState::commit_image_token_counts(&[u32])` fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new `StreamToken::ImageAppended` by committing the server's N into the AST before the final Generate's Token events stream in.

salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances it from the RPC response. New `generate(req)` opens the server-streaming RPC.

api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority, readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or opens fresh), skips chunks covered by `committed_len` (bails on mid-chunk straddle or unknown-length image in the committed prefix), walks the delta: accumulates Tokens into `pending`, on Image flushes pending via `flush_pending` (max_tokens=0 Generate that just prefills), then AppendImage + emits StreamToken::ImageAppended. Final Generate carries any trailing pending text as `append_tokens` and the sampling params; Token events stream out as StreamToken::Token, Done as StreamToken::Done. On success, handle with updated `committed_len` returns to the Mutex; on error, handle drops and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant, emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers `[prompt_len_after_append, u32::MAX)` when the caller provides a readout_shape, so decode positions stream their readouts.

agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant prologue merged into the trailing Tokens chunk. Caller in `turn` passes chunks + readout_shape (pulled from `agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens`: dead.

mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes the repeated-manifest-fetch bug caused by each subagent's `ApiClient::new` having its own OnceCell. The client's Arc-wrapped manifest cache is now shared across every agent Mind spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base client and overrides `.model` for the resolved backend instead of constructing fresh. All three callers (`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into `Unconscious::new`.

subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new `stream_session_mm` signature. Ephemeral session opens on each call, tears down at scope end. No readouts requested.

Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to `/v1/score` over HTTP. Will migrate once we have time to verify the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the `compare.test_backend` (which is intentionally a different endpoint); left alone.
* Readout manifest still fetched via HTTP at Agent::new. Migration to GetReadoutManifest gRPC is a separate cleanup.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
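
The shared-client fix in the "mind + unconscious" notes can be illustrated with a minimal sketch. This is not the code from the commit: the `Manifest` type, the `with_model` helper, and the `fetches` counter are hypothetical, and `std::sync::OnceLock` stands in for whichever OnceCell the real `ApiClient` uses. It only shows why cloning a client that holds an `Arc`-wrapped cell avoids repeated fetches, whereas constructing a fresh client does not.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for the readout manifest.
#[derive(Debug)]
struct Manifest {
    layers: Vec<u32>,
}

/// Hypothetical client: only the fields needed to show the sharing pattern.
#[derive(Clone)]
struct ApiClient {
    model: String,
    /// Cloning the client clones the Arc, so all clones share one cell.
    manifest: Arc<OnceLock<Manifest>>,
    /// Illustration only: counts how often the "fetch" actually runs.
    fetches: Arc<AtomicUsize>,
}

impl ApiClient {
    fn new(model: &str) -> Self {
        ApiClient {
            model: model.to_string(),
            manifest: Arc::new(OnceLock::new()),
            fetches: Arc::new(AtomicUsize::new(0)),
        }
    }

    /// Clone the base client and override only the model
    /// (the pattern the commit message describes for `prepare_spawn`).
    fn with_model(&self, model: &str) -> Self {
        let mut client = self.clone();
        client.model = model.to_string();
        client
    }

    /// Fetch the manifest at most once per shared cell.
    fn manifest(&self) -> &Manifest {
        self.manifest.get_or_init(|| {
            self.fetches.fetch_add(1, Ordering::SeqCst);
            // A real client would perform a network request here.
            Manifest { layers: vec![0, 1, 2] }
        })
    }
}
```

Under these assumptions, `base.with_model("subagent")` and `base` both resolve `manifest()` from the same cell, so the fetch closure runs once in total, while a separate `ApiClient::new(..)` starts with an empty cell of its own.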
2026-04-24 12:27:55 -04:00 · 2026-04-24 12:27:55 -04:00 · 8d9c9e9f7b
commit 8d9c9e9f7b
parent 08213f9093
7 changed files with 536 additions and 60 deletions
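
As a reading aid for the diff, the delta walk that the commit message attributes to `run_session_generate` can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual implementation: the field types on `WireChunk` are guessed, and `Op`, `skip_committed`, and `plan_delta` are hypothetical names that turn the chunk list into a list of planned calls instead of issuing RPCs.

```rust
/// Assumed shape of the commit's `WireChunk`; the field types are guesses.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum WireChunk {
    Tokens(Vec<u32>),
    Image { bytes: Vec<u8>, mime: String, known_expanded_len: Option<u32> },
}

/// Hypothetical plan entries standing in for the real RPC calls.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Op {
    /// Generate with max_tokens = 0: prefill only.
    Prefill(Vec<u32>),
    /// AppendImage, followed by emitting `StreamToken::ImageAppended`.
    AppendImage { mime: String },
    /// Final Generate carrying trailing text as `append_tokens`.
    Generate(Vec<u32>),
}

/// Drop the leading chunks already covered by `committed_len`.
/// Errors on a mid-chunk straddle or an unknown-length image in the prefix.
fn skip_committed(chunks: &[WireChunk], committed_len: usize) -> Result<&[WireChunk], String> {
    let mut covered = 0usize;
    for (i, chunk) in chunks.iter().enumerate() {
        if covered == committed_len {
            return Ok(&chunks[i..]);
        }
        let len = match chunk {
            WireChunk::Tokens(t) => t.len(),
            WireChunk::Image { known_expanded_len: Some(n), .. } => *n as usize,
            WireChunk::Image { known_expanded_len: None, .. } => {
                return Err("unknown-length image in committed prefix".to_string());
            }
        };
        if covered + len > committed_len {
            return Err("committed_len falls mid-chunk".to_string());
        }
        covered += len;
    }
    if covered == committed_len {
        Ok(&chunks[chunks.len()..])
    } else {
        Err("committed_len exceeds the prompt".to_string())
    }
}

/// Walk the delta: buffer text, flush it before each image, and finish
/// with a single Generate that carries whatever text is still pending.
fn plan_delta(delta: &[WireChunk]) -> Vec<Op> {
    let mut ops = Vec::new();
    let mut pending: Vec<u32> = Vec::new();
    for chunk in delta {
        match chunk {
            WireChunk::Tokens(t) => pending.extend_from_slice(t),
            WireChunk::Image { mime, .. } => {
                if !pending.is_empty() {
                    ops.push(Op::Prefill(std::mem::take(&mut pending)));
                }
                ops.push(Op::AppendImage { mime: mime.clone() });
            }
        }
    }
    ops.push(Op::Generate(pending));
    ops
}
```

In this model, a prompt of text, image, text with nothing committed plans a prefill, an image append, and a final generate. With the leading text already committed, only the image append and the final generate remain. Whether the real code skips the prefill when `pending` is empty is an assumption of this sketch.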
--- a/src/subconscious/generate.rs
+++ b/src/subconscious/generate.rs
@@ -7,7 +7,7 @@
 use std::sync::Arc;

 use crate::agent::api::{ApiClient, SamplingParams, StreamToken};
-use crate::agent::context::{AstNode, ContextState};
+use crate::agent::context::{AstNode, ContextState, WireChunk};
 use crate::agent::tokenizer;

 /// Generate an assistant continuation from the context up to `entry_idx`,
@@ -26,10 +26,18 @@ pub async fn gen_continuation<F>(
 ) -> anyhow::Result<String>
 where F: FnMut(&AstNode) -> bool,
 {
-    let (mut prompt, images, _) = context.wire_prompt(0..entry_idx, skip);
+    let mut chunks = context.wire_chunks(0..entry_idx, skip);

-    prompt.push(tokenizer::IM_START);
-    prompt.extend(tokenizer::encode("assistant\n"));
+    // Assistant-turn prologue.
+    let prologue = {
+        let mut t = vec![tokenizer::IM_START];
+        t.extend(tokenizer::encode("assistant\n"));
+        t
+    };
+    match chunks.last_mut() {
+        Some(WireChunk::Tokens(last)) => last.extend(prologue),
+        _ => chunks.push(WireChunk::Tokens(prologue)),
+    }

    let sampling = SamplingParams {
        temperature: 0.6,
@@ -41,13 +49,19 @@ where F: FnMut(&AstNode) -> bool,
    // `_guard` drops at function end.
    let session_lock = Arc::new(crate::Mutex::new(None));
    let (mut rx, _guard) = client.stream_session_mm(
-        session_lock, &prompt, &images, sampling, Some(-5),
+        session_lock, chunks, sampling, Some(-5), None,
    );

    let mut tokens = Vec::new();
    while let Some(tok) = rx.recv().await {
        match tok {
            StreamToken::Token { id, .. } => tokens.push(id),
+            StreamToken::ImageAppended { .. } => {
+                // subconscious/generate uses wire_chunks over an AST
+                // slice that shouldn't have unsized images — but if
+                // it ever does, we just don't care about updating the
+                // ephemeral session's AST view.
+            }
            StreamToken::Done { .. } => break,
            StreamToken::Error(e) => anyhow::bail!("generation error: {}", e),
        }