agent: end-to-end gRPC Generate with delta-based session orchestration

Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.

context.rs:
  * `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
    known_expanded_len }`. Preserves the text/image/text ordering,
    which the session wire path cannot flatten into one token list.
  * `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
    branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
    emit a single Image chunk (no inline vision tokens).
  * `NodeLeaf::set_image_token_count(n)` + recompute of cached
    `token_ids`; `ContextState::commit_image_token_counts(&[u32])`
    fills in the first-N zero-count image leaves in wire order.
  * `ResponseParser::run` handles the new
    `StreamToken::ImageAppended` by committing the server's N into
    the AST before the final Generate's Token events stream in.

salience.rs:
  * `SessionHandle` tracks `committed_len`. `append_image` advances
    it from the RPC response. New `generate(req)` opens the
    server-streaming RPC.

api/mod.rs:
  * `stream_session_mm(session_lock, chunks, sampling, priority,
    readout_shape)` replaces the stub. Spawns `run_session_generate`.
  * `run_session_generate`: takes the session out of the Mutex (or
    opens a fresh one) and skips chunks already covered by
    `committed_len`, bailing on a mid-chunk straddle or an
    unknown-length image in the committed prefix. It then walks the
    delta: Tokens accumulate into `pending`; on an Image it flushes
    `pending` via `flush_pending` (a Generate with max_tokens=0 that
    only prefills), calls AppendImage, and emits
    StreamToken::ImageAppended. The final Generate carries any
    trailing pending text as `append_tokens` plus the sampling
    params; Token events stream out as StreamToken::Token and Done
    as StreamToken::Done. On success the handle, with its updated
    `committed_len`, returns to the Mutex; on error the handle is
    dropped and the next call reopens the session.
  * `StreamToken::ImageAppended { placeholder_count }` variant —
    emitted in wire order before the final Generate's tokens.
  * Prefix-cache cap for readout coverage: `readout_ranges` covers
    `[prompt_len_after_append, u32::MAX)` when the caller provides
    a readout_shape, so decode positions stream their readouts.
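
The prefix skip in `run_session_generate` can be sketched roughly as
below. This is a simplified model, not the code in api/mod.rs: the
`WireChunk` here drops the image payload, and `SkipError` and
`first_uncommitted` are hypothetical names introduced for illustration.
It assumes `committed_len` counts server-visible tokens and that a
`known_expanded_len` of 0 means "unknown", as the context.rs notes say.

```rust
/// Simplified stand-in for the commit's `WireChunk`; the image payload is
/// reduced to its length field (assumption made for brevity).
#[derive(Clone, Debug, PartialEq)]
pub enum WireChunk {
    Tokens(Vec<u32>),
    Image { known_expanded_len: u32 },
}

/// Hypothetical error type: why the committed prefix could not be skipped.
#[derive(Debug, PartialEq)]
pub enum SkipError {
    /// `committed_len` ends inside this chunk instead of on a boundary.
    Straddle { chunk_index: usize },
    /// An image inside the committed prefix has no known expanded length.
    UnknownImageLen { chunk_index: usize },
    /// The chunks add up to fewer tokens than the server has committed.
    PrefixTooShort,
}

/// Return the index of the first chunk the server has not seen yet.
pub fn first_uncommitted(
    chunks: &[WireChunk],
    committed_len: u32,
) -> Result<usize, SkipError> {
    let target = committed_len as u64;
    let mut covered: u64 = 0;
    for (i, chunk) in chunks.iter().enumerate() {
        // Landed exactly on a chunk boundary: everything from here is new.
        if covered == target {
            return Ok(i);
        }
        let len = match chunk {
            WireChunk::Tokens(t) => t.len() as u64,
            WireChunk::Image { known_expanded_len: 0 } => {
                return Err(SkipError::UnknownImageLen { chunk_index: i });
            }
            WireChunk::Image { known_expanded_len } => *known_expanded_len as u64,
        };
        if covered + len > target {
            return Err(SkipError::Straddle { chunk_index: i });
        }
        covered += len;
    }
    if covered == target {
        Ok(chunks.len())
    } else {
        Err(SkipError::PrefixTooShort)
    }
}
```

With chunks of 3 tokens, a 6-token image, and 2 tokens, a
`committed_len` of 9 resumes at index 2, while 5 is reported as a
straddle inside the image chunk. Returning an error instead of guessing
matches the "bails" wording above; how the real code reports it is not
shown in this diff.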

agent/mod.rs:
  * `assemble_prompt` returns `Vec<WireChunk>` with the assistant
    prologue merged into the trailing Tokens chunk. Caller in
    `turn` passes chunks + readout_shape (pulled from
    `agent.readout.lock().manifest`) to `stream_session_mm`.
  * Dropped `assemble_prompt_tokens` — dead.

mind + unconscious:
  * `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
    the repeated-manifest-fetch bug caused by each subagent's
    `ApiClient::new` having its own OnceCell. The client's Arc-
    wrapped manifest cache is now shared across every agent Mind
    spawns.
  * `prepare_spawn(name, auto, wake, base_client)` clones the base
    client and overrides `.model` for the resolved backend instead
    of constructing fresh. All three callers
    (`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
  * `Mind::new` passes `agent.client.clone()` into
    `Unconscious::new`.
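
Why cloning the base client fixes the repeated fetch can be shown with
a small model. This is a sketch under assumptions: `Client`,
`with_model`, `manifest` and the fetch counter are hypothetical
stand-ins for `ApiClient`, and it uses std's `OnceLock` where the real
client uses a OnceCell. The point is only that clones share one
`Arc`-wrapped cell, while each `new` creates its own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for `ApiClient`. The manifest cache sits behind
/// an `Arc`, so every clone points at the same cell.
#[derive(Clone)]
pub struct Client {
    pub model: String,
    manifest: Arc<OnceLock<String>>,
    /// Counts simulated fetches; present only to make sharing observable.
    fetches: Arc<AtomicUsize>,
}

impl Client {
    pub fn new(model: &str) -> Self {
        Client {
            model: model.to_string(),
            manifest: Arc::new(OnceLock::new()),
            fetches: Arc::new(AtomicUsize::new(0)),
        }
    }

    /// Clone the base client and override only the model, in the spirit
    /// of what the message describes for `prepare_spawn`.
    pub fn with_model(&self, model: &str) -> Self {
        let mut c = self.clone();
        c.model = model.to_string();
        c
    }

    /// Return the manifest, running the simulated fetch at most once per
    /// shared cell.
    pub fn manifest(&self) -> &str {
        self.manifest.get_or_init(|| {
            self.fetches.fetch_add(1, Ordering::SeqCst);
            "manifest-v1".to_string()
        })
    }

    pub fn fetch_count(&self) -> usize {
        self.fetches.load(Ordering::SeqCst)
    }
}
```

A client derived through `with_model` reuses the base's cached
manifest, so the fetch runs once across both. A second `Client::new`
starts with an empty cell and fetches again, which is the behaviour the
old per-subagent construction had.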

subconscious/generate.rs:
  * `gen_continuation` switched to `wire_chunks` + the new
    `stream_session_mm` signature. An ephemeral session opens on
    each call and is torn down at scope end. No readouts requested.

Not changed yet, noted for follow-up:
  * Subconscious ablation scoring in learn.rs still talks to
    `/v1/score` over HTTP. Will migrate once we have time to verify
    the Generate+max_tokens=0+prompt_logprobs path end-to-end.
  * compare.rs constructs its own ApiClient for the
    `compare.test_backend` (which is intentionally a different
    endpoint) — left alone.
  * Readout manifest still fetched via HTTP at Agent::new.
    Migration to GetReadoutManifest gRPC is a separate cleanup.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-24 12:27:55 -04:00
commit 8d9c9e9f7b
7 changed files with 536 additions and 60 deletions


@@ -312,6 +312,16 @@ impl NodeLeaf {
    pub fn token_ids(&self) -> &[u32] { &self.token_ids }
    pub fn tokens(&self) -> usize { self.token_ids.len() }
    pub fn timestamp(&self) -> DateTime<Utc> { self.timestamp }

    /// If this is an Image leaf, update its IMAGE_PAD count to `n` and
    /// recompute cached `token_ids`. No-op on non-Image leaves;
    /// callers know the body shape via `body()`.
    pub fn set_image_token_count(&mut self, n: u32) {
        if let NodeBody::Image { token_count, .. } = &mut self.body {
            *token_count = n;
            self.token_ids = self.body.compute_token_ids();
        }
    }
}

impl AstNode {
@@ -737,6 +747,15 @@ impl ResponseParser {
                        parser.finish(&mut ctx);
                        return Ok(());
                    }
                    super::api::StreamToken::ImageAppended { placeholder_count } => {
                        // Commit the server-authoritative IMAGE_PAD
                        // count into the first zero-count image leaf
                        // in wire order. AppendImage always runs
                        // before the final Generate, so this fires
                        // before any Token events for this stream.
                        let mut ctx = agent.context.lock().await;
                        ctx.commit_image_token_counts(&[placeholder_count]);
                    }
                    super::api::StreamToken::Error(e) => {
                        return Err(anyhow::anyhow!("{}", e));
                    }
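
The ordering the comment above relies on (every AppendImage precedes
the final Generate) follows from how the delta is turned into RPCs. A
minimal sketch of that planning step, assuming a flush is skipped when
`pending` is empty. `Op` and `plan` are hypothetical names for
illustration, and the real `run_session_generate` issues the calls
instead of returning a list.

```rust
/// Simplified stand-in for the commit's `WireChunk` (mime and length
/// fields omitted).
#[derive(Clone, Debug, PartialEq)]
pub enum WireChunk {
    Tokens(Vec<u32>),
    Image { bytes: Vec<u8> },
}

/// Hypothetical description of one RPC the client would issue.
#[derive(Debug, PartialEq)]
pub enum Op {
    /// Generate with max_tokens = 0: prefill only, no decode.
    Prefill { append_tokens: Vec<u32> },
    /// AppendImage; the client emits ImageAppended once the reply arrives.
    AppendImage { byte_len: usize },
    /// Final Generate: trailing text plus sampling params, then decode.
    FinalGenerate { append_tokens: Vec<u32> },
}

/// Turn the uncommitted chunks into an ordered list of RPCs.
pub fn plan(delta: &[WireChunk]) -> Vec<Op> {
    let mut ops = Vec::new();
    let mut pending: Vec<u32> = Vec::new();
    for chunk in delta {
        match chunk {
            // Text accumulates until an image forces a flush.
            WireChunk::Tokens(t) => pending.extend_from_slice(t),
            WireChunk::Image { bytes } => {
                if !pending.is_empty() {
                    ops.push(Op::Prefill {
                        append_tokens: std::mem::take(&mut pending),
                    });
                }
                ops.push(Op::AppendImage { byte_len: bytes.len() });
            }
        }
    }
    // Exactly one decode-capable Generate, always last.
    ops.push(Op::FinalGenerate { append_tokens: pending });
    ops
}
```

For text, image, text this yields a prefill, an AppendImage, and a
final Generate carrying the trailing text, so any ImageAppended event
reaches the parser before the first Token.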
@@ -866,6 +885,36 @@ impl ContextState {
    pub fn sections(&self) -> [&Vec<AstNode>; 4] {
        [&self.system, &self.identity, &self.journal, &self.conversation]
    }

    /// Walk image leaves across all sections in wire order and fill in
    /// the first N leaves that have `token_count == 0` with successive
    /// values from `counts`. Used after a gRPC session's stream of
    /// AppendImage responses to commit the server's IMAGE_PAD counts
    /// back into the AST so the next wire walk doesn't see zero-count
    /// images in the already-committed prefix.
    pub fn commit_image_token_counts(&mut self, counts: &[u32]) {
        fn visit(node: &mut AstNode, counts: &[u32], idx: &mut usize) {
            if *idx >= counts.len() { return; }
            match node {
                AstNode::Leaf(leaf) => {
                    if let NodeBody::Image { token_count, .. } = leaf.body() {
                        if *token_count == 0 {
                            leaf.set_image_token_count(counts[*idx]);
                            *idx += 1;
                        }
                    }
                }
                AstNode::Branch { children, .. } => {
                    for c in children { visit(c, counts, idx); }
                }
            }
        }
        let mut idx = 0usize;
        for node in &mut self.system { visit(node, counts, &mut idx); }
        for node in &mut self.identity { visit(node, counts, &mut idx); }
        for node in &mut self.journal { visit(node, counts, &mut idx); }
        for node in &mut self.conversation { visit(node, counts, &mut idx); }
    }
}

impl Ast for ContextState {
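
The fill-first-N-zeros walk in `commit_image_token_counts` can be tried
in isolation on a toy tree. This is a sketch, not the project's types:
`Node` and `commit_counts` are hypothetical, and returning the number
of counts consumed is an addition for testing (the real method returns
nothing).

```rust
/// Toy AST standing in for `AstNode` / `NodeBody` (assumption).
#[derive(Debug, PartialEq)]
pub enum Node {
    Text(Vec<u32>),
    Image { token_count: u32 },
    Branch(Vec<Node>),
}

/// Fill zero-count images depth-first with successive values from
/// `counts`; returns how many counts were consumed.
pub fn commit_counts(roots: &mut [Node], counts: &[u32]) -> usize {
    fn visit(node: &mut Node, counts: &[u32], idx: &mut usize) {
        // Stop as soon as every supplied count has been placed.
        if *idx >= counts.len() {
            return;
        }
        match node {
            Node::Image { token_count } => {
                // Images that already have a count are left alone.
                if *token_count == 0 {
                    *token_count = counts[*idx];
                    *idx += 1;
                }
            }
            Node::Branch(children) => {
                for c in children.iter_mut() {
                    visit(c, counts, idx);
                }
            }
            Node::Text(_) => {}
        }
    }
    let mut idx = 0usize;
    for n in roots.iter_mut() {
        visit(n, counts, &mut idx);
    }
    idx
}
```

Because already-counted images are skipped, calling this once per
ImageAppended event with a one-element slice fills images in the same
order the server appended them, provided the walk order matches the
wire order.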
@@ -909,6 +958,28 @@ pub struct WireImage {
    pub mime: String,
}

/// One piece of the wire stream for the gRPC session path. Runs of
/// text/tool/thinking tokens are batched into `Tokens`; each Image
/// leaf becomes its own `Image` chunk because the server writes the
/// full vision block on AppendImage; the client never sends vision
/// tokens inline. Order matches the AST's depth-first wire order.
#[derive(Clone)]
pub enum WireChunk {
    Tokens(Vec<u32>),
    Image {
        bytes: Vec<u8>,
        mime: String,
        /// Client's current best guess at how many tokens the server
        /// will expand this image to, including bookends. `0` means
        /// the count is unknown (view_image just loaded the image and
        /// AppendImage hasn't run yet). Callers use this only to know
        /// this chunk's contribution to the server-visible length for
        /// offset bookkeeping on chunks that were already appended on
        /// a prior turn.
        known_expanded_len: u32,
    },
}

fn wire_into(node: &AstNode, tokens: &mut Vec<u32>, images: &mut Vec<WireImage>) {
    match node {
        AstNode::Leaf(leaf) => match leaf.body() {
@@ -1045,6 +1116,80 @@ impl ContextState {
        }
        (tokens, images, assistant_ranges)
    }

    /// Build the wire stream as interleaved `WireChunk`s for the gRPC
    /// session path. Unlike `wire_prompt`, this preserves the order
    /// of text runs vs image blocks so the caller can drive the
    /// append flow (AppendImage for each Image, Generate append for
    /// contiguous text runs).
    ///
    /// `conv_range` and `skip` mirror `wire_prompt`: select a
    /// conversation slice and drop identity / conversation nodes by
    /// predicate.
    pub fn wire_chunks<F>(
        &self,
        conv_range: std::ops::Range<usize>,
        mut skip: F,
    ) -> Vec<WireChunk>
    where F: FnMut(&AstNode) -> bool,
    {
        let mut out: Vec<WireChunk> = Vec::new();
        let mut buf: Vec<u32> = Vec::new();

        fn flush(buf: &mut Vec<u32>, out: &mut Vec<WireChunk>) {
            if !buf.is_empty() {
                out.push(WireChunk::Tokens(std::mem::take(buf)));
            }
        }

        fn visit(node: &AstNode, buf: &mut Vec<u32>, out: &mut Vec<WireChunk>) {
            match node {
                AstNode::Leaf(leaf) => match leaf.body() {
                    NodeBody::Image { bytes, mime, token_count, .. } => {
                        flush(buf, out);
                        // Bookends (VISION_START + VISION_END) add 2
                        // to the expanded length; token_count is the
                        // IMAGE_PAD run. 0 means count is still
                        // unknown (no AppendImage yet); don't claim
                        // a length the server will disagree with.
                        let expanded = if *token_count == 0 {
                            0
                        } else {
                            *token_count + 2
                        };
                        out.push(WireChunk::Image {
                            bytes: bytes.clone(),
                            mime: mime.clone(),
                            known_expanded_len: expanded,
                        });
                    }
                    _ => buf.extend_from_slice(leaf.token_ids()),
                },
                AstNode::Branch { role, children, .. } => {
                    buf.push(tokenizer::IM_START);
                    buf.extend(tokenizer::encode(&format!("{}\n", role.as_str())));
                    for c in children {
                        visit(c, buf, out);
                    }
                    buf.push(tokenizer::IM_END);
                    buf.extend(tokenizer::encode("\n"));
                }
            }
        }

        for node in self.system() { visit(node, &mut buf, &mut out); }
        for node in self.identity() {
            if skip(node) { continue; }
            visit(node, &mut buf, &mut out);
        }
        for node in self.journal() { visit(node, &mut buf, &mut out); }
        for node in &self.conversation()[conv_range] {
            if skip(node) { continue; }
            visit(node, &mut buf, &mut out);
        }
        flush(&mut buf, &mut out);
        out
    }
}

impl ContextState {
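
To see what `wire_chunks` produces without the tokenizer or the real
AST, the same buffer-and-flush walk can be run on a toy tree. This is a
sketch under assumptions: `Node`, `Chunk` and `chunks` are hypothetical,
branch framing is passed in as literal `open` / `close` token vectors
instead of being encoded from the role, and the image payload is
dropped.

```rust
/// Toy counterpart of `WireChunk` without the image payload (assumption).
#[derive(Clone, Debug, PartialEq)]
pub enum Chunk {
    Tokens(Vec<u32>),
    Image { known_expanded_len: u32 },
}

/// Toy AST; `open` / `close` stand in for the IM_START + role and
/// IM_END + newline framing the real walker encodes.
pub enum Node {
    Text(Vec<u32>),
    Image { token_count: u32 },
    Branch { open: Vec<u32>, close: Vec<u32>, children: Vec<Node> },
}

/// Batch text into `Tokens` runs and split them at every image.
pub fn chunks(roots: &[Node]) -> Vec<Chunk> {
    fn flush(buf: &mut Vec<u32>, out: &mut Vec<Chunk>) {
        if !buf.is_empty() {
            out.push(Chunk::Tokens(std::mem::take(buf)));
        }
    }
    fn visit(node: &Node, buf: &mut Vec<u32>, out: &mut Vec<Chunk>) {
        match node {
            Node::Text(t) => buf.extend_from_slice(t),
            Node::Image { token_count } => {
                // Close the current text run before the image.
                flush(buf, out);
                // Two bookend tokens surround the pad run; 0 stays 0
                // to signal "unknown".
                let len = if *token_count == 0 { 0 } else { *token_count + 2 };
                out.push(Chunk::Image { known_expanded_len: len });
            }
            Node::Branch { open, close, children } => {
                buf.extend_from_slice(open);
                for c in children {
                    visit(c, buf, out);
                }
                buf.extend_from_slice(close);
            }
        }
    }
    let mut out = Vec::new();
    let mut buf = Vec::new();
    for n in roots {
        visit(n, &mut buf, &mut out);
    }
    flush(&mut buf, &mut out);
    out
}
```

Note that a text run can span a branch boundary: the closing tokens of
one branch and the opening tokens of the next land in the same `Tokens`
chunk unless an image sits between them.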