salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
// agent/api/salience.rs — gRPC client bindings for salience.v1.
|
|
|
|
|
//
|
|
|
|
|
// Thin wrapper around the tonic-generated types. Every RPC except
|
|
|
|
|
// Generate is unary; Generate is server-streaming. Free functions
|
|
|
|
|
// (open/close session) wrap the lifecycle RPCs; `SessionHandle` just
|
|
|
|
|
// carries the id + connection params so later RPCs can reuse them.
|
|
|
|
|
//
|
|
|
|
|
// The old bidi Session() API is gone — see git history for its shape.
|
|
|
|
|
|
|
|
|
|
#![allow(clippy::enum_variant_names)]
|
|
|
|
|
|
|
|
|
|
use anyhow::{Context, Result};
|
|
|
|
|
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Endpoint};
|
|
|
|
|
|
|
|
|
|
/// Generated prost + tonic types for salience.v1. Call sites use
|
|
|
|
|
/// `pb::OpenSessionRequest`, `pb::Token`, etc.
|
|
|
|
|
pub mod pb {
|
|
|
|
|
tonic::include_proto!("salience.v1");
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
pub type SalienceClient = pb::salience_client::SalienceClient<Channel>;
|
|
|
|
|
|
|
|
|
|
/// Open a TLS-aware gRPC channel to the salience server. `base_url`
|
|
|
|
|
/// looks like `https://host:8443`. User-provided CA certs under
|
|
|
|
|
/// `~/.consciousness/certs/` are trusted in addition to the system
|
|
|
|
|
/// roots (for self-signed server certs).
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
///
|
|
|
|
|
/// Returns the raw `Channel` so callers (`ApiClient::salience_client`)
|
|
|
|
|
/// can cache it and clone a `SalienceClient` per request without
|
|
|
|
|
/// reopening the TCP/TLS connection. tonic multiplexes RPCs over the
|
|
|
|
|
/// shared channel automatically.
|
|
|
|
|
pub async fn connect_channel(base_url: &str) -> Result<Channel> {
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
let mut endpoint = Endpoint::from_shared(base_url.to_string())
|
|
|
|
|
.with_context(|| format!("invalid salience endpoint: {}", base_url))?
|
|
|
|
|
.connect_timeout(std::time::Duration::from_secs(30))
|
|
|
|
|
.timeout(std::time::Duration::from_secs(600));
|
|
|
|
|
|
|
|
|
|
if base_url.starts_with("https://") {
|
|
|
|
|
let user_certs = super::http::load_user_certs_pem_bytes();
|
|
|
|
|
let mut tls = ClientTlsConfig::new().with_native_roots();
|
|
|
|
|
if !user_certs.is_empty() {
|
|
|
|
|
tls = tls.ca_certificate(Certificate::from_pem(user_certs));
|
|
|
|
|
}
|
|
|
|
|
endpoint = endpoint
|
|
|
|
|
.tls_config(tls)
|
|
|
|
|
.with_context(|| "configuring tonic TLS")?;
|
|
|
|
|
}
|
|
|
|
|
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
endpoint
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
.connect()
|
|
|
|
|
.await
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
.with_context(|| format!("failed to connect to salience server at {}", base_url))
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Derive the gRPC base URL from the HTTP completions base URL.
|
|
|
|
|
///
|
|
|
|
|
/// vLLM's salience gRPC server listens on a different port (8443) from
|
|
|
|
|
/// the HTTP endpoint (8000) and accepts no path component. Given an
|
|
|
|
|
/// HTTP base like `https://host:8000/v1`, produce `https://host:8443`.
|
|
|
|
|
/// No-op when the path is empty and the port isn't 8000.
|
|
|
|
|
pub fn derive_grpc_url(http_base: &str) -> String {
|
|
|
|
|
let mut url = http_base.trim_end_matches('/').to_string();
|
|
|
|
|
if let Some(proto_end) = url.find("://") {
|
|
|
|
|
let rest_start = proto_end + 3;
|
|
|
|
|
if let Some(path_slash) = url[rest_start..].find('/') {
|
|
|
|
|
url.truncate(rest_start + path_slash);
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
url.replace(":8000", ":8443")
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Attach a bearer token to a tonic request as gRPC metadata.
|
|
|
|
|
pub fn with_auth<T>(req: &mut tonic::Request<T>, api_key: &str) {
|
|
|
|
|
if api_key.is_empty() {
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
let bearer = format!("Bearer {}", api_key);
|
|
|
|
|
if let Ok(val) = bearer.parse() {
|
|
|
|
|
req.metadata_mut().insert("authorization", val);
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
/// Handle to a server-side session. Carries the id + an `ApiClient`
|
|
|
|
|
/// clone (which holds the shared tonic Channel) so subsequent
|
|
|
|
|
/// per-session RPCs go over the process-global connection.
|
|
|
|
|
/// `committed_len` tracks the server's current session.tokens length
|
|
|
|
|
/// so the client can submit deltas with the right `offset`.
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
pub struct SessionHandle {
|
|
|
|
|
pub session_id: String,
|
|
|
|
|
pub max_model_len: u32,
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
pub committed_len: u32,
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
client: super::ApiClient,
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
impl SessionHandle {
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
pub async fn open(client: &super::ApiClient) -> Result<Self> {
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
let t0 = std::time::Instant::now();
|
|
|
|
|
log::debug!(target: "grpc", "OpenSession rpc: start");
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
let mut c = client.salience_client().await?;
|
|
|
|
|
let mut req = tonic::Request::new(pb::OpenSessionRequest {
|
|
|
|
|
model: client.model.clone(),
|
|
|
|
|
});
|
|
|
|
|
with_auth(&mut req, client.api_key());
|
|
|
|
|
let resp = c
|
|
|
|
|
.open_session(req)
|
|
|
|
|
.await
|
|
|
|
|
.with_context(|| "OpenSession RPC failed")?
|
|
|
|
|
.into_inner();
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
log::debug!(target: "grpc",
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
"OpenSession rpc: done session_id={} max_model_len={} elapsed={:?}",
|
|
|
|
|
resp.session_id, resp.max_model_len, t0.elapsed());
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
Ok(Self {
|
|
|
|
|
session_id: resp.session_id,
|
|
|
|
|
max_model_len: resp.max_model_len,
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
committed_len: 0,
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
client: client.clone(),
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
})
|
|
|
|
|
}
|
|
|
|
|
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
pub fn client(&self) -> &super::ApiClient { &self.client }
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
/// Debug-only: fetch the server's full session.tokens. Used to
|
|
|
|
|
/// verify client-side accounting byte-for-byte when divergence
|
|
|
|
|
/// is suspected. Not cheap on large sessions.
|
|
|
|
|
pub async fn dump_tokens(&self) -> Result<Vec<u32>> {
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
let mut c = self.client.salience_client().await?;
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
let mut req = tonic::Request::new(pb::DumpSessionRequest {
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
session_id: self.session_id.clone(),
|
|
|
|
|
});
|
|
|
|
|
with_auth(&mut req, self.client.api_key());
|
|
|
|
|
let resp = c
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
.dump_session(req)
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
.await
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
.with_context(|| "DumpSession RPC failed")?
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
.into_inner();
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
Ok(resp.tokens)
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Open a gRPC Generate stream with the given request. Caller
|
|
|
|
|
/// iterates the returned stream of GenerateEvents; the handle's
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
/// `committed_len` should be advanced by the caller on Done based
|
|
|
|
|
/// on the Done event's `total_tokens` field.
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
pub async fn generate(
|
|
|
|
|
&self,
|
|
|
|
|
req: pb::GenerateRequest,
|
|
|
|
|
) -> Result<tonic::Streaming<pb::GenerateEvent>> {
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
let t0 = std::time::Instant::now();
|
|
|
|
|
log::debug!(target: "grpc",
|
|
|
|
|
"Generate rpc: open-stream session={} offset={} append={} max_tokens={}",
|
|
|
|
|
self.session_id, req.offset, req.append_tokens.len(), req.max_tokens);
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
let mut c = self.client.salience_client().await?;
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
let mut req = tonic::Request::new(req);
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
with_auth(&mut req, self.client.api_key());
|
|
|
|
|
let resp = c
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
.generate(req)
|
|
|
|
|
.await
|
|
|
|
|
.with_context(|| "Generate RPC failed")?;
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
log::debug!(target: "grpc",
|
|
|
|
|
"Generate rpc: stream opened session={} open-latency={:?}",
|
|
|
|
|
self.session_id, t0.elapsed());
|
agent: end-to-end gRPC Generate with delta-based session orchestration
Wires the client side of the new salience protocol so inference
actually runs over gRPC instead of emitting the stubbed "not yet
wired" error. Each turn walks the AST as interleaved chunks, sends
only what's new to the server, and streams decode tokens back.
context.rs:
* `WireChunk` enum: `Tokens(Vec<u32>)` or `Image { bytes, mime,
known_expanded_len }`. Preserves text/image/text ordering the
wire path can't flatten.
* `wire_chunks(range, skip)` walker, parallel to `wire_prompt` —
branches emit `<|im_start|>…<|im_end|>` tokens, image leaves
emit a single Image chunk (no inline vision tokens).
* `NodeLeaf::set_image_token_count(n)` + recompute of cached
`token_ids`; `ContextState::commit_image_token_counts(&[u32])`
fills in the first-N zero-count image leaves in wire order.
* `ResponseParser::run` handles the new
`StreamToken::ImageAppended` by committing the server's N into
the AST before the final Generate's Token events stream in.
salience.rs:
* `SessionHandle` tracks `committed_len`. `append_image` advances
it from the RPC response. New `generate(req)` opens the
server-streaming RPC.
api/mod.rs:
* `stream_session_mm(session_lock, chunks, sampling, priority,
readout_shape)` replaces the stub. Spawns `run_session_generate`.
* `run_session_generate`: takes the session out of the Mutex (or
opens fresh), skips chunks covered by `committed_len` (bails on
mid-chunk straddle or unknown-length image in the committed
prefix), walks the delta: accumulates Tokens into `pending`, on
Image flushes pending via `flush_pending` (max_tokens=0 Generate
that just prefills), then AppendImage + emits
StreamToken::ImageAppended. Final Generate carries any trailing
pending text as `append_tokens` and the sampling params; Token
events stream out as StreamToken::Token, Done as
StreamToken::Done. On success, handle with updated
`committed_len` returns to the Mutex; on error, handle drops
and next call reopens.
* `StreamToken::ImageAppended { placeholder_count }` variant —
emitted in wire order before the final Generate's tokens.
* Prefix-cache cap for readout coverage: `readout_ranges` covers
`[prompt_len_after_append, u32::MAX)` when the caller provides
a readout_shape, so decode positions stream their readouts.
agent/mod.rs:
* `assemble_prompt` returns `Vec<WireChunk>` with the assistant
prologue merged into the trailing Tokens chunk. Caller in
`turn` passes chunks + readout_shape (pulled from
`agent.readout.lock().manifest`) to `stream_session_mm`.
* Dropped `assemble_prompt_tokens` — dead.
mind + unconscious:
* `Unconscious::new(client)` stores a shared `ApiClient`. Fixes
the repeated-manifest-fetch bug caused by each subagent's
`ApiClient::new` having its own OnceCell. The client's Arc-
wrapped manifest cache is now shared across every agent Mind
spawns.
* `prepare_spawn(name, auto, wake, base_client)` clones the base
client and overrides `.model` for the resolved backend instead
of constructing fresh. All three callers
(`toggle`/`trigger`/unconscious loop) pass `self.client.clone()`.
* `Mind::new` passes `agent.client.clone()` into
`Unconscious::new`.
subconscious/generate.rs:
* gen_continuation switched to `wire_chunks` + the new
`stream_session_mm` signature. Ephemeral session opens on each
call, tears down at scope end. No readouts requested.
Not changed yet, noted for follow-up:
* Subconscious ablation scoring in learn.rs still talks to
`/v1/score` over HTTP. Will migrate once we have time to verify
the Generate+max_tokens=0+prompt_logprobs path end-to-end.
* compare.rs constructs its own ApiClient for the
`compare.test_backend` (which is intentionally a different
endpoint) — left alone.
* Readout manifest still fetched via HTTP at Agent::new.
Migration to GetReadoutManifest gRPC is a separate cleanup.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:27:55 -04:00
|
|
|
Ok(resp.into_inner())
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
}
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
|
|
|
|
|
/// Run a prefill-only Generate (max_tokens=0) that appends the
|
|
|
|
|
/// given tokens to the session. No decode, no Token events — the
|
|
|
|
|
/// server just extends session.tokens and runs prefill to warm
|
|
|
|
|
/// the KV cache. Used to interleave text runs between AppendImage
|
|
|
|
|
/// calls, and by score paths that want prompt_logprobs without a
|
|
|
|
|
/// decode step.
|
|
|
|
|
pub async fn prefill_only(&mut self, tokens: Vec<u32>) -> Result<()> {
|
|
|
|
|
use futures::StreamExt;
|
|
|
|
|
let req = pb::GenerateRequest {
|
|
|
|
|
session_id: self.session_id.clone(),
|
|
|
|
|
append_tokens: tokens,
|
|
|
|
|
offset: self.committed_len,
|
|
|
|
|
truncating: false,
|
|
|
|
|
max_tokens: 0,
|
|
|
|
|
logprobs_ranges: Vec::new(),
|
|
|
|
|
logprob_top_k: 0,
|
|
|
|
|
readout_ranges: Vec::new(),
|
|
|
|
|
temperature: 0.0,
|
|
|
|
|
top_p: 0.0,
|
|
|
|
|
top_k: 0,
|
|
|
|
|
stop_token_ids: Vec::new(),
|
|
|
|
|
priority: 0,
|
salience: client-side pad expansion, drop AppendImage
Mirrors the vLLM-side rewrite. AppendImage is gone; images now
ride along on Generate via a parallel `images` list.
- Productionize `qwen3_image_token_count` (was test-only). Image
leaf computes its IMAGE_PAD count eagerly at construction from
height/width; `token_count` is no longer "0 until the server
tells us."
- WireChunk shrinks to a single `Tokens(Vec<u32>)` variant — vision
blocks live inline in the token stream.
- `wire_chunks` now returns `(Vec<WireChunk>, Vec<WireImage>)`.
`WireImage` carries `pad_start` / `pad_end` (absolute positions
in the full walk) alongside bytes + mime.
- `assemble_prompt` returns `(chunks, images, match_upto)`.
- `stream_session_mm` / `run_session_generate` take the parallel
images list, filter to those past `match_upto`, and pass them
in `GenerateRequest.images` as `pb::ImageAttachment` entries.
- Drop `SessionHandle::append_image`,
`ContextState::commit_image_token_counts`,
`StreamToken::ImageAppended`, the WireChunk::Image branch in
`learn.rs`, and the now-empty `prompt_to_chunks` helper.
- Add 'v' toggle on the conscious-screen tree to render token-id
vectors in place of text content (debug-aid: lets us see what
the server actually has when output is suspicious).
- Comment out the subconscious-trigger spawn loop — Kent had this
disabled before; it had crept back into running.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 20:26:47 -04:00
|
|
|
images: Vec::new(),
|
agent: share one tonic Channel + migrate scoring to gRPC Generate
Two changes that bolt together — the shared connection means the new
scoring path actually costs one HTTP/2 handshake across the whole
process instead of one-per-RPC.
ApiClient gains `salience_channel: Arc<OnceCell<Channel>>`. First
call to `ApiClient::salience_client()` opens the channel via
`connect_channel()` and stores the Channel; subsequent calls clone
it (cheap — tonic multiplexes concurrent RPCs over the single
HTTP/2 connection). Every ApiClient clone shares the same OnceCell,
so all agents spawned from Mind's client — plus every ephemeral
scoring session — reuse one connection.
SessionHandle refactored to hold an `ApiClient` clone instead of
a bag of (base_url, api_key) strings. `open` / `append_image` /
`generate` go through `self.client.salience_client()` now. New
`prefill_only(tokens)` method encapsulates the "Generate with
max_tokens=0 to append text" pattern (previously a private free
function in api/mod.rs called `flush_pending`). Drop impl on
SessionHandle stays — still fires CloseSession on the shared
channel in a detached task.
`run_session_generate` switched from `(base_url, api_key, model)`
to `&ApiClient`; the agent-turn flow that uses it keeps the same
shape but `stream_session_mm` clones the ApiClient into the
spawned worker.
learn.rs migrated from the HTTP `/v1/score` endpoint to a gRPC
session-based score:
* `call_score` opens an ephemeral SessionHandle on the client,
converts (prompt_tokens, images) → Vec<WireChunk> via the new
`prompt_to_chunks` helper (splits on VISION_START/VISION_END),
walks chunks calling `prefill_only` + `append_image`, runs a
final Generate with `max_tokens=0` + `logprobs_ranges` over
the scored positions, and sums each Token event's
`sampled_logprob` per range to produce `ScoreResult`s.
* SessionHandle drops at end of scope → CloseSession auto-fires,
keeping the server's session map clean between calls.
* No more HTTP path, no more `http_client()` helper, no more
`ScoreResponse` / serde plumbing for /v1/score.
* `send_to_train` still uses HTTP (it talks to /v1/train which
isn't on the gRPC protocol); its ad-hoc HTTP client lives
inline now instead of reaching for the deleted `http_client()`.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 12:51:53 -04:00
|
|
|
};
|
|
|
|
|
let mut stream = self.generate(req).await?;
|
|
|
|
|
while let Some(event) = stream.next().await {
|
|
|
|
|
let event = event.map_err(|s| anyhow::anyhow!("prefill Generate stream: {}", s))?;
|
|
|
|
|
if let Some(pb::generate_event::Event::Done(d)) = event.event {
|
|
|
|
|
self.committed_len = d.total_tokens;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
Ok(())
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Drop → fire CloseSession in a detached task so servers don't leak
|
|
|
|
|
/// sessions until TTL eviction. Best-effort: if no tokio runtime is
|
|
|
|
|
/// available we skip; the server's 30min TTL will reap it eventually.
|
|
|
|
|
impl Drop for SessionHandle {
|
|
|
|
|
fn drop(&mut self) {
|
|
|
|
|
if self.session_id.is_empty() {
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
let session_id = std::mem::take(&mut self.session_id);
|
|
|
|
|
let client = self.client.clone();
|
|
|
|
|
let Ok(rt) = tokio::runtime::Handle::try_current() else {
|
|
|
|
|
log::debug!(target: "grpc",
|
|
|
|
|
"SessionHandle drop outside tokio runtime, session {} leaks to TTL",
|
|
|
|
|
session_id);
|
|
|
|
|
return;
|
|
|
|
|
};
|
|
|
|
|
rt.spawn(async move {
|
|
|
|
|
let Ok(mut c) = client.salience_client().await else { return };
|
|
|
|
|
let mut req = tonic::Request::new(pb::CloseSessionRequest {
|
|
|
|
|
session_id: session_id.clone(),
|
|
|
|
|
});
|
|
|
|
|
with_auth(&mut req, client.api_key());
|
|
|
|
|
if let Err(e) = c.close_session(req).await {
|
|
|
|
|
log::debug!(target: "grpc",
|
|
|
|
|
"CloseSession on drop failed for {}: {:#}",
|
|
|
|
|
session_id, e);
|
|
|
|
|
}
|
|
|
|
|
});
|
|
|
|
|
}
|
salience: add gRPC client + TLS plumbing for stateful vllm sessions
Adds the client-side of a stateful gRPC protocol against vllm, plus
the TLS trust machinery so we can talk to self-signed vllm servers.
Protocol (proto/salience.proto):
Bidi-streaming Session RPC carries OpenSession / AppendTokens /
Generate / Cancel from client and SessionReady / PrefillProgress /
Token / GenerateDone / Error from server. Separate Fork unary RPC
for cheap branching (prefix cache shares KV automatically). Plus
ListSessions, CloseSession, GetReadoutManifest admin RPCs.
Per-token readouts ship as packed f32 ([n_layers * n_concepts] per
token, flat). Logprobs use range-selected positions plus a top-k
parameter — empty ranges means no logprobs, any range means emit
sampled-token logprob at those positions, top_k > 0 adds
alternatives.
Client (src/agent/api/salience.rs):
Tonic-generated types under pb::, a connect() helper, with_auth()
for bearer metadata, and a Session handle wrapping the bidi stream:
open() handshakes SessionReady; append() is fire-and-forget;
generate() returns impl Stream<Item = Event> that drains inbound
until Done or terminating Error. One generate at a time per session.
Peak picker (src/agent/salience.rs):
Pure function over ReadoutEntry traces. Per-concept z-score against
trace global stats; contiguous above-threshold regions emit one
peak at the local max. Configurable sigma threshold and min-std
safety floor. Deterministic tie-break on offset then concept name.
12 unit tests covering empty traces, flat channels, single/multi
spikes, contiguous humps, multi-concept independence, trailing
runs, sub-threshold noise, layer-out-of-range, manifest shape
mismatch, and threshold tunability.
TLS (src/agent/api/http.rs):
HttpClient::build now also loads every .pem file under
~/.consciousness/certs/ into the rustls root store — so dropping
a <host>.pem in that directory is enough to trust a new self-
signed server; no code changes per new host. Also installs the
rustls default crypto provider explicitly via OnceLock: tonic's
tls features pulled in both ring and aws-lc-rs on the resolver
path, and rustls 0.23 refuses to auto-pick when either could win.
Build (build.rs, Cargo.toml):
tonic-build generates Rust types from proto/salience.proto at
cargo-build time, using a vendored protoc binary
(protoc-bin-vendored) so no system install is required. New
runtime deps: tonic, prost, async-stream, tokio-stream,
rustls-pemfile.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-23 02:21:07 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[cfg(test)]
|
|
|
|
|
mod tests {
|
|
|
|
|
use super::*;
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn generated_types_compile() {
|
|
|
|
|
// Exercise the shape of the new proto types — if build.rs
|
|
|
|
|
// stops regenerating against the proto, this stops compiling.
|
|
|
|
|
let _open = pb::OpenSessionRequest {
|
|
|
|
|
model: "qwen3-vl".into(),
|
|
|
|
|
};
|
|
|
|
|
let _tok = pb::Token {
|
|
|
|
|
id: 42,
|
|
|
|
|
position: 0,
|
|
|
|
|
is_prefill: false,
|
|
|
|
|
readout: vec![0.1, 0.2, 0.3],
|
|
|
|
|
logprobs: vec![pb::TokenLogprob {
|
|
|
|
|
id: 1,
|
|
|
|
|
logprob: -0.5,
|
|
|
|
|
}],
|
|
|
|
|
sampled_logprob: -0.1,
|
|
|
|
|
has_sampled_logprob: true,
|
|
|
|
|
};
|
|
|
|
|
let _done = pb::GenerateDone {
|
|
|
|
|
prompt_tokens: 10,
|
|
|
|
|
completion_tokens: 20,
|
|
|
|
|
total_tokens: 30,
|
|
|
|
|
finish_reason: pb::generate_done::FinishReason::Eos as i32,
|
|
|
|
|
};
|
|
|
|
|
let _evt = pb::GenerateEvent {
|
|
|
|
|
event: Some(pb::generate_event::Event::Done(_done)),
|
|
|
|
|
};
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn derive_grpc_url_cases() {
|
|
|
|
|
assert_eq!(
|
|
|
|
|
derive_grpc_url("https://host:8000/v1"),
|
|
|
|
|
"https://host:8443",
|
|
|
|
|
);
|
|
|
|
|
assert_eq!(
|
|
|
|
|
derive_grpc_url("https://host:8000/"),
|
|
|
|
|
"https://host:8443",
|
|
|
|
|
);
|
|
|
|
|
assert_eq!(
|
|
|
|
|
derive_grpc_url("https://host:9000/v1"),
|
|
|
|
|
"https://host:9000",
|
|
|
|
|
);
|
|
|
|
|
}
|
|
|
|
|
}
|