Add /v1/completions streaming path with raw token IDs

New stream_completions() in openai.rs sends the prompt as raw token IDs
to the completions endpoint instead of JSON messages to chat/completions.
It handles <think> tags in the response (splitting them out as Reasoning
events) and stops on the <|im_end|> token.
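A minimal sketch of the two response-handling behaviors described above. The types and function names here are simplified stand-ins for illustration, not the repository's actual code:

```rust
/// Simplified stand-in for the crate's stream event type.
#[derive(Debug, PartialEq)]
enum StreamEvent {
    Reasoning(String),
    Content(String),
}

/// Split text containing <think>...</think> spans into Reasoning and
/// Content events, in order of appearance.
fn split_think_tags(text: &str) -> Vec<StreamEvent> {
    let mut events = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("<think>") {
        if start > 0 {
            events.push(StreamEvent::Content(rest[..start].to_string()));
        }
        let after = &rest[start + "<think>".len()..];
        match after.find("</think>") {
            Some(end) => {
                events.push(StreamEvent::Reasoning(after[..end].to_string()));
                rest = &after[end + "</think>".len()..];
            }
            None => {
                // Unterminated tag: treat the remainder as reasoning.
                events.push(StreamEvent::Reasoning(after.to_string()));
                rest = "";
            }
        }
    }
    if !rest.is_empty() {
        events.push(StreamEvent::Content(rest.to_string()));
    }
    events
}

/// Truncate output at the <|im_end|> stop token, if present.
fn strip_at_im_end(text: &str) -> &str {
    // split() always yields at least one item, so next() is safe here.
    text.split("<|im_end|>").next().unwrap_or(text)
}

fn main() {
    let events = split_think_tags("a<think>plan</think>answer");
    assert_eq!(
        events,
        vec![
            StreamEvent::Content("a".into()),
            StreamEvent::Reasoning("plan".into()),
            StreamEvent::Content("answer".into()),
        ]
    );
    assert_eq!(strip_at_im_end("done<|im_end|>junk"), "done");
}
```

The real implementation operates on streamed chunks rather than a complete string, so it also has to handle tags that straddle chunk boundaries.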

start_stream_completions() on ApiClient provides the same interface
as start_stream() but takes token IDs instead of Messages.

The turn loop in Agent::turn() uses completions when the tokenizer
is initialized, falling back to the chat API otherwise. This allows
gradual migration: consciousness uses completions (Qwen tokenizer),
while the Claude Code hook still uses the chat API (Anthropic).
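The dispatch described above can be sketched as follows. The Tokenizer here is a toy byte-level stand-in (not the Qwen tokenizer), and Backend is a hypothetical type invented for this illustration:

```rust
/// Toy tokenizer stand-in; the real code would use the Qwen tokenizer.
struct Tokenizer;

impl Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        // Byte-level encoding, purely for illustration.
        text.bytes().map(u32::from).collect()
    }
}

/// Hypothetical representation of the two request paths.
#[derive(Debug, PartialEq)]
enum Backend {
    Completions(Vec<u32>), // raw token IDs for /v1/completions
    Chat(String),          // formatted prompt for /v1/chat/completions
}

/// Use the completions path when a tokenizer is available,
/// otherwise fall back to the chat API.
fn pick_backend(tokenizer: Option<&Tokenizer>, prompt: &str) -> Backend {
    match tokenizer {
        Some(tok) => Backend::Completions(tok.encode(prompt)),
        None => Backend::Chat(prompt.to_string()),
    }
}

fn main() {
    assert_eq!(pick_backend(None, "hi"), Backend::Chat("hi".to_string()));
    assert!(matches!(
        pick_backend(Some(&Tokenizer), "hi"),
        Backend::Completions(_)
    ));
}
```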

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-08 11:42:22 -04:00
parent e9765799c4
commit f458af6dec
3 changed files with 188 additions and 8 deletions

@@ -133,6 +133,34 @@ impl ApiClient {
        (rx, AbortOnDrop(handle))
    }

    /// Start a streaming completion with raw token IDs.
    /// No message formatting — the caller provides the complete prompt as tokens.
    pub(crate) fn start_stream_completions(
        &self,
        prompt_tokens: &[u32],
        sampling: SamplingParams,
        priority: Option<i32>,
    ) -> (mpsc::UnboundedReceiver<StreamEvent>, AbortOnDrop) {
        let (tx, rx) = mpsc::unbounded_channel();
        let client = self.client.clone();
        let api_key = self.api_key.clone();
        let model = self.model.clone();
        let prompt_tokens = prompt_tokens.to_vec();
        let base_url = self.base_url.clone();
        let handle = tokio::spawn(async move {
            let result = openai::stream_completions(
                &client, &base_url, &api_key, &model,
                &prompt_tokens, &tx, sampling, priority,
            ).await;
            if let Err(e) = result {
                let _ = tx.send(StreamEvent::Error(e.to_string()));
            }
        });
        (rx, AbortOnDrop(handle))
    }

    pub(crate) async fn chat_completion_stream_temp(
        &self,
        messages: &[Message],