vLLM priority scheduling for agents

Thread request priority through the API call chain to vLLM's
priority scheduler. Lower value = higher priority, with preemption.

Priority is set per-agent in the .agent header:
- interactive (runner): 0 (default, highest)
- surface-observe: 1 (near-realtime, watches conversation)
- all other agents: 10 (batch, default if not specified)

Requires vLLM started with --scheduling-policy priority.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
This commit is contained in:
Kent Overstreet 2026-04-01 23:21:39 -04:00
parent 503e2995c1
commit c72eb4d528
8 changed files with 27 additions and 7 deletions

View file

@ -26,6 +26,7 @@ pub async fn stream_events(
ui_tx: &UiSender,
reasoning_effort: &str,
temperature: Option<f32>,
priority: Option<i32>,
) -> Result<()> {
let request = ChatRequest {
model: model.to_string(),
@ -44,6 +45,7 @@ pub async fn stream_events(
None
},
chat_template_kwargs: None,
priority,
};
let url = format!("{}/chat/completions", base_url);