vLLM priority scheduling for agents

Thread request priority through the API call chain to vLLM's priority scheduler. Lower value = higher priority, with preemption. Priority is set per-agent in the .agent header: - interactive (runner): 0 (default, highest) - surface-observe: 1 (near-realtime, watches conversation) - all other agents: 10 (batch, default if not specified) Requires vLLM started with --scheduling-policy priority. Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-01 23:21:39 -04:00 · 2026-04-01 23:21:39 -04:00 · c72eb4d528
commit c72eb4d528
parent 503e2995c1
8 changed files with 27 additions and 7 deletions
--- a/src/agent/api/openai.rs
+++ b/src/agent/api/openai.rs
@ -26,6 +26,7 @@ pub async fn stream_events(
    ui_tx: &UiSender,
    reasoning_effort: &str,
    temperature: Option<f32>,
+    priority: Option<i32>,
 ) -> Result<()> {
    let request = ChatRequest {
        model: model.to_string(),
@ -44,6 +45,7 @@ pub async fn stream_events(
            None
        },
        chat_template_kwargs: None,
+        priority,
    };

    let url = format!("{}/chat/completions", base_url);