agent: add NodeBody::Image for Qwen3-VL vision input

Images are rendered as `<|vision_start|>` + N × `<|image_pad|>` +
`<|vision_end|>` where N is computed from the image dimensions using
Qwen3-VL's smart_resize rules (patch_size=16, merge_size=2, min=64K,
max=16M pixels). The token count matches what vLLM will produce at
request time, so budget accounting stays accurate.
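
The smart_resize arithmetic described above can be sketched as follows. This is a minimal sketch assuming the Qwen2-VL-style rounding rules (round each side to a multiple of patch_size × merge_size, rescale if the pixel count falls outside [min, max], then count one pad token per merged 2×2 patch group); the function name, the 64K/16M interpretations, and the exact rounding details are assumptions, not the commit's code.

```rust
/// Sketch of smart_resize-style token counting (assumed rules, not the
/// commit's actual implementation).
fn image_pad_count(height: u32, width: u32) -> u32 {
    const PATCH_SIZE: u32 = 16;
    const MERGE_SIZE: u32 = 2;
    const FACTOR: f64 = (PATCH_SIZE * MERGE_SIZE) as f64; // 32
    const MIN_PIXELS: f64 = 64.0 * 1024.0;          // "min=64K" (assumed 64*1024)
    const MAX_PIXELS: f64 = 16.0 * 1024.0 * 1024.0; // "max=16M" (assumed 16*1024^2)

    let (h, w) = (height as f64, width as f64);
    // Round each side to the nearest multiple of patch_size * merge_size.
    let mut h_bar = (h / FACTOR).round().max(1.0) * FACTOR;
    let mut w_bar = (w / FACTOR).round().max(1.0) * FACTOR;
    if h_bar * w_bar > MAX_PIXELS {
        // Too many pixels: shrink, rounding each side down to a factor multiple.
        let beta = (h * w / MAX_PIXELS).sqrt();
        h_bar = (h / beta / FACTOR).floor() * FACTOR;
        w_bar = (w / beta / FACTOR).floor() * FACTOR;
    } else if h_bar * w_bar < MIN_PIXELS {
        // Too few pixels: grow, rounding each side up to a factor multiple.
        let beta = (MIN_PIXELS / (h * w)).sqrt();
        h_bar = (h * beta / FACTOR).ceil() * FACTOR;
        w_bar = (w * beta / FACTOR).ceil() * FACTOR;
    }
    // One <|image_pad|> per merged patch group: (h/p)*(w/p) patches, merged 2x2.
    let n = (h_bar / PATCH_SIZE as f64) * (w_bar / PATCH_SIZE as f64)
        / (MERGE_SIZE * MERGE_SIZE) as f64;
    n as u32
}

fn main() {
    // 1024x1024 is already a multiple of 32 and within the pixel budget:
    // (1024/16)^2 / 4 = 1024 pad tokens.
    println!("{}", image_pad_count(1024, 1024));
}
```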

Bytes are stored inline on the leaf and base64-encoded in the JSON
form. Token IDs are hand-assembled instead of re-running the tokenizer
on a potentially huge placeholder string.
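
Hand-assembling the IDs reduces to one concatenation: start token, N pads, end token. A sketch with hypothetical special-token ID values (the real values come from the tokenizer's vocabulary):

```rust
// Hypothetical special-token ids for illustration only; the real values
// must be looked up from the tokenizer, not hardcoded.
const VISION_START: u32 = 151652;
const IMAGE_PAD: u32 = 151655;
const VISION_END: u32 = 151653;

/// Build <|vision_start|> + n_pads * <|image_pad|> + <|vision_end|>
/// without round-tripping a huge placeholder string through the tokenizer.
fn image_token_ids(n_pads: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(n_pads + 2);
    ids.push(VISION_START);
    ids.extend(std::iter::repeat(IMAGE_PAD).take(n_pads));
    ids.push(VISION_END);
    ids
}

fn main() {
    // A 1024-pad image yields 1026 ids total.
    println!("{}", image_token_ids(1024).len());
}
```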

Follow-ups: view_image tool rewrite, multi_modal_data on the vLLM
request, API-layer plumbing from leaf bytes to request body.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 18:00:10 -04:00
parent 592a3e2e52
commit 0bf71b9110
3 changed files with 211 additions and 20 deletions

@@ -486,6 +486,11 @@ impl InteractScreen {
 if t.is_empty() { vec![] }
 else { vec![(PaneTarget::ToolResult, text, Marker::None)] }
 }
+NodeBody::Image { orig_height, orig_width, .. } => {
+    vec![(PaneTarget::Conversation,
+        format!("[image {}x{}]", orig_width, orig_height),
+        Marker::None)]
+}
 }
 }
 AstNode::Branch { role, children, .. } => {