agent: add NodeBody::Image for Qwen3-VL vision input

Images are rendered as `<|vision_start|>` + N × `<|image_pad|>` + `<|vision_end|>`, where N is computed from the image dimensions using Qwen3-VL's smart_resize rules (patch_size=16, merge_size=2, min=64K, max=16M pixels). The token count matches what vLLM will produce at request time, so budget accounting stays accurate.

Bytes are stored inline on the leaf and base64-encoded in the JSON form. Token IDs are hand-assembled instead of re-running the tokenizer on a potentially huge placeholder string.

Follow-ups: view_image tool rewrite, multi_modal_data on the vLLM request, API-layer plumbing from leaf bytes to request body.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
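The placeholder-count rule described in the message can be sketched as follows. This is a simplified reconstruction from the numbers given above (patch_size=16, merge_size=2, 64K/16M pixel bounds), not the commit's actual code; the helper name `image_pad_count` is hypothetical, and Qwen's real smart_resize has additional checks (e.g. on extreme aspect ratios) that are omitted here.

```rust
const PATCH_SIZE: u32 = 16;
const MERGE_SIZE: u32 = 2;
const FACTOR: u32 = PATCH_SIZE * MERGE_SIZE; // resized sides are multiples of 32
const MIN_PIXELS: u32 = 64 * 1024;           // 64K pixel floor
const MAX_PIXELS: u32 = 16 * 1024 * 1024;    // 16M pixel ceiling

/// Number of <|image_pad|> tokens for an image of the given size,
/// following the smart_resize-style rounding described in the commit.
fn image_pad_count(height: u32, width: u32) -> u32 {
    let round_to = |v: f64| -> u32 { ((v / FACTOR as f64).round() as u32).max(1) * FACTOR };
    let (hf, wf) = (height as f64, width as f64);
    // Round each side to the nearest multiple of FACTOR.
    let mut h = round_to(hf);
    let mut w = round_to(wf);
    if h * w > MAX_PIXELS {
        // Too large: scale down, preserving aspect ratio, then floor to FACTOR.
        let beta = (hf * wf / MAX_PIXELS as f64).sqrt();
        h = ((hf / beta / FACTOR as f64).floor() as u32) * FACTOR;
        w = ((wf / beta / FACTOR as f64).floor() as u32) * FACTOR;
    } else if h * w < MIN_PIXELS {
        // Too small: scale up, preserving aspect ratio, then ceil to FACTOR.
        let beta = (MIN_PIXELS as f64 / (hf * wf)).sqrt();
        h = ((hf * beta / FACTOR as f64).ceil() as u32) * FACTOR;
        w = ((wf * beta / FACTOR as f64).ceil() as u32) * FACTOR;
    }
    // One placeholder per merged patch: (h/16)*(w/16) raw patches,
    // merged 2x2, so divide by merge_size^2.
    (h / PATCH_SIZE) * (w / PATCH_SIZE) / (MERGE_SIZE * MERGE_SIZE)
}
```

For example, a 1024×1024 image needs no resizing and yields (1024/16)² / 4 = 1024 placeholders, which is the N that would be spliced between `<|vision_start|>` and `<|vision_end|>`.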
Parent: 592a3e2e52
Commit: 0bf71b9110
3 changed files with 211 additions and 20 deletions
@@ -486,6 +486,11 @@ impl InteractScreen {
                 if t.is_empty() { vec![] }
                 else { vec![(PaneTarget::ToolResult, text, Marker::None)] }
             }
+            NodeBody::Image { orig_height, orig_width, .. } => {
+                vec![(PaneTarget::Conversation,
+                    format!("[image {}x{}]", orig_width, orig_height),
+                    Marker::None)]
+            }
         }
     }
     AstNode::Branch { role, children, .. } => {