Brainstorming URL-fetch preprocessing #3

Closed
opened 2026-04-12 09:18:12 +00:00 by spqrz · 0 comments

Currently `web_fetch` in `src/agent/tools/web.rs` returns the first 30000 bytes of the response body, raw HTML included.
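For concreteness, here's a minimal sketch of the truncation as I understand it; the constant and function names are my guesses, not the actual code in `web.rs`:

```rust
// Hypothetical sketch of the current behaviour, not the real implementation.
const MAX_FETCH_BYTES: usize = 30_000;

fn truncate_body(body: &str) -> &str {
    // Take at most 30000 bytes, backing up to a char boundary so we
    // never split a multi-byte UTF-8 sequence.
    let mut end = body.len().min(MAX_FETCH_BYTES);
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    &body[..end]
}
```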

This is a problem because web pages with more than 30k of HTML are _incredibly_ common. E.g. AO3 is currently serving the first chapter of my novel at 53,868 bytes, and cutting it off at 30k lands in the middle of a paragraph. The truncated version is probably OK if all you want is a quick idea of what the page is about, but not if you actually want to read the whole thing.

I'm wondering about preprocessing the HTML to reduce the junk. True, POC may sometimes need to see the actual original HTML, just as humans sometimes need to use "View Source", but we don't _live_ in a View Source world (well, I did when I used the web over email in 1996, but we've moved on). Since the model has access to a command shell, nothing stops POC from running her own `curl` and using whatever custom code is needed to investigate the HTML any way she wants, so I think `web_fetch` should be optimised for the most common case of "just tell me what the page _says_".

**My idea: html2markdown.** Running this manually reduced that chapter response from 53,868 bytes to 23,003 bytes, meaning the whole thing fits within the 30k limit.
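In `web_fetch` terms this would look roughly like the sketch below: spawn the converter, pipe the fetched HTML through it, and only then apply the 30k truncation. The binary name `html2markdown` and its read-stdin/write-stdout behaviour are assumptions about whichever converter we'd actually pick.

```rust
use std::io::Write;
use std::process::{Command, Stdio};
use std::thread;

// Sketch only: convert fetched HTML to Markdown via an external binary.
fn html_to_markdown(html: &str) -> std::io::Result<String> {
    let mut child = Command::new("html2markdown")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Feed stdin from another thread so we can't deadlock if the child
    // starts writing output before it has consumed all of its input.
    let mut stdin = child.stdin.take().expect("stdin was requested above");
    let input = html.to_owned();
    let writer = thread::spawn(move || stdin.write_all(input.as_bytes()));

    let output = child.wait_with_output()?;
    let _ = writer.join();
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

The 30k truncation would then run on the Markdown rather than the raw HTML, so the same budget goes a lot further.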

If I _also_ run it through a "strip out everything except the actual chapter text" stage (manually) I get 12,300 bytes of Markdown (which is less than 1k more than my original source text), but then you lose the link to the next chapter. So adding an algorithmic "strip non-article text" stage would likely save an extra 10k, at the expense of making it more awkward to actually navigate if you need to. Since the model's own attention mechanism is going to be more reliable than any simple algorithm at deciding which parts to cut, I'd err on the side of giving more to the model as long as we're not overly cluttering the input window. So my suggested default is that we _do_ strip the HTML into Markdown but we _don't_ also try to strip site furniture from that Markdown.

html2markdown pros: it's an existing tool we don't have to maintain. Cons: it requires shelling out to a new process on every fetch. Alternative: implement our _own_ HTML-to-Markdown converter embedded in the Rust binary, which would be marginally quicker than spawning a process but more work to maintain.
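There's also a middle ground between "shell out" and "write our own": lean on an existing crate. For example, the `html2md` crate exposes (if I'm remembering its API right) a single `parse_html` call, so the in-process version would be roughly:

```rust
// Sketch, assuming the `html2md` crate and its `parse_html` function;
// the exact crate and version would need checking before we commit.
// Cargo.toml: html2md = "0.2"   (version is a guess)

fn html_to_markdown(html: &str) -> String {
    html2md::parse_html(html)
}
```

No process spawn and nothing external to install, but the conversion quality becomes a dependency we have to vet and track.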

kent closed this issue 2026-04-18 16:51:37 +00:00