Brainstorming URL-fetch preprocessing #3
Currently, `web_fetch` in `src/agent/tools/web.rs` returns the first 30,000 bytes of the response body, including all HTML. This is a problem because web pages with more than 30 kB of HTML are extremely common. For example, AO3 is currently serving the first chapter of my novel at 53,868 bytes, and cutting it off at 30 kB lands in the middle of a paragraph. That's probably fine if all you want is a quick idea of what the page is about, but not if you actually want to read the whole thing.
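Incidentally, if we keep a byte cap at all, the cut should at least land on a UTF-8 character boundary rather than mid-codepoint. A minimal sketch of such a helper (hypothetical, not the current `web.rs` code):

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a UTF-8
/// character. Illustrative helper only, not code from this repo.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from the limit to the nearest char boundary.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```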
I'm wondering about preprocessing the HTML to reduce the junk. True, POC may sometimes need to see the actual original HTML, just as humans sometimes need to use "View Source", but we don't live in a View Source world (well, I did when I used the web over email in 1996, but we've moved on). Since the model has access to a command shell, nothing stops POC from running her own `curl` and using whatever custom code is needed to investigate the HTML any way she wants, therefore I think `web_fetch` should be optimised for the most common case of "just tell me what the page says".

My idea: html2markdown. Running this manually reduced that chapter response from 53,868 bytes to 23,003 bytes, meaning the whole thing fits within the 30 kB limit.
If I also run it through a "strip out everything except the actual chapter text" stage (manually), I get 12,300 bytes of Markdown (less than 1 kB more than my original source text), but then you lose the link to the next chapter. So adding an algorithmic "strip non-article text" stage would likely save another 10 kB, at the expense of making the page more awkward to navigate when you need to. Since the model's own attention mechanism will be more reliable than any simple algorithm at deciding which parts to ignore, I'd err on the side of giving more to the model as long as we're not overly cluttering the input window. So my suggested default: we do convert the HTML into Markdown, but we don't also try to strip site furniture from that Markdown.
html2markdown pros: an existing tool we don't have to maintain. Cons: requires shelling out to a new process on every fetch. Alternative: embed our own HTML-to-Markdown converter in the Rust, which would be marginally quicker than spawning a process but more work to maintain.
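A minimal sketch of the shell-out variant, assuming an `html2markdown` binary on `PATH` (the helper name and shape are illustrative, not code from this repo):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Pipe `input` through an external converter command and return its stdout.
/// In web_fetch this might be called as `pipe_through("html2markdown", &body)`;
/// both the binary name and this helper are assumptions for illustration.
fn pipe_through(cmd: &str, input: &str) -> std::io::Result<String> {
    let mut child = Command::new(cmd)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // Write the HTML body to the child's stdin, then drop the handle so the
    // converter sees EOF and flushes its output.
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(input.as_bytes())?;
    let out = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

Making the command a parameter keeps the converter swappable, and lets you exercise the plumbing with an identity command like `cat`.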