Investigating user-agent restrictions #2
In `src/agent/tools/web.rs` we currently hard-code `("user-agent", "consciousness/0.3")`. In the past I've had bad experiences of sites not letting me in because of my user-agent (which feels like prejudice against users who aren't "normal", but it's probably just misguided copy-paste admin). It might not be as big a problem as I fear, though, so let's check:
- **Archive Of Our Own:** This site is currently behind Cloudflare (a popular CDN), so it makes a good test of what Cloudflare does. It's currently responding OK when I load it with `lynx -useragent "consciousness/0.3"`, with some acceptable delay on the second fetch. I think AO3 have Cloudflare set to "don't get in the way unless we're actually under attack", so you should be able to fetch pages from it most days unless you're unlucky enough to want it at a time it's being DDoS'd (at which point you'd have to run a graphical browser under Selenium, solve CAPTCHAs, etc.; personally I'd just wait till it's back to normal). So no UA change needed here.
- **lame.buanzo.org:** In 2024 I had a problem with this site when I linked to it to tell Windows users where to download an MP3 encoder, and then found my automatic "are my links still OK" checker was unable to fetch it due to a more paranoid Cloudflare setting. However, I tested it just now with `consciousness/0.3` and it's working, so no UA change needed now.
- **www.ncbi.nlm.nih.gov:** In 2021 I had a problem with this site not replying to my automatic link checker until I changed its User-Agent to Firefox, but it's currently working with `consciousness/0.3`, so perhaps they fixed the issue; no change needed now.
- **www.greatercambridgewaste.org:** Recently blocked Python's default user agent, but does not appear to be blocking `consciousness/0.3`. Maybe it's working off a deny-list rather than an allow-list. No change needed, at least for now.
- **Reddit:** Does not block `consciousness/0.3` as such, but any request with any user-agent string is sent through a JavaScript-based "browser verification" step, so you'll probably need a resource-heavy Selenium-controlled desktop browser (via Web Adjuster or whatever) to read it (unless edbrowse's JS can be up to scratch). Additionally, all IP addresses of major "cloud" providers go straight to "403 blocked" saying you must use a paid developer token or log in. This "sucks", as they say, but changing the user-agent string won't help, so we count it as "no change needed".

So although I was thinking (and mentioned on IRC) that it would be good to put the user-agent string under the model's control (the model could say "oh, fetch didn't work with my default UA, try making it look like Firefox"), I'm now thinking that this might just be clutter for the model to cope with and is best left out: either the fetch works with `consciousness/0.3`, or it's not going to work anyway without our doing something really major like giving it a heavy graphical browser to control (which is an option, but it's a cat-and-mouse game, so I'd be inclined not to implement it until we have a specific example of a site that's important to get into).

(Companies like Reddit dodge the "web accessibility" objection by saying "oh, we're accessible, the blind guy just has to use a full-on desktop screenreader with a desktop browser instead of their actually usable command-line tools". I do wish they'd called it "universal usability" instead of just low-bar "accessibility", but I'm digressing.)
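For the record, the conclusion above amounts to keeping the identity string as a single hard-coded constant. A minimal sketch of that shape (illustrative only; the real `web.rs` builds its headers through whatever HTTP client it uses, and these names are mine, not the actual code's):

```rust
// Illustrative sketch: the user-agent stays a compile-time constant,
// so changing it (if a site ever forces the issue) is a one-line edit
// rather than something the model has to reason about per-request.
const USER_AGENT: &str = "consciousness/0.3";

// Per the issue text, the hard-coded pair is literally
// ("user-agent", "consciousness/0.3").
fn user_agent_header() -> (&'static str, &'static str) {
    ("user-agent", USER_AGENT)
}

fn main() {
    let (name, value) = user_agent_header();
    println!("{}: {}", name, value);
}
```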
Self-closing this issue as it's more of a "here's a log of what I found" than an actual outstanding issue.