Investigating user-agent restrictions #2

Closed
opened 2026-04-12 08:45:28 +00:00 by spqrz · 1 comment

In `src/agent/tools/web.rs` we currently hard-code `("user-agent", "consciousness/0.3")`.

In the past I've had bad experiences of sites not letting me in because of my user-agent (which feels like prejudice against users who aren't normal but it's probably just misguided copy-paste admin). But it might not be as big of a problem as I fear, so let's check:

Archive Of Our Own: This site is currently behind Cloudflare (a popular CDN), so it makes a good test of "what Cloudflare does". It's currently responding OK when I load it in `lynx -useragent consciousness/0.3`, with an acceptable delay on the 2nd fetch. I think AO3 have Cloudflare set at "don't get in the way unless we're actually under attack", so you should be able to fetch pages from it most days, unless you're unlucky enough to want it at a time when it's being DDoS'd. At those times you'd have to run a graphical browser under Selenium and solve CAPTCHAs etc., and personally I'd just wait till it's back to normal. So no UA change needed here.

lame.buanzo.org: In 2024 I had a problem with this site: I had linked to it to tell Windows users where to download an MP3 encoder, and then found that my automatic "are my links still OK" checker was unable to fetch it because of a more paranoid Cloudflare setting. However, I tested it just now with `consciousness/0.3` and it's working, so no UA change needed now.

www.ncbi.nlm.nih.gov: In 2021 this site would not reply to my automatic link checker until I changed the checker's User-Agent to Firefox's. It's currently working with `consciousness/0.3`, so perhaps they fixed the issue. No change needed now.

www.greatercambridgewaste.org: Recently blocked Python's default user agent, but does not appear to be blocking `consciousness/0.3`. Maybe it's working from a deny-list rather than an allow-list. No change needed, at least for now.
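
To illustrate the deny-list guess, here is a minimal sketch of how the two kinds of filter would treat the same user-agent strings. It is purely hypothetical: the site's real rules are unknown, and the function names and prefix lists are made up for the example. The only real strings are `consciousness/0.3` and the `Python-urllib/` prefix that Python's urllib sends by default (assuming that is the client that got blocked).

```rust
/// Hypothetical deny-list filter: everything is let through
/// unless the user agent starts with a known-bad prefix.
fn deny_list_allows(user_agent: &str, denied_prefixes: &[&str]) -> bool {
    !denied_prefixes.iter().any(|p| user_agent.starts_with(*p))
}

/// Hypothetical allow-list filter: nothing is let through
/// unless the user agent starts with a known-good prefix.
fn allow_list_allows(user_agent: &str, allowed_prefixes: &[&str]) -> bool {
    allowed_prefixes.iter().any(|p| user_agent.starts_with(*p))
}
```

With a deny-list such as `["Python-urllib/"]`, an unfamiliar string like `consciousness/0.3` passes simply because nobody has listed it. With an allow-list such as `["Mozilla/"]` it would be rejected, which is not what the test above showed.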

Reddit: does not block `consciousness/0.3` as such, but any request with any user-agent string is sent through a JavaScript-based "browser verification" step, so you'll probably need a resource-heavy Selenium-controlled desktop browser (via Web Adjuster or whatever) to read it (unless edbrowse's JS turns out to be up to scratch). Additionally, all IP addresses of major "cloud" providers go straight to "403 blocked", with a message saying you must use a paid developer token or log in. Well, this "sucks" as they say, but changing the user-agent string won't help, so we count it as "no change needed".
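
The reasoning across these sites can be summed up in a small sketch. The `Blocking` enum and its variant names are hypothetical labels invented here, not anything in the codebase. The point is only that, of the kinds of blocking seen in this log, just one is affected by the user-agent string.

```rust
/// Hypothetical labels for the kinds of blocking described in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Blocking {
    /// The site answers normally.
    None,
    /// The site rejects particular user-agent strings.
    UserAgentFilter,
    /// Every client must pass a JavaScript "browser verification" step.
    JsVerification,
    /// Requests from certain IP ranges (e.g. cloud providers) get a 403.
    IpRange,
}

/// Would sending a different user-agent string plausibly change the outcome?
fn ua_change_may_help(blocking: Blocking) -> bool {
    matches!(blocking, Blocking::UserAgentFilter)
}
```

In these terms, Reddit falls under `JsVerification` and `IpRange`, for which the function returns `false`. The older NCBI and Python-default cases would have been `UserAgentFilter`.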

So although I was thinking (and mentioned on IRC) that it would be good to put the user-agent string under the model's control (so the model can say "oh, fetch didn't work with my default UA, try making it look like Firefox"), I'm now thinking that this might just be "clutter" for the model to cope with, and best left out. Either the fetch works with `consciousness/0.3`, or it's not going to work anyway without our doing something really major like giving it a heavy graphical browser to control. That is an option, but it's a cat-and-mouse game, so I'd be inclined not to implement it until we have a specific example of a site that's important to get into.
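
For the record, here is a rough sketch of what the rejected "model-controlled UA" option might have looked like. This is an assumption for illustration only, not the actual code in `src/agent/tools/web.rs`: the constant name, the function name and the `override_ua` parameter are all hypothetical, and only the header name and default value come from this issue.

```rust
/// The value this issue says is currently hard-coded.
const DEFAULT_USER_AGENT: &str = "consciousness/0.3";

/// Hypothetical: build the user-agent header pair for a fetch, letting the
/// caller (the model) optionally override the value. A missing or blank
/// override falls back to the default.
fn user_agent_header(override_ua: Option<&str>) -> (&'static str, String) {
    let ua = override_ua
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .unwrap_or(DEFAULT_USER_AGENT);
    ("user-agent", ua.to_string())
}
```

Even this small version adds a parameter to the fetch tool that the model has to know about and decide when to use, which is the "clutter" being weighed against a benefit that none of the sites tested above currently shows.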

(Companies like Reddit dodge the "web accessibility" objection by saying "oh we're accessible, the blind guy just has to use a full-on desktop screenreader with a desktop browser instead of their actually usable command-line tools" and I do wish they'd called it "universal usability" instead of just low-bar "accessibility" but I'm digressing.)

Author

Self-closing this issue as it's more of a "here's a log of what I found" than an actual outstanding issue.
spqrz closed this issue 2026-04-12 08:45:59 +00:00
Reference: kent/consciousness#2