Web proxies and robots.txt (was Re: Heads up about a Gemini client @ 198.12.83.123)
Robert "khuxkm" Miles
khuxkm at tilde.team
Mon Nov 30 02:35:14 GMT 2020
November 29, 2020 9:25 PM, "Sean Conner" <sean at conman.org> wrote:
> It was thus said that the Great Sean Conner once stated:
>
>> It's not threatening my server or anything, but whoever is responsible
>> for the client at 198.12.83.123, your client is currently stuck in the
>> Redirection From Hell test and has been for some time. From the length of
>> time, it appears to be running autonomously, so perhaps it's a leftover
>> thread, an autonomous client that doesn't read robots.txt, or one that
>> didn't follow the spec carefully enough.
>>
>> Anyway, just a heads up.
>>
>> -spc
>
> So the client in question was most likely a web proxy. I'm not sure what
> site, nor the software used, but it did respond to a Gemini request with
> "53 Proxy Request Refused", so there *is* a Gemini server there. And the
> fact that it made 137,060 requests before I shut down my own server told me
> that it was an autonomous agent that no one was watching. Usually, I may see a
> client hit 20 or 30 times before it stops. Not this one.
>
> Now granted, my server is a bit unique in that I have tests set up
> specifically for clients to test against, and several of them involve
> infinite redirects. And yes, that was 137,060 *unique* requests.
>
> So first up, Solderpunk, if you could please add a redirection follow
> limit to the specification and make it mandatory. You can specify some
> two-digit, heck, even three-digit number to follow, but please, *please*,
> add it to the specification and *not* just the best practices document, to
> make programmers aware of the issue. It seems like it's too easy to overlook
> this potential trap (I see it often enough).
>
> Second, had the proxy in question fetched robots.txt, I had this area
> specifically marked out:
>
> User-agent: *
> Disallow: /test/redirehell
>
> I have that for a reason, and had the autonomous client in question read
> it, this wouldn't have happened in the first place. Even if you disagree
> with this, it may be difficult to stop an autonomous agent once the user of
> said web proxy has dropped the web connection. I don't know, I haven't
> written a web proxy, and this is one more thing to keep in mind when writing
> one. I think it would be easier to follow robots.txt.
>
> -spc (To the person who called me a dick for blocking a web proxy---yes,
> there *are* reasons to block them)
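
For what it's worth, the follow cap Sean is asking for is only a handful of lines in a client's fetch loop. Here's a rough Python sketch; gemini_fetch is a stand-in for whatever function a client actually uses to make a single request, not any real library:

    import urllib.parse

    MAX_REDIRECTS = 5  # any small, finite number will do; what matters is that it exists

    def fetch_following_redirects(url, gemini_fetch):
        # gemini_fetch(url) is a placeholder: it performs one Gemini request
        # and returns (status, meta, body), with status as a string like "30".
        seen = set()
        for _ in range(MAX_REDIRECTS):
            if url in seen:
                raise RuntimeError("redirect loop at " + url)
            seen.add(url)
            status, meta, body = gemini_fetch(url)
            if status.startswith("3"):
                # 3x redirect: meta holds the target, which may be relative
                url = urllib.parse.urljoin(url, meta)
                continue
            return status, meta, body
        raise RuntimeError("gave up after %d redirects" % MAX_REDIRECTS)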
I recently wrote a gemini-to-web proxy as a simple side project to see how easy it would be to create, and one thing I implemented that I feel should be standard for web proxies is not handling redirects internally. If you tell my proxy to request a page that answers with a redirect (say, the next-page link for LEO), it sends you back a small web page saying "hey, the site at this URL wants to send you to this other URL, do you want to follow that redirect or nah?" (not the exact wording, but you get my drift). That is, if you tried to access the Redirection From Hell test through my proxy, each and every redirect would be a "confirm redirect" page served to the user; a rough sketch of what that looks like is below. After about 20 pages, you'd think the user would catch on.

That being said, my gemini proxy isn't linked anywhere on my website (and if it were somewhere I would link publicly, I would use robots.txt to keep web crawlers from accessing it), so perhaps I'm not the target of this message.
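
To make that concrete, the redirect handling boils down to something like the sketch below. This is illustrative Python, not the actual code of my proxy, and the /proxy?url= route is made up for the example:

    import html
    import urllib.parse

    def render_gemini_response(status, meta, body, requested_url):
        # status, meta, body come from a single Gemini request. A 3x status
        # becomes a confirmation page; the proxy never follows it on its own.
        if status.startswith("3"):
            target = urllib.parse.urljoin(requested_url, meta)
            proxied = "/proxy?url=" + urllib.parse.quote(target, safe="")
            page = ("<p>The site at {0} wants to send you to {1}.</p>\n"
                    "<p><a href=\"{2}\">Follow the redirect</a>, or just go back.</p>")
            return page.format(html.escape(requested_url),
                               html.escape(target),
                               html.escape(proxied))
        # success and error statuses get rendered as usual (omitted here)
        return "<pre>" + html.escape(body) + "</pre>"

Following the redirect is then the user's click, not the proxy's decision, so a trap like the Redirection From Hell test never runs unattended.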
I still maintain that a proxy is a direct agent of a user, not an automated client. Proxy authors should use robots.txt on the web side to keep crawlers from reaching the proxy, but proxies themselves shouldn't have to follow robots.txt on the Gemini side.
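
Concretely, if the proxy were reachable under, say, /proxy/ on the web side (that path is just for illustration), the *web* server's robots.txt would carry something like:

User-agent: *
Disallow: /proxy/

and well-behaved web crawlers would never reach Gemini space through it at all.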
It's actually easier to just write your web proxy in such a way that this doesn't happen to you.
Just my two cents,
Robert "khuxkm" Miles