Web proxies and robots.txt (was Re: Heads up about a Gemini client @ 198.12.83.123)
Sean Conner
sean at conman.org
Mon Nov 30 03:30:42 GMT 2020
It was thus said that the Great Robert khuxkm Miles once stated:
> November 29, 2020 9:25 PM, "Sean Conner" <sean at conman.org> wrote:
>
> > It was thus said that the Great Sean Conner once stated:
> >
> >> It's not threatening my server or anything, but whoever is responsible
> >> for the client at 198.12.83.123, your client is currently stuck in the
> >> Redirection From Hell test and has been for some time. From the length
> >> of time, it appears to be running autonomously, so perhaps it's a
> >> leftover thread, or an autonomous client that doesn't read robots.txt,
> >> or one that didn't follow the spec carefully enough.
> >>
> >> Anyway, just a heads up.
> >>
> >> -spc
> >
> > So the client in question was most likely a web proxy. I'm not sure what
> > site, nor the software used, but it did respond to a Gemini request with
> > "53 Proxy Request Refused" so there *is* a Gemini server there. And the
> > fact that it made 137,060 requests before I shut down my own server told
> > me it was an autonomous agent that no one was watching. Usually, I may
> > see a client hit 20 or 30 times before it stops. Not this one.
> >
> > Now granted, my server is a bit unique in that I have tests set up
> > specifically for clients to test against, and several of them involve
> > infinite redirects. And yes, that was 137,060 *unique* requests.
> >
> > So first up, Solderpunk: please add a redirection follow limit to the
> > specification and make it mandatory. You can pick some two-digit, heck,
> > even three-digit number of redirects to follow, but please, *please*,
> > put it in the specification and *not* just the best practices document,
> > so that programmers are aware of the issue. It seems all too easy to
> > overlook this potential trap (I see it often enough).
> >
> > Second, had the proxy in question fetched robots.txt, I had this area
> > specifically marked out:
> >
> > User-agent: *
> > Disallow: /test/redirehell
> >
> > I have that for a reason, and had the autonomous client in question read
> > it, this wouldn't have happened in the first place. Even if you disagree
> > with this, it may be difficult to stop an autonomous agent once the user of
> > said web proxy has dropped the web connection. I don't know, I haven't
> > written a web proxy, and this is one more thing to keep in mind when writing
> > one. I think it would be easier to follow robots.txt.
> >
> > -spc (To the person who called me a dick for blocking a web proxy---yes,
> > there *are* reasons to block them)
>
> I recently wrote a Gemini-to-web proxy as a simple side project to see
> how easy it would be to create, and one thing I implemented that I feel
> should be standard for web proxies is not handling redirects internally.
> If you tell my proxy to request a page that offers a redirect (say, the
> next page link for LEO), it will send you back a small web page saying
> "hey, the site at this URL wants to send you to this other URL, do you
> want to follow that redirect or nah?" (not exact wording, but you get my
> drift). That is, if you attempt to access the Redirection From Hell test
> using my proxy, each and every redirect would be a "confirm redirect"
> page served to the user. After about 20 pages, you'd think the user
> would catch on. That being said, my proxy is not linked anywhere on my
> website (and if it were somewhere I linked publicly, I would use
> robots.txt to prevent web crawlers from accessing it), so perhaps I'm
> not the target of this message.
You, specifically, weren't the target of my last bit; I'm addressing in
general those who write web proxies for Gemini. Your proxy's method of
handling redirects works. I was just a bit upset that an agent out there
made 137,000 requests [1] before anyone did anything about it.
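For anyone else writing a web proxy, here's a rough sketch of that idea
in Python. This is my own illustration, not khuxkm's actual code; names
like render_response() and the /proxy?url= routing are made up for the
sake of the example:

  import html
  import urllib.parse

  def resolve(base, link):
      # urljoin() doesn't know the gemini: scheme, so borrow http:
      # semantics (same generic URI syntax) to resolve relative links.
      joined = urllib.parse.urljoin(
          base.replace("gemini://", "http://", 1), link)
      if joined.startswith("http://"):
          joined = "gemini://" + joined[len("http://"):]
      return joined

  def render_response(request_url, status, meta, body):
      # On a 3x response, hand the decision back to the human instead
      # of following the redirect inside the proxy.
      if status.startswith("3"):
          target = resolve(request_url, meta)
          return ('<p>%s wants to redirect you to '
                  '<a href="/proxy?url=%s">%s</a>. Follow it?</p>'
                  % (html.escape(request_url),
                     urllib.parse.quote(target, safe=""),
                     html.escape(target)))
      return body  # 2x content (and errors) rendered elsewhere

The point is that the proxy itself never follows a 3x; the human on the
other end has to click through each one, so a redirect loop just gets
tedious instead of racking up 137,000 requests.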
> I still maintain that a proxy is a direct agent of a user, and not an
> automated client. Proxy authors should use robots.txt on the web side to
> block crawlers from accessing the proxy, but proxies shouldn't have to
> follow robots.txt.
I understand the argument, but I can't say I'm completely on board with it
either, because ...
> It's actually easier to just write your web proxy in such a way that this
> doesn't happen to you.
you would probably be amazed at just how often clients *don't* limit
redirect following. Most of the time, someone is sitting there watching
their client, stopping it after perhaps 30 seconds, fixing the first
redirect issue (a redirect back to itself), only to get trapped at the
next step.
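The fix is small. Here's a minimal sketch of the kind of bounded redirect
loop I'd like the specification to require, again in Python; fetch() is a
stand-in for whatever your client uses to make a single Gemini request
and return the status, meta and body:

  MAX_REDIRECTS = 5  # pick your own number; the point is a limit exists

  def fetch_with_redirects(url, fetch):
      seen = set()
      for _ in range(MAX_REDIRECTS):
          if url in seen:
              raise RuntimeError("redirect loop at " + url)
          seen.add(url)
          status, meta, body = fetch(url)
          if status.startswith("3"):
              # 3x: meta holds the new URL (resolve it if it's relative)
              url = meta
              continue
          return status, meta, body
      raise RuntimeError("too many redirects; last stop was " + url)

The seen set catches the redirect-back-to-itself case; the counter
catches everything else, including my Redirection From Hell test.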
And to think I brought this upon myself for wanting redirects in the first
place.
-spc (How ironic)
[1] For the record, it was NOT placing an undue burden on my server,
just cluttering the log files. It only becomes an issue when the log
file reaches 2G in size, at which point logging stops for everything.