Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Sean Conner sean at conman.org
Thu Nov 26 08:09:34 GMT 2020


It was thus said that the Great Krixano once stated:
> 
> Exactly! When I first got my server up, I didn't have a robots.txt for the
> longest time. Some of my content was actually not supposed to be archived
> because it was dynamic stuff. And other stuff I didn't necessarily want
> archived.

  It is weird to think of autonomous agents crawling the Internet, but they
exist.  They can make requests just as humans (using a program) can.  The
server has no concept of who or what is behind any given request, and this
is especially true for Gemini (as it has no concept of a user-agent
identifier being sent).  This was a problem with HTTP in the early days as
well, and in 1994 (only five years after HTTP was created) an ad-hoc method
was developed to help guide autonomous agents away from particular areas
that could lead to infinite loops of requests.

  Yes, it's sad that you had to learn about this the hard way.  Yes, the
Gemini spec should make mention of the robots.txt standard, and perhaps
servers can issue a warning if a robots.txt file is missing.  Or perhaps
they can include a sample robots.txt file for the end user to modify.  I
just recently added a sample robots.txt file to my server source code [1].
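  For reference, a sample file need not be elaborate.  The conservative
disallow-all form from the subject line, which tells compliant agents to
fetch nothing at all, is only two lines:

```
User-agent: *
Disallow: /
```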

  I first learned of robots.txt in the 90s.  I started seeing requests to
"/robots.txt" in the logs and, curious about it, found it was an ad-hoc
standard for controlling autonomous agents.  I wonder whether writing an
autonomous agent that requests *just* /robots.txt, making it show up in
logs [2], would do any good.  This is also how I found out about
/humans.txt [3] (and about a bazillion ways a web server can be exploited,
but I digress).
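  Such a robots.txt-only probe would be tiny.  As a sketch (hypothetical
names; the network call is defined but not invoked, so this stays offline):
a Gemini request is just the full URL followed by CRLF, sent over TLS to
port 1965.

```python
import socket
import ssl

def build_request(host: str) -> bytes:
    """Build the single-line Gemini request for /robots.txt."""
    return f"gemini://{host}/robots.txt\r\n".encode("utf-8")

def fetch_robots(host: str, port: int = 1965, timeout: float = 10.0) -> bytes:
    """Fetch robots.txt from a Gemini server (raw response, header included)."""
    ctx = ssl.create_default_context()
    # Geminispace commonly uses self-signed certificates (trust-on-first-use),
    # so verification is relaxed for this sketch.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(build_request(host))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks)

print(build_request("example.org"))  # → b'gemini://example.org/robots.txt\r\n'
```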

  -spc

[1]	https://github.com/spc476/GLV-1.12556

	It's under the share directory.  But I can see that I should clarify
	one of the comments in that file, because it will only block
	autonomous agents that follow robots.txt, as it's advisory and not
	something that can be automatically enforced.

[2]	I know logging is also pretty contentious in Geminispace.

[3]	http://humanstxt.org/


More information about the Gemini mailing list