robots.txt for Gemini

solderpunk solderpunk at SDF.ORG
Tue Mar 24 19:54:04 GMT 2020


On Sun, Mar 22, 2020 at 10:39:05PM +0000, Krixano wrote:
> You can go to literally any website and append "/robots.txt" to see what they use. I've already seen a couple that use "Allow".

Thanks for that. I've also verified that Python's stdlib function for
parsing robots.txt recognises "Allow", so it seems that this is not as
obscure an option as some sources suggest.  I'd say we may as well
explicitly support this for Gemini.
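A quick way to verify this: Python's stdlib parser (urllib.robotparser) applies rules first-match-wins, so an "Allow" line placed before a broader "Disallow" behaves as expected, even for gemini:// URLs. A minimal sketch with hypothetical rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block everything except /public/.
# Python's parser applies the first matching rule, so the Allow
# line must come before the broader Disallow.
rules = """\
User-agent: *
Allow: /public/
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "gemini://example.org/public/page.gmi"))   # True
print(rp.can_fetch("*", "gemini://example.org/private/notes.gmi")) # False
```

Note that can_fetch() only looks at the path component of the URL, which is why a non-HTTP scheme like gemini:// is no obstacle.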

Cheers,
Solderpunk


> 
> Christian Seibold
> 
> Sent with ProtonMail Secure Email.
> 
> ------- Original Message -------
> On Sunday, March 22, 2020 1:13 PM, solderpunk <solderpunk at SDF.ORG> wrote:
> 
> > Howdy all,
> >
> > As the first and perhaps most important push toward getting some clear
> > guidelines in place for well-behaved non-human Gemini clients (e.g.
> > Gemini to web proxies, search engine spiders, feed aggregators, etc.),
> > let's get to work on adapting robots.txt to Gemini.
> >
> > My current thinking is that this doesn't belong in the Gemini spec
> > itself, much like robots.txt does not belong in the HTTP spec. That
> > said, this feels like it warrants something more than just being put in
> > the Best Practices doc. Maybe we need to start working on official
> > "side specs", too. Not sure what these should be called.
> >
> > Anyway, I've refamiliarised myself with robots.txt. Turns out it is
> > still only a de facto standard without an official RFC. My
> > understanding is based on:
> >
> > -   The https://www.robotstxt.org/ website (which Wikipedia calls the
> >     "Official website" at
> >     https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
> >     clear to me what "official" means for a de facto standard), and in
> >     particular:
> >
> > -   An old draft RFC from 1996 which that site hosts at
> >     https://www.robotstxt.org/norobots-rfc.txt
> >
> > -   A new draft RFC from 2019 which appears to have gotten further than
> >     the first, considering it is hosted by the IETF at
> >     https://tools.ietf.org/html/draft-rep-wg-topic-00
> >
> >     While the 1996 draft is web-specific, I was pleasantly surprised to
> >     see that the 2019 version is not. Section 2.3 says:
> >
> >
> > > As per RFC3986 [1], the URI of the robots.txt is:
> > > "scheme:[//authority]/robots.txt"
> > > For example, in the context of HTTP or FTP, the URI is:
> > > http://www.example.com/robots.txt
> > > https://www.example.com/robots.txt
> > > ftp://ftp.example.com/robots.txt
> >
> > So, why not Gemini too?
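[Deriving the robots.txt URI per that quoted rule is mechanical for any scheme; a minimal Python sketch (example.org is just a placeholder host):]

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(url: str) -> str:
    """Return "scheme://authority/robots.txt" for any absolute URL,
    following section 2.3 of the 2019 draft."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("gemini://example.org/docs/spec.gmi"))
# gemini://example.org/robots.txt
```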
> >
> > Regarding the first practical question which was raised by Sean's recent
> > post, it seems a no-brainer to me that Gemini should retain the
> > convention of there being a single /robots.txt URL rather than having
> > them per-directory or anything like that. Which it now seems was the
> > intended behaviour of GUS all along, so I'm guessing nobody will find
> > this controversial (but speak up if you do).
> >
> > https://www.robotstxt.org/robotstxt.html claims that excluding all files
> > except one from robot access is "currently a bit awkward, as there is no
> > "Allow" field". However, both the old and new RFC drafts clearly
> > mention one. I am not sure exactly what the ground truth is here, in
> > terms of how often Allow is used in the wild or to what extent it is
> > obeyed even by well-intentioned bots. I would be very happy in principle
> > to just declare that Allow lines are valid for Gemini robots.txt files,
> > but if it turns out that popular programming languages have standard
> > library tools for parsing robots.txt which don't choke on gemini:// URLs
> > but don't recognise "Allow", this could quickly lead to unintended
> > consequences, so perhaps it is best to be conservative here.
> >
> > If anybody happens to be familiar with current practice on the web with
> > regard to Allow, please chime in.
> >
> > There is the question of caching. Both RFC drafts for robots.txt make
> > it clear that standard HTTP caching mechanisms apply to robots.txt, but
> > Gemini doesn't have an equivalent and I'm not interested in adding one
> > yet, especially not for the purposes of robots.txt. And yet, obviously,
> > some caching needs to take place. A spider requesting /robots.txt
> > again and again for every document at a host is generating a lot of
> > needless traffic. The 1996 RFC recommends "If no cache-control
> > directives are present robots should default to an expiry of 7 days",
> > while the 2019 one says "Crawlers SHOULD NOT use the cached version for
> > more than 24 hours, unless the robots.txt is unreachable". My gut tells
> > me most Gemini robots.txt files will change very infrequently and 7 days
> > is more appropriate than 24 hours, but I'm happy for us to discuss this.
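[A crawler could honour a 7-day expiry with a simple per-host cache; a rough sketch, where fetch_robots is a hypothetical callable that actually retrieves the file over Gemini:]

```python
import time

# 7-day expiry, as suggested by the 1996 draft RFC.
ROBOTS_TTL = 7 * 24 * 60 * 60  # seconds

_cache = {}  # host -> (fetched_at, robots.txt body)

def get_robots(host, fetch_robots):
    """Return the cached robots.txt for host, refetching only
    when the cached copy is older than ROBOTS_TTL."""
    now = time.time()
    cached = _cache.get(host)
    if cached and now - cached[0] < ROBOTS_TTL:
        return cached[1]
    body = fetch_robots(host)
    _cache[host] = (now, body)
    return body
```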
> >
> > The biggest question, in my mind, is what to do about user-agents, which
> > Gemini lacks (by design, as they are a component of the browser
> > fingerprinting problem, and because they encourage content developers to
> > serve browser-specific content which is a bad thing IMHO). The 2019 RFC
> > says "The product token SHOULD be part of the identification string that
> > the crawler sends to the service" (where "product token" is a bizarre and
> > disappointingly commercial alternative term for "user-agent" in
> > this document), so the fact that Gemini doesn't send one is not
> > technically a violation.
> >
> > Of course, a robot doesn't need to send its user-agent in order to
> > know its user-agent and interpret robots.txt accordingly. But it's
> > much harder for Gemini server admins than their web counterparts to know
> > exactly which bot is engaging in undesired behaviour and how to address
> > it. Currently, the only thing that seems achievable in Gemini is to use
> > the wildcard user-agent "*" to allow/disallow access by all bots to
> > particular resources.
> >
> > But not all bots are equal. I'm willing to bet there are people using
> > Gemini who are perfectly happy with e.g. the GUS search engine spider
> > crawling their site to make it searchable via a service which is offered
> > exclusively within Geminispace, but who are not happy with Gemini to web
> > proxies accessing their content because they are concerned that
> > poorly-written proxies will not disallow Google from crawling them so
> > that Gemini content ends up being searchable within webspace. This is a
> > perfectly reasonable stance to take and I think we should try to
> > facilitate it.
> >
> > With no Gemini-specific changes to the de facto robots.txt spec, this
> > would require admins to either manually maintain a whitelist of
> > Gemini-only search engine spiders in their robots.txt or a blacklist
> > of web proxies. This is easy today when you can count the number of
> > either of these things on one hand, but it does not scale well and is not a
> > reasonable thing to expect admins to do in order to enforce a reasonable
> > stance.
> >
> > (and, really, this isn't a Gemini-specific problem and I'm surprised
> > that what I'm about to propose isn't a thing for the web)
> >
> > I have mentioned previously on this list (quickly, in passing), the idea
> > of "meta user-agents" (I didn't use that term when I first mentioned
> > it). But since there is no way for Gemini server admins to learn the
> > user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> > would suffice, surely 10 at most) number of pre-defined user-agents
> > which all bots of a given kind MUST respect (in addition to optionally
> > having their own individual user-agent). A very rough sketch of some
> > possibilities, not meant to be exhaustive or even very good, just to
> > give the flavour:
> >
> > -   A user-agent of "webproxy" which must be respected by all web proxies.
> >     Possibly this could have sub-types for proxies which do and don't
> >     forbid web search engines?
> >
> > -   A user-agent of "search" which must be respected by all search engine
> >     spiders
> >
> > -   A user-agent of "research" for bots which crawl a site without making
> >     specific results of their crawl publicly available (I've thought of
> >     writing something like this to study the growth of Geminispace and the
> >     structure of links between documents)
> >
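[Under this scheme, an admin who welcomes Gemini-only search indexing but not web proxying might write something like the following - the user-agent names are only the illustrative ones sketched above, not settled values:]

```text
# Hypothetical robots.txt using pre-defined meta user-agents
User-agent: webproxy
Disallow: /

User-agent: search
Disallow:

User-agent: research
Disallow: /private/
```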
> > Enumerating actual use cases is probably the wrong way to go about it;
> > rather, we should think of broad classes of behaviour which differ with
> > regard to privacy implications - e.g. bots which don't make the results
> > of their crawling public, bots which make their results public over
> > Gemini only, bots which breach the Gemini-web barrier, etc.
> >
> > Do people think this is a good idea?
> >
> > Can anybody think of other things to consider in adapting robots.txt to
> > Gemini?
> >
> > Cheers,
> > Solderpunk
> >
> 
> 
> 

