robots.txt for Gemini
solderpunk
solderpunk at SDF.ORG
Sun Mar 22 18:13:33 GMT 2020
Howdy all,
As the first and perhaps most important push toward getting some clear
guidelines in place for well-behaved non-human Gemini clients (e.g.
Gemini to web proxies, search engine spiders, feed aggregators, etc.),
let's get to work on adapting robots.txt to Gemini.
My current thinking is that this doesn't belong in the Gemini spec
itself, much like robots.txt does not belong in the HTTP spec. That
said, this feels like it warrants something more than just being put in
the Best Practices doc. Maybe we need to start working on official
"side specs", too. Not sure what these should be called.
Anyway, I've refamiliarised myself with robots.txt. Turns out it is
still only a de facto standard without an official RFC. My
understanding is based on:
* The https://www.robotstxt.org/ website (which Wikipedia calls the
"Official website" at
https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
clear to me what "official" means for a de facto standard), and in
particular:
* An old draft RFC from 1996 which that site hosts at
https://www.robotstxt.org/norobots-rfc.txt
* A new draft RFC from 2019 which appears to have gotten further than
the first, considering it is hosted by the IETF at
https://tools.ietf.org/html/draft-rep-wg-topic-00
While the 1996 draft is web-specific, I was pleasantly surprised to
see that the 2019 version is not. Section 2.3 says:
> As per RFC3986 [1], the URI of the robots.txt is:
>
> "scheme:[//authority]/robots.txt"
>
> For example, in the context of HTTP or FTP, the URI is:
>
> http://www.example.com/robots.txt
>
> https://www.example.com/robots.txt
>
> ftp://ftp.example.com/robots.txt
So, why not Gemini too?
Regarding the first practical question which was raised by Sean's recent
post, it seems a no-brainer to me that Gemini should retain the
convention of there being a single /robots.txt URL rather than having
them per-directory or anything like that. Which it now seems was the
intended behaviour of GUS all along, so I'm guessing nobody will find
this controversial (but speak up if you do).
https://www.robotstxt.org/robotstxt.html claims that excluding all files
except one from robot access is "currently a bit awkward, as there is no
"Allow" field". However, both the old and new RFC drafts clearly
mention one. I am not sure exactly what the ground truth is here, in
terms of how often Allow is used in the wild or to what extent it is
obeyed even by well-intentiond bots. I would be very happy in principle
to just declare that Allow lines are valid for Gemini robot.txt files,
but if it turns out that popular programming languages have standard
library tools for parsing robot.txt which don't choke on gemini:// URLs
but don't recognise "Allow", this could quickly lead to unintended
consequences, so perhaps it is best to be conservative here.
If anybody happens to be familiar with current practice on the web with
regard to Allow, please chime in.
There is the question of caching. Both RFC drafts for robots.txt make
it clear that standard HTTP caching mechanisms apply to robots.txt, but
Gemini doesn't have an equivalent and I'm not interested in adding one
yet, especially not for the purposes of robots.txt. And yet, obviously,
*some* caching needs to take place. A spider requesting /robots.txt
again and again for every document at a host is generating a lot of
needless traffic. The 1996 RFC recommends "If no cache-control
directives are present robots should default to an expiry of 7 days",
while the 2019 one says "Crawlers SHOULD NOT use the cached version for
more than 24 hours, unless the robots.txt is unreachable". My gut tells
me most Gemini robots.txt files will change very infrequently and 7 days
is more appropriate than 24 hours, but I'm happy for us to discuss this.
The biggest question, in my mind, is what to do about user-agents, which
Gemini lacks (by design, as they are a component of the browser
fingerprinting problem, and because they encourage content developers to
serve browser-specific content which is a bad thing IMHO). The 2019 RFC
says "The product token SHOULD be part of the identification string that
the crawler sends to the service" (where "product token" is bizarre and
disappointingly commercial alternative terminology for "user-agent" in
this document), so the fact that Gemini doesn't send one is not
technically a violation.
Of course, a robot doesn't need to *send* its user-agent in order to
*know* its user-agent and interpet robots.txt accordingly. But it's
much harder for Gemini server admins than their web counterparts to know
exactly which bot is engaging in undesired behaviour and how to address
it. Currently, the only thing that seems achievable in Gemini is to use
the wildcard user-agent "*" to allow/disallow access by *all* bots to
particular resources.
But not all bots are equal. I'm willing to bet there are people using
Gemini who are perfectly happy with e.g. the GUS search engine spider
crawling their site to make it searchable via a service which is offered
exclusively within Geminispace, but who are not happy with Gemini to web
proxies accessing their content because they are concerned that
poorly-written proxies will not disallow Google from crawling them so
that Gemini content ends up being searchable within webspace. This is a
perfectly reasonable stance to take and I think we should try to
facilitate it.
With no Gemini-specific changes to the de facto robots.txt spec, this
would require admins to either manually maintain a whitelist of
Gemini-only search engine spiders in their robots.txt *or* a blacklist
of web proxies. This is easy today when you can count the number of
either of things on one hand, but it does not scale well and is not a
reasonable thing to expect admins to do in order to enforce a reasonable
stance.
(and, really, this isn't a Gemini-specific problem and I'm surprised
that what I'm about to propose isn't a thing for the web)
I have mentioned previously on this list (quickly, in passing), the idea
of "meta user-agents" (I didn't use that term when I first mentioned
it). But since there is no way for Gemini server admins to learn the
user-agent of arbitrary bots, we could define a small (I'm thinking ~5
would suffice, surely 10 at most) number of pre-defined user-agents
which all bots of a given kind MUST respect (in addition to optionally
having their own individual user-agent). A very rough sketch of some
possibilities, not meant to be exhaustive or even very good, just to
give the flavour:
* A user-agent of "webproxy" which must be respected by all web proxies.
Possibly this could have sub-types for proxies which do and don't
forbid web search engines?
* A user-agent of "search" which must be respected by all search engine
spiders
* A user-agent of "research" for bots which crawl a site without making
specific results of their crawl publically available (I've thought of
writing something like this to study the growth of Geminispace and the
structure of links between documents)
Enumerating actual use cases is probably the wrong way to go about it,
rather we should think of broad classes of behaviour which differ with
regard to privacy implications - e.g. bots which don't make the results
of their crawling public, bots which make their results public over
Gemini only, bots which breach the Gemini-web barrier, etc.
Do people think this is a good idea?
Can anybody think of other things to consider in adapting robots.txt to
Gemini?
Cheers,
Solderpunk
More information about the Gemini
mailing list