robots.txt for Gemini formalised
Sean Conner
sean at conman.org
Sun Nov 22 22:59:42 GMT 2020
It was thus said that the Great Solderpunk once stated:
> Hi folks,
>
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace. Please see:
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi
Nice.
> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
>
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.
Right now, there are two things I would change.
1. Add "allow".  While the initial spec [1] did not have an allow
   rule, a subsequent draft proposal [2] did, and Google is pushing
   (as of 2019) to have it adopted as an RFC [3].  (A sketch of how
   Allow and Disallow combine follows this list.)
2. I would specify virtual agents as:

     Virtual-agent: archiver
     Virtual-agent: indexer

   This makes it easier to add new virtual agents, separates the
   namespace of agents from the namespace of virtual agents, and is
   allowed by all current and proposed standards [4].
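To make the allow handling concrete, here's a rough sketch (Python;
the function and data layout are mine, not from any spec) of how a
crawler could reconcile Allow and Disallow lines.  The 1996 draft [2]
uses the first matching line in the record; Google's version [3] uses
the most specific (longest) matching path, with Allow winning a tie.
The sketch does the latter, but don't take it as gospel:

# Rough sketch only -- resolve Allow/Disallow by longest matching
# path, with Allow winning ties (Google's rules as I read them).
# No wildcard handling, and the rule representation is made up
# purely for illustration.

def is_allowed(rules, path):
    # rules: list of ("allow" | "disallow", path-prefix) tuples taken
    # from the block of rules that applies to this crawler.
    best_len  = -1
    best_kind = "allow"            # no match at all means allowed
    for kind, prefix in rules:
        if path.startswith(prefix) and len(prefix) >= best_len:
            if len(prefix) > best_len or kind == "allow":
                best_len  = len(prefix)
                best_kind = kind
    return best_kind == "allow"

# With the sample block further down in this message:
#
#   rules = [("allow", "/private/butpublic"), ("disallow", "/private")]
#   is_allowed(rules, "/private/butpublic/foo")  => True
#   is_allowed(rules, "/private/secret")         => False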
The rule I would follow is:

Definitions:

  a specific user agent is one that is not '*'
  a specific virtual agent is one that is not '*'
  a generic user agent is one given as '*'
  a generic virtual agent is one given as '*'

A crawler should use a block of rules if, checking in this order:

  it finds a specific user agent (most targeted),
  or it finds a specific virtual agent,
  or it finds a generic virtual agent,
  or it finds a generic user agent (least targeted).
I'm wavering on the generic virtual agent bit, so if you think
that makes this too complicated, fine, I think it can go.
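In code, the selection above would look something like this (again,
just a sketch -- how the blocks get parsed, and the names on the
parsed record, are made up here):

# Rough sketch of the block-selection precedence described above.
# "blocks" is a list of parsed robots.txt records; each record is
# assumed to carry the sets of User-agent and Virtual-agent names
# it was declared under.  The layout is made up for illustration.

def select_block(blocks, my_name, my_virtual):
    # e.g. my_name = "GUS", my_virtual = "indexer"
    def first(pred):
        for block in blocks:
            if pred(block):
                return block
        return None

    return (first(lambda b: my_name    in b.user_agents)     # specific user agent
         or first(lambda b: my_virtual in b.virtual_agents)  # specific virtual agent
         or first(lambda b: "*"        in b.virtual_agents)  # generic virtual agent
         or first(lambda b: "*"        in b.user_agents))    # generic user agent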
> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either. I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.
The Wikipedia page [5] lists a non-standard extension, "Crawl-delay",
which tells a crawler how long to wait between requests.  It would be
easy enough to add a field saying how often to re-fetch a resource.  A
sample file:
# The GUS agent, plus any agent that identifies as an "indexer", is
# allowed one path in an otherwise disallowed place, and should wait
# 10 seconds between fetches.
User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10
# Agents that fetch feeds should only check every 6 hours.  "Check" is
# allowed here because agents should ignore fields they don't understand.
Virtual-agent: feed
Disallow: /private
Check: 21600
# And a fallback.  Here we don't allow any old crawler into the private
# space, and we force them to wait 20 seconds between fetches.
User-agent: *
Disallow: /private
Crawl-delay: 20
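For what it's worth, a crawler honouring those fields might do
something like the following (a sketch only -- the parsing is
deliberately dumb, and everything beyond the field names in the
sample above is made up):

# Rough sketch of reading one block of fields and honouring
# Crawl-delay.  The important bit is that unknown fields (such as
# "Check" for a crawler that isn't a feed fetcher) are silently
# skipped, which is what makes adding new fields safe.

import time

KNOWN_FIELDS = {"user-agent", "virtual-agent",
                "allow", "disallow", "crawl-delay"}

def parse_fields(lines):
    fields = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # strip comments
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.strip().lower()
        if name in KNOWN_FIELDS:
            fields.setdefault(name, []).append(value.strip())
        # anything else ("Check", future extensions) is ignored
    return fields

def crawl(urls, fields, fetch):
    delay = float(fields.get("crawl-delay", ["0"])[0])
    for url in urls:
        fetch(url)
        time.sleep(delay)   # wait Crawl-delay seconds between requests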
-spc
[1] gemini://gemini.circumlunar.space/docs/companion/robots.gmi
[2] http://www.robotstxt.org/norobots-rfc.txt
[3] https://developers.google.com/search/reference/robots_txt
[4] Any field not understood by a crawler should be ignored.
[5] https://en.wikipedia.org/wiki/Robots_exclusion_standard