Crawlers on Gemini and best practices
Stephane Bortzmeyer
stephane at sources.org
Thu Dec 10 13:43:11 GMT 2020
On Tue, Dec 08, 2020 at 09:47:57AM -0500,
Natalie Pendragon <natpen at natpen.net> wrote
a message of 32 lines which said:
> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints. There is even a companion spec for doing so,
> which accompanies the main Gemini spec.
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi
The spec is quite vague about the *order* of directives. For instance,
<gemini://gempaper.strangled.net/robots.txt> is:
User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /
The intended semantics are probably to disallow archivers, but my parser
regarded the site as available because it stopped at the first match,
the star. Who is right?
Both <gemini://gemini.circumlunar.space/docs/companion/robots.gmi> and
<http://www.robotstxt.org/robotstxt.html> are unclear on this point.
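To make the ambiguity concrete, here is a minimal Python sketch of the two readings, run against the robots.txt quoted above. The function names (parse_groups, first_match, specific_first), the prefix-matching rule and the test path /index.gmi are my own assumptions for illustration; this is not my actual parser, and neither document mandates either strategy.

```python
# Illustrative sketch only: two possible ways to pick the applicable group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /
"""


def parse_groups(text):
    """Return a list of (user_agents, disallowed_prefixes) in file order."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def is_blocked(rules, path):
    """Assumed rule: a non-empty Disallow value blocks paths it prefixes."""
    return any(prefix and path.startswith(prefix) for prefix in rules)


def first_match(groups, agent, path):
    """Reading 1: stop at the first group naming the agent or '*'."""
    for agents, rules in groups:
        if agent in agents or "*" in agents:
            return is_blocked(rules, path)
    return False


def specific_first(groups, agent, path):
    """Reading 2: prefer a group naming the agent, else fall back to '*'."""
    for wanted in (agent, "*"):
        for agents, rules in groups:
            if wanted in agents:
                return is_blocked(rules, path)
    return False


groups = parse_groups(ROBOTS_TXT)
print(first_match(groups, "archiver", "/index.gmi"))     # False: allowed
print(specific_first(groups, "archiver", "/index.gmi"))  # True: blocked
```

Under the first reading an archiver may fetch everything except /credentials.txt; under the second it is shut out of the whole capsule. The order of the groups in the file only matters for the first reading.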