Crawlers on Gemini and best practices

Stephane Bortzmeyer stephane at sources.org
Thu Dec 10 13:43:11 GMT 2020

On Tue, Dec 08, 2020 at 09:47:57AM -0500,
 Natalie Pendragon <natpen at natpen.net> wrote 
 a message of 32 lines which said:

> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints. There is even a companion spec for doing so,
> which accompanies the main Gemini spec.
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

The spec is quite vague about the *order* of directives. For instance,
<gemini://gempaper.strangled.net/robots.txt> is:

User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /

The intended semantics are probably to disallow archivers, but my
parser regarded the site as available because it stopped at the first
match, the star. Who is right?

<gemini://gemini.circumlunar.space/docs/companion/robots.gmi> and
<http://www.robotstxt.org/robotstxt.html> are both unclear on this
point.
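
To make the two readings concrete, here is a minimal Python sketch. It
is not taken from any existing crawler: the function names
(parse_groups, allowed_first_match, allowed_specific_first) are
hypothetical, and the "specific group beats the star" rule is only one
possible interpretation, not something either document states.

```python
# The robots.txt quoted above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /
"""


def parse_groups(text):
    """Return a list of (agents, disallowed_prefixes) groups in file order."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def blocked(rules, path):
    """True if any non-empty Disallow prefix matches the path."""
    return any(prefix and path.startswith(prefix) for prefix in rules)


def allowed_first_match(groups, agent, path):
    """Reading 1: the first group naming the agent or '*' decides."""
    for agents, rules in groups:
        if agent in agents or "*" in agents:
            return not blocked(rules, path)
    return True


def allowed_specific_first(groups, agent, path):
    """Reading 2 (assumption): a group naming the agent beats '*'."""
    for wanted in (agent, "*"):
        for agents, rules in groups:
            if wanted in agents:
                return not blocked(rules, path)
    return True


groups = parse_groups(ROBOTS_TXT)
print(allowed_first_match(groups, "archiver", "/index.gmi"))
print(allowed_specific_first(groups, "archiver", "/index.gmi"))
```

With the file above, the first reading lets an archiver fetch
/index.gmi because the star group is matched first, while the second
reading refuses it because the archiver group takes precedence
whatever its position in the file.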
