Crawlers on Gemini and best practices
Philip Linde
linde.philip at gmail.com
Thu Dec 10 14:20:35 GMT 2020
On Thu, 10 Dec 2020 14:43:11 +0100
Stephane Bortzmeyer <stephane at sources.org> wrote:
> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
>
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
>
> The intended semantics is probably to disallow archivers but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?
According to the spec, lines beginning with "User-agent:" indicate a
user agent to which subsequent lines apply.
My interpretation is that your example expresses:
  For any user agent:
    disallow access to /credentials.txt
  For archiver user agents:
    disallow access to /
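To make the difference from a first-match parser concrete, here is a minimal Python sketch of that interpretation. The RULES table is a hand-written, hypothetical rendering of the quoted file, and is_allowed is an illustrative helper name, not anything from the spec. It assumes that both the "*" group and the agent's own group are consulted, and that a Disallow value is matched as a path prefix; the spec may intend something different.

```python
# Hypothetical, hand-written form of the quoted robots.txt: one entry per
# User-agent line, holding the path prefixes disallowed for that agent.
RULES = {
    "*": ["/credentials.txt"],
    "archiver": ["/"],
}


def is_allowed(rules, agent, path):
    """Return False if any group matching `agent` disallows `path`.

    Assumption: the "*" group and the agent's own group are both
    consulted, instead of stopping at whichever comes first in the file.
    """
    for group in ("*", agent):
        for prefix in rules.get(group, []):
            if path.startswith(prefix):  # prefix match is an assumption
                return False
    return True


archiver_ok = is_allowed(RULES, "archiver", "/index.gmi")
indexer_ok = is_allowed(RULES, "indexer", "/index.gmi")
```

Under these assumptions archiver_ok is False and indexer_ok is True, whereas a parser that stops at the star would treat the whole site as open to archivers.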
It is unclear from that specification alone whether the user agent
applies to all Disallow lines after it, or only until the next
User-agent line. The spec refers to the web standard for robot
exclusion. In the web standard, you can think of one or more User-agent
lines followed by one or more Allow/Disallow lines as a single record,
which specifies that all of the listed user agents should follow the
Allow/Disallow rules in that record. For example:
  User-agent: archiver
  User-agent: search engine
  Disallow: /articles
  Disallow: /uploads
  User-agent: something
  User-agent: someother
  Disallow: /whatever
This expresses:
  For user agent archiver or search engine:
    disallow access to /articles,
    disallow access to /uploads
  For user agent something or someother:
    disallow access to /whatever
Refer to: http://www.robotstxt.org/norobots-rfc.txt
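The record grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: parse_records is a made-up name, a User-agent line that follows a rule line is taken to start a new record, and Allow lines are ignored for brevity.

```python
def parse_records(text):
    """Group robots.txt lines into (agents, disallowed_paths) records.

    Assumption: a User-agent line that follows a rule line starts a new
    record, while consecutive User-agent lines share one record.
    """
    records = []
    agents, paths = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if paths:  # a rule was already seen, so the previous record ends
                records.append((agents, paths))
                agents, paths = [], []
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    if agents:
        records.append((agents, paths))
    return records


EXAMPLE = """\
User-agent: archiver
User-agent: search engine
Disallow: /articles
Disallow: /uploads
User-agent: something
User-agent: someother
Disallow: /whatever
"""

records = parse_records(EXAMPLE)
```

With this input, records holds two entries: one pairing archiver and search engine with /articles and /uploads, and one pairing something and someother with /whatever.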
The way "robots.txt for Gemini" specifies this is rather confusing. It
does not indicate exactly how it differs from the web robot exclusion
standard, and taken alone it is ambiguous.
--
Philip