Crawlers on Gemini and best practices
Robert "khuxkm" Miles
khuxkm at tilde.team
Thu Dec 10 21:44:50 GMT 2020
December 10, 2020 8:43 AM, "Stephane Bortzmeyer" <stephane at sources.org> wrote:
> - snip -
>
> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
>
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
>
> The intended semantics is probably to disallow archivers but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?
Not you. The idea is that you start with the most specific User-Agent group that applies to you (in this case, `archiver`); if that doesn't say you can't access the file, you go up a level (in this case, `*`), and if *that* doesn't say you can't access the file, you go up another level, and so on. If you run out of levels and haven't been told you can't access the file (or a parent directory of it), go ahead and access it. In this case, a web proxy has no specific rules, so it follows the `*` rule and doesn't serve credentials.txt. An archiver, however, *does* have a specific rule: don't access anything. So the archiver doesn't access anything.
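To make that lookup order concrete, here's a rough Python sketch of the resolution I'm describing (just an illustration, not code from any spec; the `groups`/`levels` shapes are made up for the example):
```
# Rough sketch of the lookup order described above.
# `groups` maps a User-agent token from robots.txt to its list of
# Disallow prefixes; `levels` lists the tokens that apply to us,
# most specific first, ending with the catch-all "*".

def allowed(groups, levels, path):
    """Walk from the most specific applicable group up to "*"; any
    Disallow prefix that matches the path means we must not fetch it."""
    for agent in levels:
        for prefix in groups.get(agent, []):
            if prefix and path.startswith(prefix):
                return False  # told at this level that we can't access it
    return True  # ran out of levels without being disallowed

# The gempaper.strangled.net robots.txt quoted above:
groups = {
    "*": ["/credentials.txt"],
    "archiver": ["/"],
}

# A web proxy has no specific group, so only "*" applies to it:
assert not allowed(groups, ["*"], "/credentials.txt")
assert allowed(groups, ["*"], "/some-page.gmi")

# An archiver checks its own group first, then "*":
assert not allowed(groups, ["archiver", "*"], "/some-page.gmi")
```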
An example in "A Standard For Robot Exclusion" (as archived by robotstxt.org[1]) uses this same ordering and implies that the intent is to allow `cybermapper` to access everything while blocking everyone else from the cyberworld application:
```
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
```
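Plugging that example into the same sketch (and following its intent that `cybermapper`'s own record is the one that governs it), you get:
```
groups = {
    "*": ["/cyberworld/map/"],
    "cybermapper": [""],  # an empty Disallow disallows nothing
}

# Everyone else falls under "*" and stays out of the infinite URL space:
assert not allowed(groups, ["*"], "/cyberworld/map/x1/y2")

# cybermapper's own record has an empty Disallow, so it may go anywhere:
assert allowed(groups, ["cybermapper"], "/cyberworld/map/x1/y2")
```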
Just my 2 cents,
Robert "khuxkm" Miles