Crawlers on Gemini and best practices

Robert "khuxkm" Miles khuxkm at tilde.team
Thu Dec 10 21:44:50 GMT 2020


December 10, 2020 8:43 AM, "Stephane Bortzmeyer" <stephane at sources.org> wrote:

> - snip -
> 
> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
> 
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
> 
> The intended semantics is probably to disallow archivers but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?

Not you. The idea is that you start with the most specific User-Agent section that applies to you (in this case, `archiver`); if that doesn't say you can't access the file, you go up a level (in this case, `*`); and if *that* doesn't say you can't access the file, you go up another level, and so on. If you run out of levels without ever being told you can't access the file (or a parent directory of it), then go ahead and access it. In this case, a web proxy has no specific rules, so it would follow the `*` rule and not serve credentials.txt. An archiver, however, *does* have specific rules: don't access anything, so the archiver doesn't access anything.
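
To make that concrete, here is a rough sketch in Python of the lookup order I mean. The names (`parse_robots`, `may_fetch`) are made up for illustration, it only knows exact-name sections plus `*`, and it matches paths by plain prefix, so take it as a sketch rather than a reference implementation:

```
def parse_robots(text):
    """Collect the Disallow prefixes listed under each User-agent section."""
    rules = {}       # user-agent (lowercased) -> list of disallowed path prefixes
    agents = []      # user-agents named by the section currently being read
    in_rules = False # True once this section has started listing Disallow lines
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after Disallows opens a new section
                agents, in_rules = [], False
            agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:     # an empty Disallow forbids nothing
                for agent in agents:
                    rules[agent].append(value)
    return rules


def may_fetch(rules, agent, path):
    """Check your own section first, then '*'; fetch only if nobody says no."""
    for section in (agent.lower(), "*"):   # "go up a level" until you run out
        for prefix in rules.get(section, []):
            if path.startswith(prefix):
                return False               # some applicable section disallows it
    return True                            # never told no, so go ahead


rules = parse_robots(
    "User-agent: *\n"
    "Disallow: /credentials.txt\n"
    "User-agent: archiver\n"
    "Disallow: /\n"
)
assert not may_fetch(rules, "archiver", "/anything.gmi")      # its own section says no
assert not may_fetch(rules, "someproxy", "/credentials.txt")  # caught by the * section
assert may_fetch(rules, "someproxy", "/index.gmi")            # nobody objects
```

The point is simply that coming first in the file doesn't give the `*` section any priority: a crawler still has to check the section that names it before deciding it's allowed in.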

An example in "A Standard For Robot Exclusion" (as archived by robotstxt.org[1]) uses an ordering like this and insinuates that the intent is to (in this case) allow `cybermapper` to access everything but block anyone else from accessing the cyberworld application:

```
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
```

Just my 2 cents,
Robert "khuxkm" Miles

