Crawlers on Gemini and best practices
Sudipto Mallick
smallick.dev at gmail.com
Thu Dec 10 18:07:34 GMT 2020
On 12/10/20, Stephane Bortzmeyer <stephane at sources.org> wrote:
> Opinion: may be we should specify a syntax for Gemini's robots.txt,
> not relying on the broken Web one?
Here it is:
'bots.txt' for Gemini bots and crawlers.
- know who you are: archiver, indexer, feed-reader, researcher etc.
- ask for /bots.txt (a minimal fetch sketch follows after this list)
- if the response is 20 text/plain then
-- allowed = set()
-- denied = set()
-- split the response by newlines; for each line:
--- split the line by spaces and tabs into fields
---- paths = fields[0] split by ','
---- if fields[2] is "allowed" then
----- if you are in fields[1] split by ',' then allowed = allowed union paths
----- if fields[3] is "but" and fields[5] is "denied" and you are in
      fields[4] split by ',' then denied = denied union paths
---- if fields[2] is "denied" and you are in fields[1] split by ',' then
     denied = denied union paths
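
A minimal Python sketch of the "ask for /bots.txt" step, assuming the
usual Gemini handshake (TLS to port 1965, request is the full URL plus
CRLF, response begins with a "<status> <meta>" header line); certificate
verification is simply switched off here, a real bot would do TOFU:

import socket, ssl

def fetch_bots_txt(host, port=1965):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE   # no TOFU here, just a sketch
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(("gemini://%s/bots.txt\r\n" % host).encode())
            raw = b""
            while True:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                raw += chunk
    header, _, body = raw.partition(b"\r\n")
    status, _, meta = header.decode().partition(" ")
    if status == "20" and meta.strip().startswith("text/plain"):
        return body.decode()
    return None   # no usable bots.txt
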
you always match the class "all" and never match "none"
union of paths is special: a path that already has a prefix in the
set is absorbed by it:
{ "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
when you request a path, find the longest matching prefix among
allowed and among denied; if the longest match is in allowed you are
allowed, otherwise not; on a tie: undefined behaviour, do what you want.
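
And a rough Python sketch of the parsing and matching rules above; the
class name you pass in ("indexer", "archiver", ...) is whatever you
decided you are, matching is plain string-prefix matching, the
"but ... denied" tail is checked even when you did not match the
allowed classes (that is how the complex example below reads), and the
tie case left open above is resolved here as denied:

def path_union(paths, more):
    # special union: a path whose prefix is already in the set is absorbed,
    # { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
    merged = set(paths) | set(more)
    return {p for p in merged
            if not any(q != p and p.startswith(q) for q in merged)}

def parse_bots_txt(text, you):
    def matches(names):
        bots = names.split(",")
        return you in bots or "all" in bots   # "none" never matches anyone

    allowed, denied = set(), set()
    for line in text.splitlines():
        fields = line.split()              # split by spaces and tabs
        if len(fields) < 3:
            continue
        paths = fields[0].split(",")
        if fields[2] == "allowed":
            if matches(fields[1]):
                allowed = path_union(allowed, paths)
            # optional "but <classes> denied" tail on the same line
            if (len(fields) >= 6 and fields[3] == "but"
                    and fields[5] == "denied" and matches(fields[4])):
                denied = path_union(denied, paths)
        elif fields[2] == "denied" and matches(fields[1]):
            denied = path_union(denied, paths)
    return allowed, denied

def is_allowed(path, allowed, denied):
    # longest matching prefix wins; a tie (or no match at all) means denied
    best_a = max((len(p) for p in allowed if path.startswith(p)), default=0)
    best_d = max((len(p) for p in denied if path.startswith(p)), default=0)
    return best_a > best_d
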
examples:
default, effectively:
/ all allowed
or
/ none denied
complex example:
/priv1,/priv2,/login all denied
/cgi-bin indexer allowed but archiver denied
/priv1/pub researcher allowed but blabla,meh,heh,duh denied
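
Running the complex example through that sketch, with some made-up
request paths:

text = ("/priv1,/priv2,/login all denied\n"
        "/cgi-bin indexer allowed but archiver denied\n"
        "/priv1/pub researcher allowed but blabla,meh,heh,duh denied\n")

a, d = parse_bots_txt(text, "indexer")
print(is_allowed("/cgi-bin/search", a, d))    # True
a, d = parse_bots_txt(text, "archiver")
print(is_allowed("/cgi-bin/search", a, d))    # False
a, d = parse_bots_txt(text, "researcher")
print(is_allowed("/priv1/pub/data", a, d))    # True, /priv1/pub beats /priv1
print(is_allowed("/priv1/secret", a, d))      # False

(Under the "otherwise not" rule, a path with no match at all comes out
denied, so a real bots.txt would presumably also carry the default
"/ all allowed" line.)
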
what do you think?