Crawlers on Gemini and best practices
Sudipto Mallick
smallick.dev at gmail.com
Fri Dec 11 12:57:28 GMT 2020
what i wrote was a rough algorithm; now here is a human-readable
description for bots.txt:
every line has the following format:
<paths> <bots> ("allowed" | "denied")
OR
<paths> <bots> "allowed" "but" <bots> "denied"
<paths> is a comma-separated list of paths to be allowed or denied
<bots> is a comma-separated list of bot ''descriptors'' (think of a better
word for this) matching [A-Za-z][A-Za-z0-9_-]*
* paths must not contain commas or spaces; percent-encode them if they do
* a word in "quotes" is written literally, without the quotes
* every field is separated by whitespace, i.e. [ \t]+
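to make the format concrete, here is a minimal parsing sketch in Python;
the Rule structure, the function name and returning None for a malformed
line are my own assumptions, not part of the proposal:

import re
from typing import List, NamedTuple, Optional

BOT = re.compile(r'[A-Za-z][A-Za-z0-9_-]*$')   # descriptor syntax from above

class Rule(NamedTuple):
    paths: List[str]    # percent-encoded paths this line applies to
    allowed: List[str]  # descriptors allowed on these paths
    denied: List[str]   # descriptors denied on these paths

def parse_line(line: str) -> Optional[Rule]:
    """parse one bots.txt line; return None for a malformed line
    (assumption: a real bot would probably just skip such lines)."""
    f = re.split(r'[ \t]+', line.strip())
    if len(f) == 3 and f[2] == 'allowed':
        rule = Rule(f[0].split(','), f[1].split(','), [])
    elif len(f) == 3 and f[2] == 'denied':
        rule = Rule(f[0].split(','), [], f[1].split(','))
    elif len(f) == 6 and (f[2], f[3], f[5]) == ('allowed', 'but', 'denied'):
        rule = Rule(f[0].split(','), f[1].split(','), f[4].split(','))
    else:
        return None
    if not all(BOT.match(b) for b in rule.allowed + rule.denied):
        return None
    return rule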
an ideal bot builds a set of allowed paths and a set of denied paths for
itself from the rules that name its real user agent, its virtual user
agent, and the "all" group.
before requesting a path, this ideal bot finds the longest match for that
path in both the allowed and the denied set. if the longest match comes
from the allowed set, it proceeds to request the path. if both sets contain
a longest match of the same length, it follows the rule with the most
specific "descriptor" (name of bot > virtual agent > "all"). a sketch of
this decision procedure follows.
for example:
/a,/p all denied
/a/b,/p/q indexer,researcher allowed
/a/b/c researcher denied
/a/b/c heh allowed
now the researcher 'heh' may access /p/q/* and /a/b/c, and it may not
access /a/b/{X} when {X} != 'c'.
other researchers may only access /p/q and /a/b/{Z} when {Z} != 'c', so
they may not access /a/b/c.
indexers may access /a/b and /p/q.
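as a quick check, running the four example lines through the sketches above
reproduces the unambiguous cases; the concrete bot names 'xyz' and
'indexer-1' and the test paths are made up for the illustration:

sample = """/a,/p all denied
/a/b,/p/q indexer,researcher allowed
/a/b/c researcher denied
/a/b/c heh allowed"""

rules = [r for r in (parse_line(l) for l in sample.splitlines()) if r]

heh   = build_sets(rules, 'heh', 'researcher')
other = build_sets(rules, 'xyz', 'researcher')     # some other researcher
idx   = build_sets(rules, 'indexer-1', 'indexer')  # some indexer

print(may_request(*heh, '/a/b/c'))    # True:  heh's own rule beats the researcher deny
print(may_request(*heh, '/p/q/x'))    # True:  longest match is the allowed /p/q
print(may_request(*other, '/a/b/c'))  # False: /a/b/c is denied for researchers
print(may_request(*idx, '/a/b'))      # True:  /a/b is allowed for indexers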
* real bots would try to behave as closely as possible to the ideal bot
Q. do we need to support comments in bots.txt?