robots.txt for Gemini
Sean Conner
sean at conman.org
Tue Mar 24 21:35:08 GMT 2020
It was thus said that the Great solderpunk once stated:
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.
Two possible solutions for robot identification:
1) Allow IP addresses to be used where a user-agent would be specified.
Some examples:
User-agent: 172.16.89.3
User-agent: 172.17.24.0/27
User-agent: fde7:a680:47d3::/48
Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
a range of IP addresses.  For a robot, if your IP address matches an address
(or falls within a range), then the rules that follow apply to you.
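A quick sketch of how a robot might check itself against such an entry
(Python, using the standard ipaddress module; the function name and the idea
of pulling the value straight off a User-agent: line are just for
illustration):

  import ipaddress

  def rule_applies(robot_ip, user_agent_value):
      # True if this robot's address falls inside the address or CIDR
      # range given on a User-agent: line in robots.txt.
      try:
          network = ipaddress.ip_network(user_agent_value, strict=False)
      except ValueError:
          return False     # not an IP/CIDR entry (e.g. a plain name)
      return ipaddress.ip_address(robot_ip) in network

  # rule_applies("172.17.24.5", "172.17.24.0/27")  -> True
  # rule_applies("172.16.89.3", "172.16.89.3")     -> True (single address)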
2) Use the fragment portion of a URL to designate a robot. The fragment
portion of a URL has no meaning for a server (it does for a client). A
robot could use this fact to slip its identifier into the request.
The server MUST NOT use this information, but the logs could show it. For
example, a robot could request:
gemini://example.com/robots.txt#GUS
A review of the logs would reveal that GUS is a robot, and the text "GUS"
could be placed in the User-agent: field to control it.  The fragment SHOULD
be the same text the robot recognizes for itself in robots.txt.  One
clarification: this:
gemini://example.com/robots.txt#foo%20bot
would be
User-agent: foo bot
but a robot ID SHOULD NOT contain spaces---it SHOULD be one word.
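A quick sketch of the robot side (Python again; the robot name "GUS" and the
example URL are from above, and exactly how a robot builds its request is
just an assumption):

  from urllib.parse import quote

  ROBOT_ID = "GUS"   # whatever name the robot answers to in robots.txt
  request_url = "gemini://example.com/robots.txt" + "#" + quote(ROBOT_ID)
  # -> gemini://example.com/robots.txt#GUS

  # An ID with a space would come out percent-encoded:
  #   quote("foo bot") -> 'foo%20bot'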
Anyway, those are my ideas.
-spc