robots.txt for Gemini
Sean Conner
sean at conman.org
Tue Mar 24 21:35:08 GMT 2020
It was thus said that the Great solderpunk once stated:
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.
Two possible solutions for robot identification:
1) Allow IP addresses to be used where a user-agent would be specified.
Some examples:
User-agent: 172.16.89.3
User-agent: 172.17.24.0/27
User-agent: fde7:a680:47d3::/48
Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
a range of IP addresses.  For a robot, if your IP address matches an address
(or falls within a range), then the rules that follow apply to you.
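A quick sketch of how a robot might check itself against such an entry
(Python, using the standard ipaddress module; the function name and the idea
of pulling the value straight off a User-agent: line are just for
illustration):

  import ipaddress

  def rule_applies(robot_ip, user_agent_value):
      # True if this robot's address falls inside the address or CIDR
      # range given on a User-agent: line in robots.txt.
      try:
          network = ipaddress.ip_network(user_agent_value, strict=False)
      except ValueError:
          return False     # not an IP/CIDR entry (e.g. a plain name)
      return ipaddress.ip_address(robot_ip) in network

  # rule_applies("172.17.24.5", "172.17.24.0/27")  -> True
  # rule_applies("172.16.89.3", "172.16.89.3")     -> True (single address)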
2) Use the fragment portion of a URL to designate a robot. The fragment
portion of a URL has no meaning for a server (it does for a client). A
robot could use this fact to slip its identifier into the request.
The server MUST NOT use this information, but the logs could show it. For
example, a robot could request:
gemini://example.com/robots.txt#GUS
A review of the logs would reveal that GUS is a robot, and the text "GUS"
could be placed in the User-agent: field to control it.  The fragment SHOULD
be the same text the robot recognizes for itself in robots.txt.  One
clarification: this:
gemini://example.com/robots.txt#foo%20bot
would be
User-agent: foo bot
but a robot ID SHOULD NOT contain spaces---it SHOULD be one word.
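A quick sketch of the robot side (Python again; the robot name "GUS" and the
example URL are from above, and exactly how a robot builds its request is
just an assumption):

  from urllib.parse import quote

  ROBOT_ID = "GUS"   # whatever name the robot answers to in robots.txt
  request_url = "gemini://example.com/robots.txt" + "#" + quote(ROBOT_ID)
  # -> gemini://example.com/robots.txt#GUS

  # An ID with a space would come out percent-encoded:
  #   quote("foo bot") -> 'foo%20bot'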
Anyway, those are my ideas.
-spc