Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Nick Thomas gemini at ur.gs
Tue Nov 24 20:25:05 GMT 2020


On Tue, 2020-11-24 at 19:08 +0000, Robert "khuxkm" Miles wrote:

> Doesn't that just *feel* like a hack to you?

It definitely feels hackish when worded like this :).

The precise technical form is secondary to the outcome (as I see it) of
protecting users from a privacy-hostile default in the robots.txt
specification. I realise you're currently an advocate for opt-out
rather than opt-in, but I'd still appreciate any ideas you have to make
it nicer *if* Gemini ends up going for opt-in.
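
(For concreteness: by opt-in I mean that the default a crawler sees
would be equivalent to a disallow-all robots.txt, i.e. something like
the following.)

```
User-agent: *
Disallow: /
```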

An alternative form that just came to mind is a server implementation
recommendation like this:

```
Geminispace crawlers use the /robots.txt request path to determine
whether a capsule can be accessed for archival, indexing, research, and
other purposes. This can have privacy implications for the user, so
servers should not start unless they have an explicit signal on how to
handle requests to the /robots.txt path.

For example, this signal may be the availability of any content for the
/robots.txt path, a user-added database entry indicating that the path
should receive a 5x response, or a non-default configuration parameter
specifying that it's OK to skip the check.

If no such signal is present, the server should emit an error message
and either exit immediately, or allow the user to specify how the path
should be handled.
```

As a new server operator with no idea about `robots.txt`, I'd run, say:

```
$ agate [::]:1965 mysite cert.pem key.rsa ur.gs

No robots.txt file present! Please create mysite/robots.txt, or re-run
Agate with --permit-robots to allow your content to be archived,
indexed, or otherwise used by automated crawlers of Geminispace
```

*I'd* think "hang on, I don't want my content to be archived" and go
off to learn about this robots.txt thing; others might shrug and just
add the --permit-robots flag. 
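
The check itself could be tiny. Here's a rough sketch in Rust (not
Agate's actual code; the content-directory handling and the
--permit-robots flag are just placeholders for illustration):

```rust
use std::path::Path;
use std::process;

/// Returns Ok(()) if the operator has given an explicit signal about
/// robots.txt handling, otherwise an error message explaining what to do.
fn robots_signal_present(content_dir: &Path, permit_robots: bool) -> Result<(), String> {
    // Signal 1: an explicit opt-in flag on the command line.
    if permit_robots {
        return Ok(());
    }
    // Signal 2: a robots.txt file the operator created themselves.
    if content_dir.join("robots.txt").is_file() {
        return Ok(());
    }
    Err(format!(
        "No robots.txt file present! Please create {}/robots.txt, or re-run \
         with --permit-robots to allow your content to be archived, indexed, \
         or otherwise used by automated crawlers of Geminispace",
        content_dir.display()
    ))
}

fn main() {
    // Toy argument handling, just enough for the example.
    let permit_robots = std::env::args().any(|arg| arg == "--permit-robots");
    let content_dir = Path::new("mysite");

    if let Err(msg) = robots_signal_present(content_dir, permit_robots) {
        eprintln!("{}", msg);
        process::exit(1);
    }

    // ...normal server startup would continue here...
    println!("robots.txt handling confirmed, starting server");
}
```

A user-added database entry or a non-default configuration parameter,
as in the recommendation above, could feed into the same check in place
of the flag.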

>  Of the 362 hosts known to GUS, only 36 have a robots.txt file, so
> any choice made as to what the default robots.txt should be will
> affect around 90% of Geminispace 

Thanks for running the numbers on this; I agree with everything you
concluded from them. It's especially worth emphasising that any change
affects such a large proportion of existing Geminispace.

/Nick


