robots.txt for Gemini formalised

Philip Linde linde.philip at gmail.com
Wed Nov 25 18:15:37 GMT 2020


On Tue, 24 Nov 2020 16:16:49 +0100
marc <marcx2 at welz.org.za> wrote:

> Note that the apache people worry about just doing a
> stat() for .htaccess along a path. This proposal requires an
> opendir() for *every* directory in the exported hierarchy.

Apache is designed to serve large enterprises with heavy request
loads. That concern seems unlikely to apply to multi-user Gemini
hosts.

> I concede that this isn't impossible - it is potentially expensive,
> messy or nonstandard (and yes, there are inotify tricks or
> serving the entire site out of a database, but that isn't a
> common thing).

It's very much a matter of implementation. If high performance is a
concern, you can, for example, regenerate the information once per
minute rather than on every request, or regenerate it on demand when
a user asks for it via a Gemini endpoint.
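
As a rough sketch of that once-per-minute variant (the document root
and the ".norobots" marker file are my own assumptions, not from any
spec):

    #!/usr/bin/env python3
    # Sketch: walk the exported hierarchy in the background and write
    # a merged robots.txt, so the request path never needs opendir().
    # Assumptions (mine, not from any spec): the document root is
    # /srv/gemini and a directory opts out of crawling by containing
    # an empty marker file named ".norobots".

    import os
    import time

    ROOT = "/srv/gemini"
    OUT = os.path.join(ROOT, "robots.txt")

    def regenerate():
        rules = ["User-agent: *"]
        for dirpath, dirnames, filenames in os.walk(ROOT):
            if ".norobots" in filenames:
                rel = os.path.relpath(dirpath, ROOT)
                path = "/" if rel == "." else "/" + rel + "/"
                rules.append("Disallow: " + path)
                dirnames[:] = []  # subtree covered; don't descend
        tmp = OUT + ".tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(rules) + "\n")
        os.replace(tmp, OUT)  # atomic; readers never see a partial file

    if __name__ == "__main__":
        while True:
            regenerate()
            time.sleep(60)

A real server would hook this into its own event loop, but the point
stands: the cost is amortized away from the request path.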

That is, however, a good argument for an Allow directive to
complement Disallow: a host can disallow by default and explicitly
allow only resources lower down in the hierarchy. That "better safe
than sorry" approach "prevents" a crawler from picking up resources
before the new robot rules have been picked up.
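
For illustration, such a default-deny policy could be as simple as
(the /public/ path is hypothetical):

    User-agent: *
    Disallow: /
    Allow: /public/

Anything outside /public/ then stays off-limits even if a crawler
arrives before a newly published directory has been accounted for.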

> So I think this is the interesting bit of the discussion -
> the tradeoff of keeping this information inside the file or
> in a sidechannel. You are of course correct that not every
> file format permits embedding such information, and that
> is the one side of the tradeoff.... the other side is
> the argument for persistence - having the data in another
> file (or in a protocol header) means that is likely to be
> lost.

What you're proposing is doubly effective in that data that isn't there
*can't* be lost! :)

I appreciate your point, but "not every file format" is an
understatement. It's really only one file format that is controlled by
the Gemini spec right now: text/gemini. That's where we could add such
information and define it to be meaningful.
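
If we did, it would presumably be a new line type; purely as a
hypothetical illustration, something like:

    %robots: noindex

No such syntax exists in the spec today; the point is only that
text/gemini is the one format where we could define it.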

> And my view is that caching/archiving/aggregating/protocol
> translation all involve making copies, where a careless or
> inconsiderate intermediate is likely to discard information
> not embedded in the file.

A careless or inconsiderate intermediate is likely to discard
information, full stop. It's only respectful and considerate robots
that will recognize either approach.

> For instance, if a web frontend
> serves gemini://example.org/private.gmi as
> https://example.com/gemini/example.org/private.gmi
> how good are the odds that this frontend fetches
> gemini://example.org/robots.txt, rewrites the urls in there
> from /private.gmi to /gemini/example.org/private.gmi and
> merges it into its own /robots.txt ? And does it before
> any crawler request is made...

On the other hand, how likely is it that a web crawler will interpret
robot instructions from text/gemini-turned-html documents?
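
For embedded instructions to matter there, the frontend would have to
translate them into the web's own conventions, e.g. by emitting

    <meta name="robots" content="noindex, nofollow">

in the generated HTML. That is exactly the kind of considerate
bookkeeping a careless proxy won't do, for either approach.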

> A pragmatist's argument: The web and geminispace are a graph
> of links, and all the interior nodes have to be markup, so those
> are covered, and they control the reachability - without
> a link you can't get to the terminal/leaf node. And even if
> this is bypassed (robots.txt isn't really a defence against hotlinking
> either) most other terminal nodes are images or video, which typically have
> ways of adding meta information (exif, etc).

Do you propose to standardize extensions to Exif, ID3, Vorbis
comments, PDF metadata and so on, as well as text/gemini? None of
these currently has a standard way to specify a robots policy; it
seems understood that whether a crawler may download a file is not a
concern of the file itself, but of the crawlable graph it is served
over.

Hotlinking is a different concern altogether. The purpose of robots.txt
is not to disallow hotlinking.

-- 
Philip