robots.txt for Gemini formalised
marc
marcx2 at welz.org.za
Tue Nov 24 10:29:02 GMT 2020
Hi
I suppose I am chipping in a bit too late here, but I think
the robots.txt thing was always a rather ugly mechanism - a
bit of an afterthought.
Consider gemini://example.com/~somebody/personal.gmi -
if ~somebody wishes to exclude personal.gmi from being
crawled, they need write access to example.com/robots.txt.
And how do we make sure that ~somebodyelse, also on
example.com, doesn't overwrite robots.txt with their
own rules?
Then there is the problem of transitivity - if we
have a portal, proxy or archive, how does it relay
the information to its downstream users? See also
the exchange between Sean and Drew...
The way I remember it, robots.txt was a quick hack
to prevent spiders from getting trapped in a maze of
CGI-generated pages and so hammering the server.
It wasn't designed to solve matters of privacy
and redistribution.
I have pitched this idea before: I think a footer containing
the license/rules under which a page can be distributed/cached
is more sensible than robots.txt. This approach is:
* local to the page (no global /robots.txt)
* persistent (survives being copied, mirrored & re-exported)
* sound (one knows the conditions under which this can be redistributed)
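To make that concrete, the idea is that every page carries
its own terms on its last line, something like this (the
exact grammar is of course only a sketch):

  # An example page

  ... body of the page ...

  -- <terms>: <author>

Whoever copies the page copies the terms along with it,
which is what gives the scheme its persistence.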
I speak under correction, but I believe a decent amount of the
public web was mined for faces to train the neural networks
that now make totalitarian surveillance possible. Had those
images been labelled "CC ND (no derivatives)", there would
have been a legal impediment - not to the regimes now, but to
the universities and research labs which pioneered this.
People are now more aware of this problem, and some of us
wish to put up material restricted to gemini-space, and
not have it exported to the web. A footer line "-- GMI: A. User"
could prohibit export to the web, while "-- CC-SA: J. Soap"
would permit it...
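A proxy, portal or archive honouring this would only need to
look at the last line of a page before re-exporting it. A rough
sketch in Python - the two tokens follow my examples above, and
everything else (names, the exact grammar) is hypothetical:

  # Sketch: decide whether a gemtext page may leave gemini-space.
  # Footer grammar assumed: a last line of the form "-- TOKEN: Name".

  WEB_EXPORT_OK = {"CC-SA"}   # terms permitting export; extend as agreed
  GEMINI_ONLY = {"GMI"}       # terms restricting a page to gemini-space

  def may_export_to_web(page_text):
      lines = [l.strip() for l in page_text.splitlines() if l.strip()]
      if not lines or not lines[-1].startswith("--"):
          return False        # no terms stated: stay conservative
      footer = lines[-1][2:].strip()        # e.g. "CC-SA: J. Soap"
      token = footer.split(":", 1)[0].strip()
      if token in GEMINI_ONLY:
          return False
      return token in WEB_EXPORT_OK

  assert may_export_to_web("Some text\n-- GMI: A. User\n") is False
  assert may_export_to_web("Some text\n-- CC-SA: J. Soap\n") is True

The point being that the check is trivial, and travels with the
content rather than living in a global file.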
regards
marc