Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)
Nick Thomas
gemini at ur.gs
Wed Nov 25 11:32:37 GMT 2020
(Received off-list, but I assume it was *meant* for the list, so
replying there)
On Wed, 2020-11-25 at 00:36 -0500, John Cowan wrote:
>
> I understand "user privacy" to mean the privacy of people using
> clients. What privacy do server operators expect to have, unless they
> are using client certs, firewalls, or other such blockers? Barring
> those, they are serving content to all the world.
Yes, I've got a p.s. somewhere on the list around this potential
objection.
I don't think that server operators (perhaps better: "capsule authors")
have been explicitly ruled in when talking about user privacy in gemini
so far; but I also don't think they've been explicitly ruled out - it
just hadn't been a live issue until the first archiver showed up and
(presumably in response to that) the robots.txt spec was published.
I don't find it a stretch at all to see capsule authors as gemini
users, but if we were to end up excluding them from the category for
some reason, my proposal certainly looks a lot less interesting.
Whatever the outcome of the opt-in vs opt-out part of this discussion,
the robots.txt spec gives weight to the expressed preferences of
capsule authors. Crawler authors are being asked to respect those
preferences, and one of the possible motivations for that is a
recognition that the privacy of capsule authors is harmed by not
respecting their preferences.
Saying "I want to be in search indexes but not archives" is likely to
be motivated by privacy concerns, and an explicit robots.txt is one way
that I, as a capsule author, can expect to have privacy from archives.
If it's true for people with an explicit preference, it can also be
true for people who haven't expressed a preference yet. Since Gemini
has a higher standard for user privacy than the web, it can also have a
higher standard for these preferences - one that does not rely on
presumed consent - if we want it to.
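To make the two cases concrete, here is a rough sketch of how a crawler might decide. It is only an illustration under some assumptions: it uses Python's urllib.robotparser (written for web robots.txt conventions) as a stand-in parser, the may_fetch helper and its presume_consent flag are hypothetical names of mine, and "indexer" is an assumed user-agent name alongside the "archiver" one from the companion spec.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a capsule that wants to stay out of archives
# but has no objection to search indexes.
EXPLICIT = """\
User-agent: archiver
Disallow: /
"""


def may_fetch(robots_txt: Optional[str], agent: str, path: str,
              presume_consent: bool = False) -> bool:
    """Return True if `agent` may fetch `path`.

    robots_txt is None when the capsule serves no robots.txt at all.
    presume_consent=False models the opt-in idea: no expressed
    preference is treated as disallow-all (an assumption of this sketch).
    """
    if robots_txt is None:
        return presume_consent
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Explicit preference: archivers are refused, other crawlers are not.
print(may_fetch(EXPLICIT, "archiver", "/gemlog/"))
print(may_fetch(EXPLICIT, "indexer", "/gemlog/"))
# No robots.txt: the answer depends entirely on the chosen default.
print(may_fetch(None, "archiver", "/gemlog/"))
print(may_fetch(None, "archiver", "/gemlog/", presume_consent=True))
```

The only difference between opt-in and opt-out in this sketch is the value returned when no robots.txt exists, which is the whole of what is being argued about.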
> > As I understand it, archive.org does respect robots.txt in general,
>
> Not since 2018. See <
> https://help.archive.org/hc/en-us/articles/360004651732-Using-The-Wayback-Machine>,
> which was updated 5 days ago and says:
The FAQ immediately above the one you quoted reads:
> Why isn't the site I'm looking for in the archive?
> Some sites may not be included because the automated crawlers were
> unaware of their existence at the time of the crawl. It's also
> possible that some sites were not archived because they were
> password protected, blocked by robots.txt, or otherwise inaccessible
> to our automated systems. Site owners might have also requested that
> their sites be excluded from the Wayback Machine.
If archive.org didn't respect robots.txt at all, it would lend a lot of
flavour to the "archiver" virtual user-agent idea in the companion
spec, in addition to this discussion. Do you still have doubts after
reading this section?
/Nick