Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)
Nick Thomas
gemini at ur.gs
Wed Nov 25 17:00:46 GMT 2020
On Wed, 2020-11-25 at 10:10 -0500, John Cowan wrote:
> On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:
> > If it's true for people with an explicit preference, it can also be
> > true for people who haven't expressed a preference yet. Since Gemini
> > has a higher standard for user privacy than the web, it can also have
> > a higher standard for these preferences - one that does not rely on
> > presumed consent - if we want it to.
> >
>
> By this logic, nobody should be able to access a Gemini server at all
> unless the capsule author has expressed a preference for them to do so.
> But to publish is to expose your work to the public.
Browsing, indexing, research crawling, and archiving are all distinct
things with distinct impacts on capsule author privacy. This is why my
opening email proposed that we retain presumed consent for browsing via
a proxy - it's a clear case of "one of these things is not like the
others", and the same is true for individual browsing.
This section was mostly aimed at establishing that capsule authors
should be thought of as Gemini users, so it took some shortcuts with the
presumed-consent verbiage, which might not have been helpful.
For clarity: I think it's fine to presume consent for browsing (whether
through a proxy or not), and not fine to presume consent for archiving.
If adopted, this represents a significant enhancement to capsule author
privacy compared to web norms.
I'm actually fairly on the fence about presumed consent for indexing. I
do think it's more appropriate to forbid it than to permit it, but not
very strongly.
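For concreteness, here's a sketch of how a capsule author could express
that split - purely illustrative, and the virtual user-agent names
("indexer", "archiver") are just the ones being floated in the
formalisation thread, not anything settled. A capsule that's happy to be
indexed but not archived might serve a /robots.txt along these lines:

  # Hypothetical /robots.txt: permit indexing, refuse archiving,
  # and let everything else fall through to disallow-all.
  User-agent: indexer
  Disallow:

  User-agent: archiver
  Disallow: /

  User-agent: *
  Disallow: /

Under the usual robots.txt matching rules, a crawler picks the most
specific group that matches it, so an indexer sees the empty Disallow
(everything permitted), while an archiver - and anything that only
matches "*" - is kept out entirely.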
> > > Why isn't the site I'm looking for in the archive?*
> > > Some sites may not be included because the automated crawlers were
> > > unaware of their existence at the time of the crawl. It's also
> > > possible that some sites were not archived because they were
> > > password protected, blocked by robots.txt, or otherwise
> > > inaccessible to our automated systems. Site owners might have also
> > > requested that their sites be excluded from the Wayback Machine.
>
> I interpret that to mean that some sites were not crawled during the
> period when the Archive was paying attention to robots.txt, and so
> their content as of that date is unavailable. Note the past tense:
> "were [...] protected by robots.txt" as opposed to "are protected".
I don't see any room at all to read it like that, not least because of
the references to "password protected" and "otherwise inaccessible"
content in exactly the same tense.
To me, it's crystal clear that the past tense is used here simply
because the crawl happened in the past.
I do have this blog post from April 2018, referencing archived blogs
from December 2017, where robots.txt being respected is a plot point:
https://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/
The blog.archive.org rant about robots.txt not being suitable for
archivers was from April 2017; that's the one that mentions they may not
respect robots.txt in general in the future. I'd very strongly expect a
further blog post to appear if they start taking steps in that
direction.
It would definitely be interesting if you had an experiment or
reference demonstrating that archive.org ignores robots.txt in general,
but this page simply isn't it.
/Nick