Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

John Cowan cowan at ccil.org
Wed Nov 25 15:10:44 GMT 2020


On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:

> (Received off-list, but I assume it was *meant* for the list, so
> replying there)
>

It was, so thanks.  My private messages are labeled (Private message) at
the top because I make this mistake a lot.

> Whatever the outcome of the opt-in vs opt-out part of this discussion,
>

That's the only part that concerns me.  A robots.txt spec is good and
crawlers/archivers that respect it are fine too, though of course some
won't.

I once wrote to the author of a magazine article who had published a simple
crawler, pointing out that it would hammer whatever server it was crawling,
since it did not delay between requests or intersperse them with requests
to other servers, but simply walked the server's tree depth-first, and that
it should respect robots.txt.  He wrote back saying "That's the Internet
today; deal with it."  I could have answered (but I didn't) that hits are a
cost to the server operator, and anyone running his dumb crawler was not
only DDOSing, but spending my money for his own purposes.
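
To make the complaint concrete, here is a minimal sketch of the behaviour I
was asking for: skip what robots.txt disallows, take hosts in turn rather
than draining one server's tree, and pause between requests.  The names
polite_order and crawl, the "examplebot" agent string, and the one-second
default delay are all made up for illustration; fetch is whatever function
actually retrieves a URL.  It is not the magazine's crawler, and a host
with no robots.txt is treated as having expressed no preference.

```python
import time
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_rules(robots_txt):
    """Parse robots.txt text that has already been fetched by other means."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_order(urls, rules_by_host, agent="examplebot"):
    """Drop disallowed URLs, then interleave the rest round-robin by host."""
    queues = {}
    for url in urls:
        host = urlsplit(url).netloc
        rules = rules_by_host.get(host)
        # Assumption: no robots.txt for a host means no expressed preference.
        if rules is None or rules.can_fetch(agent, url):
            queues.setdefault(host, deque()).append(url)
    ordered = []
    while queues:
        # One URL per host per round, so no single server takes every hit.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered


def crawl(urls, rules_by_host, fetch, delay=1.0, sleep=time.sleep):
    """Fetch URLs in polite order, sleeping `delay` seconds between requests."""
    for i, url in enumerate(polite_order(urls, rules_by_host)):
        if i:
            sleep(delay)
        fetch(url)
```

Passing sleep as a parameter is only there so the delay can be replaced
with a stub when trying the sketch out.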

But I do think that once robots.txt support is in place, no robots.txt = no
expressed preference.

> If it's true for people with an explicit preference, it can also be
> true for people who haven't expressed a preference yet. Since Gemini
> has a higher standard for user privacy than the web, it can also have a
> higher standard for these preferences - one that does not rely on
> presumed consent - if we want it to.
>

By this logic, nobody should be able to access a Gemini server at all
unless the capsule author has expressed a preference for them to do so.
But to publish is to expose your work to the public.

> The FAQ immediately above the one you quoted reads:
>
> > Why isn't the site I'm looking for in the archive?
>
> > Some sites may not be included because the automated crawlers were
> > unaware of their existence at the time of the crawl. It's also
> > possible that some sites were not archived because they were
> > password protected, blocked by robots.txt, or otherwise inaccessible
> > to our automated systems. Site owners might have also requested that
> > their sites be excluded from the Wayback Machine.
>

I interpret that to mean that some sites were not crawled during the period
when the Archive was paying attention to robots.txt, and so their content
as of that date is unavailable.  Note the past tense:  "were [...]
blocked by robots.txt" as opposed to "are blocked".

> If archive.org didn't respect robots.txt at all, it would lend a lot of
> flavour to the "archiver" virtual user-agent idea in the companion
> spec, in addition to this discussion. Do you still have doubts after
> reading this section?
>

I have no doubt whatever that the crawler doesn't respect robots.txt.  I
could do a little experiment, though.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
The competent programmer is fully aware of the strictly limited size of his
own skull; therefore he approaches the programming task in full humility,
and among other things he avoids clever tricks like the plague.
        --Edsger Dijkstra