Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)
Krixano
krixano at protonmail.com
Thu Nov 26 06:09:54 GMT 2020
Gaah, ok, I hit reply instead of reply all... so my messages were sent directly to the people, lol. I'll repost them here:
I want to point out that assuming a missing robots.txt means a server operator
doesn't mind their content being archived is a leap in logic that doesn't
actually follow. A server admin/user could have simply forgotten to put up a
robots.txt, *or* they might never have heard of it in the first place.
> A personal example: *I* didn't have a robots.txt file on my capsule until today, but I don't want to be included in archives for various reasons. Presuming consent from the lack of a robots.txt file would have incorrectly guessed my preference and harmed my privacy. Who else in that 90% is like me? We don't know.
Exactly! When I first got my server up, I didn't have a robots.txt for the longest time. Some of my content was actually not supposed to be archived because it was dynamic stuff. And other stuff I didn't necessarily want archived.
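For what it's worth, the preference is cheap to state once you know the mechanism exists. Here's a rough sketch of the kind of robots.txt I could have served (the "archiver" virtual user-agent comes from the companion spec mentioned further down; the paths are made up purely for illustration):

# gemini://example.org/robots.txt -- hypothetical capsule and paths
# Keep archivers away from dynamic pages that make no sense in a snapshot...
User-agent: archiver
Disallow: /cgi-bin/
Disallow: /search

# ...and keep every crawler out of anything I'd rather not have mirrored.
User-agent: *
Disallow: /private/

Of course, none of that helps if the operator never knew robots.txt existed in the first place, which is exactly the point.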
Christian Seibold
Sent with [ProtonMail](https://protonmail.com/) Secure Email.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, November 25th, 2020 at 9:10 AM, John Cowan <cowan at ccil.org> wrote:
> On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:
>
>> (Received off-list, but I assume it was *meant* for the list, so
>> replying there)
>
> It was, so thanks. My private messages are labeled (Private message) at the top because I make this mistake a lot.
>
>> Whatever the outcome of the opt-in vs opt-out part of this discussion,
>
> That's the only part that concerns me. A robots.txt spec is good and crawlers/archivers that respect it are fine too, though of course some won't.
>
> I once wrote to the author of a magazine article, who had published a simple crawler, to tell him that it would hammer whatever server it was crawling, since it did not delay between requests or intersperse them with requests to other servers, but simply walked the server's tree depth-first, and that it should respect robots.txt. He wrote back saying "That's the Internet today; deal with it." I could have answered (but I didn't) that hits are a cost to the server operator, and anyone running his dumb crawler was not only DDoSing, but spending my money for his own purposes.
>
> But I do think that once robots.txt support is in place, no robots.txt = no expressed preference.
>
>> If it's true for people with an explicit preference, it can also be
>> true for people who haven't expressed a preference yet. Since Gemini
>> has a higher standard for user privacy than the web, it can also have a
>> higher standard for these preferences - one that does not rely on
>> presumed consent - if we want it to.
>
> By this logic, nobody should be able to access a Gemini server at all unless the capsule author has expressed a preference for them to do so. But to publish is to expose your work to the public.
>
>> The FAQ immediately above the one you quoted reads:
>>
>>> Why isn't the site I'm looking for in the archive?
>>
>>> Some sites may not be included because the automated crawlers were
>>> unaware of their existence at the time of the crawl. It's also
>>> possible that some sites were not archived because they were
>>> password protected, blocked by robots.txt, or otherwise inaccessible
>>> to our automated systems. Site owners might have also requested that
>>> their sites be excluded from the Wayback Machine.
>
> I interpret that to mean that some sites were not crawled during the period when the Archive was paying attention to robots.txt, and so their content as of that date is unavailable. Note the past tense: "were [...] blocked by robots.txt" as opposed to "are blocked".
>
>> If archive.org didn't respect robots.txt at all, it would lend a lot of
>> flavour to the "archiver" virtual user-agent idea in the companion
>> spec, in addition to this discussion. Do you still have doubts after
>> reading this section?
>
> I have no doubt whatever that the crawler doesn't respect robots.txt. I could do a little experiment, though.
>
> John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
> The competent programmer is fully aware of the strictly limited size of his own
> skull; therefore he approaches the programming task in full humility, and among
> other things he avoids clever tricks like the plague. --Edsger Dijkstra
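P.S. On the "hammering" crawler John describes above: the polite behaviour isn't hard to write. Here's a rough, hypothetical Python sketch (the names, the 5-second delay, and the helper functions are mine, not from any spec or existing crawler, and it leans on Python's urllib.robotparser, which implements the web's robots.txt conventions; close enough for a sketch) of a fetch routine that checks each host's robots.txt and rate-limits per host, instead of walking one server's tree as fast as possible:

import socket
import ssl
import time
import urllib.robotparser
from urllib.parse import urlparse

CRAWL_DELAY = 5.0   # seconds between requests to the same host (arbitrary choice)
USER_AGENT = "archiver"

_last_hit = {}      # host -> monotonic time of our last request to it
_robots = {}        # host -> RobotFileParser built from that host's robots.txt

def gemini_fetch(url, timeout=10):
    """One Gemini request: return the body of a 2x response as text, else None."""
    parts = urlparse(url)
    host, port = parts.hostname, parts.port or 1965
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # self-signed certs are the norm in Geminispace
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                tls.sendall((url + "\r\n").encode("utf-8"))
                raw = b""
                while chunk := tls.recv(4096):
                    raw += chunk
    except OSError:
        return None
    header, _, body = raw.partition(b"\r\n")
    if not header.startswith(b"2"):   # only 2x responses carry a body
        return None
    return body.decode("utf-8", errors="replace")

def allowed(url):
    """True if the host's robots.txt (if it has one) permits fetching url."""
    host = urlparse(url).hostname
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        text = gemini_fetch(f"gemini://{host}/robots.txt")
        rp.parse(text.splitlines() if text else [])  # no file -> no rules
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    """Fetch url only if robots.txt allows it, and never hit a host too quickly."""
    host = urlparse(url).hostname
    if not allowed(url):
        return None
    wait = CRAWL_DELAY - (time.monotonic() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    _last_hit[host] = time.monotonic()
    return gemini_fetch(url)

Note that under this sketch a host with no robots.txt at all ends up fully crawlable, which is exactly the "no robots.txt = no expressed preference" default being argued about above.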