robots.txt for Gemini formalised
Robert "khuxkm" Miles
khuxkm at tilde.team
Mon Nov 23 02:13:11 GMT 2020
November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:
> It was thus said that the Great Robert khuxkm Miles once stated:
>
>> Is there any good use case for a proxy User-Agent in robots.txt, other than
>> blocking web spiders from being able to crawl gemspace? If not, I would be
>> in favor of dropping that part of the definition.
>
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web-based proxy at <https://example.com/gemini>, web
> crawlers will check <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
>
> 1. Most server certificates are self-signed and opt out of the CA
> business. And even if a crawler were to accept self-signed
> (or non-standard CA signed) certificates, then---
>
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
> fail anyway.
>
> -spc
Well, the argument is that the crawler would access <https://example.com/gemini?gemini://gemini.conman.org/>, and from there it could access <https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and so on. However, I'd argue that the onus falls on example.com to set a robots.txt rule in <https://example.com/robots.txt> to prevent web crawlers from indexing anything through their proxy.
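For what it's worth, a couple of lines in <https://example.com/robots.txt> would be enough (just a sketch, assuming the proxy is mounted under /gemini as in the URLs above):

    User-agent: *
    Disallow: /gemini

Disallow is a simple prefix match against everything after the host, so it also covers /gemini?gemini://... URLs, and a well-behaved crawler would never follow the proxy into gemspace.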
Just my two cents,
Robert "khuxkm" Miles