robots.txt for Gemini formalised
Robert "khuxkm" Miles
khuxkm at tilde.team
Mon Nov 23 02:13:11 GMT 2020
November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:
> It was thus said that the Great Robert khuxkm Miles once stated:
>
>> Is there any good use case for a proxy User-Agent in robots.txt, other than
>> blocking web spiders from being able to crawl gemspace? If not, I would be
>> in favor of dropping that part of the definition.
>
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web-based proxy at <https://example.com/gemini>, web
> crawlers will check <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
>
> 1. Most server certificates are self-signed and opt out of the CA
> business. And even if a crawler were to accept self-signed
> (or non-standard CA signed) certificates, then---
>
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
> fail anyway.
>
> -spc
Well, the argument is that the crawler would access <https://example.com/gemini?gemini://gemini.conman.org/>, and from there it could access <https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and so on. However, I'd argue that the onus falls on example.com to set a robots.txt rule in <https://example.com/robots.txt> to prevent web crawlers from indexing anything through their proxy.
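For what it's worth, a couple of lines in <https://example.com/robots.txt> would be enough (just a sketch, assuming the proxy is mounted under /gemini as in the URLs above):

    User-agent: *
    Disallow: /gemini

Disallow is a simple prefix match against everything after the host, so it also covers /gemini?gemini://... URLs, and a well-behaved crawler would never follow the proxy into gemspace.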
Just my two cents,
Robert "khuxkm" Miles