Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

Robert "khuxkm" Miles khuxkm at tilde.team
Tue Nov 24 19:08:57 GMT 2020


I am personally against this idea of forcing (or even normalizing) browsers to give special treatment to a request for a URL based on how the server would normally respond to it (I'm not even going to entertain the idea of pretending the internet draft doesn't apply to us). This is what I assume it would look like in the spec (or best practices, or wherever you want to put it):

> When a client makes a request for a URI with a path component of "/robots.txt", and the server would normally respond to such a request with a 51 Not Found status code, it should instead respond with a 20 status code, a MIME type of text/plain, and content of "User-Agent: *\r\nDisallow: /\r\n". This prevents capsules from being indexed without consent.

Doesn't that just *feel* like a hack to you?
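
Purely to illustrate the point, here is roughly what a server would have to do to satisfy the rule quoted above. This is a hypothetical sketch in Python; handle_request(), read_file() and the (status, meta, body) return shape are made up for illustration and don't come from any real Gemini server:

    # Hypothetical sketch of the quoted rule; not taken from any real server.
    DEFAULT_ROBOTS = "User-Agent: *\r\nDisallow: /\r\n"

    def handle_request(path, read_file):
        # read_file(path) returns the file's bytes, or None if it doesn't exist.
        body = read_file(path)
        if body is not None:
            return (20, "text/gemini", body)
        if path == "/robots.txt":
            # The proposed special case: fabricate a deny-all robots.txt
            # instead of answering 51 Not Found.
            return (20, "text/plain", DEFAULT_ROBOTS.encode())
        return (51, "Not found", None)

That branch would have to live in every server implementation, special-casing a single path forever.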

I did some research with GUS's known-hosts list. Of the 362 hosts known to GUS, only 36 have a robots.txt file, so any choice made as to what the default robots.txt should be will affect around 90% of Geminispace (not to mention any new hosts to come). Notably, of the 36 hosts that do serve a robots.txt, 7 completely block archiving (although that number is skewed, as I know that at least 3 of those hosts are run by the same person, and 2 of those hosts are run by another person). This means that anywhere from 2% (if all of the hosts without a robots.txt are fine with being archived) to 20% (if the hosts with a robots.txt are representative of the whole population), or even 92% (if everybody without a robots.txt doesn't want to be archived), of Geminispace could object to being archived. I don't feel comfortable making a declaration either way, but this is food for thought.
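
For anyone who wants to check the arithmetic, the three scenarios work out like this (a quick Python sketch using only the GUS figures above):

    # Back-of-the-envelope check; the inputs are the GUS figures quoted above.
    hosts = 362          # hosts known to GUS
    with_robots = 36     # hosts serving a robots.txt
    block_all = 7        # hosts whose robots.txt blocks archiving entirely

    no_robots = hosts - with_robots         # 326 hosts, ~90% of Geminispace
    low = block_all / hosts                 # ~2%: only the explicit blockers object
    mid = block_all / with_robots           # ~20%: robots.txt hosts are representative
    high = (block_all + no_robots) / hosts  # ~92%: everyone without a robots.txt objects

    print(f"no robots.txt: {no_robots / hosts:.1%}")
    print(f"low {low:.1%}, representative {mid:.1%}, high {high:.1%}")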

Just my two cents,
Robert "khuxkm" Miles

