Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)
John Cowan
cowan at ccil.org
Thu Nov 26 00:57:49 GMT 2020
On Wed, Nov 25, 2020 at 12:01 PM Nick Thomas <gemini at ur.gs> wrote:
> It would definitely be interesting if you had an experiment or
> reference demonstrating that archive.org ignores robots.txt in general,
> but this page simply isn't it.
Okay, I rediscovered the page I was looking for:
<https://webmasters.stackexchange.com/questions/71377/how-to-properly-disallow-the-archive-org-bot-did-things-change-if-so-when/>.
Search on that page for "random item on eBay". This test shows that as of
May 2017, the IA was honoring robots.txt. I tried this myself, and the IA
now shows three crawls, two in 2019 and one in 2020, that agree with what
you see ("Unknown item") if you follow the direct link. This agrees with
the claim that robots.txt support was turned off in 2018 for all sites;
however, apparently the IA is not announcing this. My guess (only a guess)
is that the IA thinks that people who don't want to be archived will start
using less reliable mechanisms like blocking IP addresses.
Now search farther down the page for "just did a quick test". This shows
that as of March 2017 the IA was refusing to display pages put off-limits
by robots.txt, consistent with the above. However, when the robots.txt
entry was removed, crawls from 2010 through 2017 suddenly appeared! So
even in the pre-2018 regime, the IA was crawling the pages but hiding them.
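
To make the "crawl but hide" behaviour concrete, here is a minimal sketch in
Python. It is only a model of what the tests above suggest, not the IA's
actual code. The user-agent token "ia_archiver", the example.org URL, and the
snapshot labels are assumptions chosen for illustration; only
urllib.robotparser is a real standard-library module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that puts a whole site off-limits to one crawler.
# "ia_archiver" is assumed here as the crawler's user-agent token.
BLOCKING_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def visible_snapshots(stored, robots_txt, url, agent="ia_archiver"):
    """Model of 'crawl but hide': snapshots are always stored, and the
    current robots.txt is consulted only when deciding what to display."""
    return list(stored) if is_allowed(robots_txt, agent, url) else []


# Placeholder snapshot labels; these are not real crawl records.
stored = ["snapshot-a", "snapshot-b"]
url = "https://example.org/page"

hidden = visible_snapshots(stored, BLOCKING_ROBOTS_TXT, url)
shown = visible_snapshots(stored, "", url)  # robots.txt entry removed
```

In this model, removing the robots.txt entry makes the older snapshots
reappear at once, because they were stored all along.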
John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
If I have seen farther than others, it is because I was looking through a
spyglass with my one good eye, with a parrot standing on my shoulder. --"Y"