crawling, proxies, and robots.txt
Sean Conner
sean at conman.org
Mon Oct 12 05:06:01 BST 2020
It was thus said that the Great Michael Lazar once stated:
> Hi all,
>
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is to
> create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email to
> the list [0]. I've spent a fair amount of time in gopher, and I am deeply
> saddened that so many of the important gopher servers from the 90's have been
> lost forever. Currently my crawler is respecting robots.txt files using the
> "mozz-archiver" user agent. I will probably standardize this to match my proxy
> (i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I will even
> respect robots files for this project (archive.org doesn't [2]), but right now
> I'm leaning towards "yes". If you are currently mirroring a large website like
> wikipedia over gemini, I would greatly appreciate it if you set up a robots.txt
> to block one of the above user-agents from those paths to make my life a bit
> easier.
So I've added the following:
User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston
to my /robots.txt file. I assume this will catch your archiver and prevent
my blog from being archived; given that my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived (by existing web
archives).
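
For what it's worth, the usual robots.txt convention is that consecutive
User-agent: lines form one group that shares the Disallow: lines following
them, so a single rule should cover all three names. Here's a minimal
sketch of the check using Python's standard-library parser; the host name
is a stand-in, and whether your archiver's own matching behaves identically
is an assumption on my part:

  import urllib.robotparser

  ROBOTS = """\
  User-agent: archiver
  User-agent: archiver-mozz
  User-agent: mozz-archiver
  Disallow: /boston
  """

  rp = urllib.robotparser.RobotFileParser()
  # Feed the rules in directly: read() only fetches over HTTP, which is
  # no help for gemini://.  The parser strips leading whitespace on each
  # line, so the indentation above is harmless.
  rp.parse(ROBOTS.splitlines())

  base = "gemini://example.com"        # stand-in host, not a real capsule

  print(rp.can_fetch("mozz-archiver", base + "/boston"))       # False
  print(rp.can_fetch("archiver",      base + "/boston/2020"))  # False
  print(rp.can_fetch("mozz-archiver", base + "/other"))        # True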
Now a question: when people check the archive, how will the missing
portions of a Gemini site be presented? I'm blocking the above because it's
a mirror of an existing web site (and it might fit your "large" category,
what with 20 years of content there), but there's no indication of that in
the robots.txt file.
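
The closest thing to an in-band explanation I can think of is a comment
line (robots.txt allows them), something along the lines of:

  # /boston mirrors my web-based blog, which the existing web archives
  # already cover, so there's no need to capture it again here.
  User-agent: archiver
  User-agent: archiver-mozz
  User-agent: mozz-archiver
  Disallow: /boston

but nothing obliges a crawler to keep the comment around, let alone show
it to someone browsing the archive.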
-spc