crawling, proxies, and robots.txt

Sean Conner sean at conman.org
Mon Oct 12 05:06:01 BST 2020


It was thus said that the Great Michael Lazar once stated:
> Hi all,
> 
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is to
> create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email to
> the list [1]. I've spent a fair amount of time in gopher, and I am deeply
> saddened that so many of the important gopher servers from the 90's have been
> lost forever. Currently my crawler is respecting robots.txt files using the
> "mozz-archiver" user agent. I will probably standardize this to match my proxy
> (i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I will even
> respect robots files for this project (archive.org doesn't [2]), but right now
> I'm leaning towards "yes". If you are currently mirroring a large website like
> wikipedia over gemini, I would greatly appreciate it if you set up a robots.txt to
> block one of the above user-agents from those paths to make my life a bit
> easier.

  So I've added the following:

User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston

to my /robots.txt file.  I assume this will catch your archiver and prevent
my blog from being archived.  Since my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived (by existing web
archives).
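
  For what it's worth, here's a rough sketch in Python (untested, and not
your crawler's actual code) of how an archiver might fetch a host's
/robots.txt over Gemini and check a path against one of those user agents
using the standard library's robotparser.  The /gemini path at the end is
made up for illustration, and certificate verification is skipped since most
Gemini hosts use self-signed certificates:

import socket
import ssl
from urllib import robotparser

def fetch_gemini_robots(host, port=1965):
    """Fetch gemini://host/robots.txt and return the body, or "" on non-2x."""
    context = ssl.create_default_context()
    # Most Gemini servers use self-signed certificates (TOFU), so skip
    # CA verification for this sketch.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(f"gemini://{host}/robots.txt\r\n".encode("utf-8"))
            data = tls.makefile("rb").read()
    header, _, body = data.partition(b"\r\n")
    if not header.decode("utf-8", "replace").startswith("2"):
        return ""            # no robots.txt: nothing is disallowed
    return body.decode("utf-8", "replace")

def allowed(robots_text, user_agent, url):
    """True if user_agent may fetch url under the given robots.txt text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

robots = fetch_gemini_robots("conman.org")
print(allowed(robots, "archiver", "gemini://conman.org/boston"))  # False
print(allowed(robots, "archiver", "gemini://conman.org/gemini"))  # True (made-up path)

  One caveat: under Python's substring matching, a lone "User-agent:
archiver" line would already match "archiver-mozz" and "mozz-archiver", but
your crawler may well match names more strictly, so listing all three as
above seems the safer bet.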

  Now a question: when people check the archive, how will the missing
portions of a Gemini site be presented?  I'm blocking the above because it's
a mirror of an existing web site (and it might fit your "large" category,
what with 20 years of content there), but there's no indication of that in
the robots.txt file.

  -spc


