Identifying robots (was Re: Open Source Proxy)

Sean Conner sean at conman.org
Fri Jul 24 01:59:03 BST 2020


It was thus said that the Great Natalie Pendragon once stated:
> For the GUS crawl at least, the crawler doesn't identify itself _to_
> crawled sites, but it does obey blocks of rules in robots.txt files
> according to user-agent. So it works without needing a user-agent
> header.
> 
> It obeys user-agent of `*`, `indexer`, and `gus` in order of
> increasing importance.
> 
> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers had some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
> 
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.
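
  For reference, the layered blocks being described look something like
this in a site's robots.txt (the Disallow paths are invented, just to
show the shape of it):

	User-agent: *
	Disallow: /private/

	User-agent: indexer
	Disallow: /drafts/

	User-agent: gus
	Disallow: /mirror/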

  That's a decent idea, but that still doesn't help when I want to block a
particular bot for "misbehaving" (in some nebulous way).  For example,
there's this one bot, "The Knowledge AI" which sends requests like

	/%22http:/wesiseli.com/magician/%22 [1]

(and yes, that's an actual example, pulled straight off the log file).  It's
not yet annoying enough to block [2], but at least I have some chance
of blocking it via robots.txt (which it does request).  
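
  If it ever does cross that line, and assuming it will match a robots.txt
group against its literal user-agent string (spaces and all), something
like this should be enough to shut it out:

	User-agent: The Knowledge AI
	Disallow: /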

  -spc

[1]	I can't quite figure out why it includes the quotes (the %22s in
	the log entry are percent-encoded double quotes) as part of the
	link.  *All* the links on my websites look like:

		<a href="http://example.com/">

	and for the most part, it can parse those links correctly.  And it's
	not limited to just *one* bot; several of them show the same
	behavior.

[2]	It's nearly impossible to find anything out about it, as the
	user-agent string is literally "The Knowledge AI", so it might be
	worth blocking it just out of spite.


