Identifying robots (was Re: Open Source Proxy)

Natalie Pendragon natpen at natpen.net
Thu Jul 23 23:01:03 BST 2020


For the GUS crawl at least, the crawler doesn't identify itself _to_
crawled sites, but it does obey blocks of rules in robots.txt files
according to user-agent. So robots.txt still works even though Gemini
has no user-agent header.

It obeys rule blocks for the user-agents `*`, `indexer`, and `gus`,
in order of increasing precedence, so a `gus` block overrides an
`indexer` block, which in turn overrides `*`.
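
To make that concrete, here's a rough Python sketch of that
precedence logic. It's not GUS's actual code - the function name and
data layout are made up for illustration - and it assumes the most
specific matching block applies exclusively, as in the web's
robots.txt convention:

    def select_rules(blocks):
        """Pick the robots.txt rule block that applies to this crawler.

        `blocks` maps a user-agent token (lowercased) to the list of
        Disallow paths parsed from that block.
        """
        rules = []
        # Later agents override earlier ones: * < indexer < gus
        for agent in ("*", "indexer", "gus"):
            if agent in blocks:
                rules = blocks[agent]
        return rules

    # e.g. a site that blocks everything by default but allows indexers:
    # select_rules({"*": ["/"], "indexer": []}) -> []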

There's been some talk of generic user-agents in the past, which I
think is a really nice idea. If `indexer` were a user-agent that both
sites and crawlers had some sort of informal consensus on, then sites
wouldn't need to worry about keeping up with every new indexer that
pops up.

Some other generic user-agent ideas, iirc, were `archiver` and
`proxy`.
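
For illustration, a capsule adopting those generic agents might serve
a robots.txt like this (hypothetical example.org paths, not any real
site's rules):

    # gemini://example.org/robots.txt
    User-agent: *
    Disallow: /cgi-bin/

    User-agent: archiver
    Disallow: /

    User-agent: indexer
    Disallow: /logs/

An archiver honoring its block would skip the whole capsule, an
indexer would skip only /logs/, and everything else would skip only
/cgi-bin/.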

On Thu, Jul 23, 2020 at 04:45:50PM -0400, Sean Conner wrote:
> It was thus said that the Great Jason McBrayer once stated:
> >
> > This is cool, but when you stand it up, don't forget an appropriate
> > robots.txt!
>
>   Question---HTTP has the User-Agent: header to help identify webbots, but
> Gemini doesn't have that.  How do I instruct a Gemini bot with robots.txt,
> when there's no way for a Gemini bot to identify itself?
>
>   -spc
>

