Crawlers on Gemini and best practices
Solene Rapenne
solene at perso.pw
Tue Dec 8 13:58:40 GMT 2020
On Tue, 8 Dec 2020 14:36:56 +0100
Stephane Bortzmeyer <stephane at sources.org>:
> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> Currently, I have not yet let it loose on the Internet, because
> there are some questions I have.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification but it could work for Gemini as well as for the
> Web and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?
Depending on what you are trying to do, you can put your contact
info in the query itself.
On first contact with a new server, before you start crawling, you
could request gemini://hostname/CRAWLER_FROM_SOMEONE_AT_HOST_DOT_COM
This is what I do for a gopher connectivity check.
I have to admit it's a rather poor solution, but I haven't found a
better way.
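For illustration, a minimal sketch of that first-contact announcement
in Python, assuming the standard Gemini port 1965 and keeping the
placeholder path from above (nothing in the spec defines it, the only
goal is that the contact address shows up in the server's logs):

import socket
import ssl

def announce_crawler(host, contact_path="/CRAWLER_FROM_SOMEONE_AT_HOST_DOT_COM"):
    # Gemini servers commonly use self-signed certificates, so
    # verification is skipped here; a real crawler should use TOFU
    # or certificate pinning instead.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    with socket.create_connection((host, 1965), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # A Gemini request is just the absolute URL followed by CRLF.
            tls.sendall(f"gemini://{host}{contact_path}\r\n".encode("utf-8"))
            # Read the response header (status + meta, at most 1029 bytes).
            header = tls.recv(1029).decode("utf-8", errors="replace")
    return header.strip()

if __name__ == "__main__":
    print(announce_crawler("example.org"))

The request will most likely come back as "51 Not found"; the point is
only that the requested path, and therefore the contact address, ends
up in the server's access log where the operator can see it.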