Crawlers on Gemini and best practices
colecmac at protonmail.com
Wed Dec 16 21:24:27 GMT 2020
On Saturday, December 12, 2020 4:46 AM, Leo <list at gkbrk.com> wrote:
> > Why are we defining new standards and filenames? bots.txt, .well-known,
> > etc.
>
> Just want to point out that the .well-known path for machine-readable
> data about an origin is a proposed IETF standard that has relatively
> widespread use today. It is filed under RFC 8615, and is definitely not a
> standard that was invented in this thread.
Yes, I'm aware of .well-known. I meant that using it to hold a robots.txt-type
file would be a new filepath: for example, something like
gemini://example.org/.well-known/bots.txt instead of the /robots.txt that
crawlers already look for.
> The first paragraph of the introduction even references the robots file.
>
> While I don't necessarily agree with the naming of bots.txt, I see no
> problem with putting these files under a .well-known directory.
My only problem was that it would be reinventing something that already
exists, but I likewise have no issue with the idea of .well-known in general.
> > We don't need this.
>
> Thanks for making this mailing list a lot more efficient and talking
> about what the Gemini community needs in a 4-word sentence.
Sorry, perhaps that was too curt. I don't intend to speak for everyone;
it's only my opinion. However, it wasn't just a four-word sentence: I backed
up my opinion in the rest of my email.
> Even if the original path of /robots.txt is kept, I think it makes sense
> to clarify an algorithm in non-ambiguous steps in order to get rid of
> the disagreements in edge-cases.
>
> > let's just pick a standard that works and make it offical.
>
> The point is that the standard works for simple cases, but leaves a lot
> to be desired when it comes to clarifying more complex cases. This
> results in a lot of robots.txt implementations disagreeing about what is
> allowed and not allowed.
I agree; that's why I was trying to pick an existing standard instead of
developing our own. A well-written standard pins down all these edge cases.
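To make that concrete, here's a rough sketch (my own illustration with
made-up names, not code from the draft) of the precedence rule the standard
I linked spells out: among all matching Allow/Disallow rules in a
user-agent group, the longest matching path prefix wins, and Allow wins a
tie. (The full draft also defines * and $ wildcards, which this toy version
skips.)

    # Toy precedence check following the REP draft's longest-match rule.
    # "rules" is a list of (directive, path_prefix) pairs for one
    # user-agent group.
    def is_allowed(rules, path):
        best = ("allow", "")  # no matching rule means the path is allowed
        for directive, prefix in rules:
            if not path.startswith(prefix):
                continue
            if len(prefix) > len(best[1]):
                best = (directive, prefix)
            elif len(prefix) == len(best[1]) and directive == "allow":
                best = (directive, prefix)  # on a tie, Allow wins
        return best[0] == "allow"

    rules = [("disallow", "/cgi-bin/"), ("allow", "/cgi-bin/public/")]
    is_allowed(rules, "/cgi-bin/secret")       # False
    is_allowed(rules, "/cgi-bin/public/page")  # True

Tie-breaks and longest-match behaviour like this are exactly the edge cases
the old de-facto robots.txt never nailed down, and why existing parsers
disagree.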
> Additionally by crawling the Web, you can see that people tend to extend
> robots.txt in non-standard ways and this only gets incorporated into
> Google's crawlers if the website is important enough.
OK, but I don't see how that's relevant. By picking a standard here and
sticking to it, we can avoid that.
> > I am no big fan of Google, but they are the kings of crawling and it
> > makes sense to go with them here.
>
> The kings of crawling deemed HTTP and AMP the most suitable protocol and
> markup format for content to be crawled, why don't we stop inventing
> standards like Gemini and Gemtext and go with them here.
Again, I don't see how that's relevant. Yes, "we" here don't like AMP; yes,
we don't like HTTP; etc. That doesn't make the robots.txt standard I sent
a bad one.
> > The spec makes many example references to HTTP, but note that it is
> > fully protocol-agnostic, so it works fine for Gemini.
>
> Gemtext spec makes references to Gemini, but it is fully
> protocol-agnostic, so it works fine with HTTP. Similarly, Gemini makes
> many references to Gemtext, but it is content-type agnostic so it works
> fine with HTML. But we thought we could be better off shedding
> historical baggage and reinvented not one, but two main concepts of the
> traditional Web.
>
This quote, along with this line:
> why don't we stop inventing standards like Gemini and Gemtext and go with
> them [Google] here.
makes me think you misunderstand some of Gemini's ideas. Solderpunk has
talked on here about the idea of "radical familiarity", and how Gemini uses
well-known (ha) standards as much as possible, like TLS, media types, URLs,
etc. Gemini tries to *avoid* re-inventing! Obviously the protocol itself
and gemtext are the exceptions to this, but otherwise, making niche protocols
only for Gemini is not the way to go (in my opinion). It makes things
harder for developers to implement and understand.
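For anyone who hasn't seen it, here's roughly what that familiarity buys
(a sketch of my own with a made-up hostname, skipping the TOFU certificate
checking a real client would do):

    # A complete Gemini request in a few lines of Python: an ordinary URL,
    # ordinary TLS, and a media type in the response header.
    import socket, ssl
    from urllib.parse import urlparse   # plain RFC 3986 URLs

    url = "gemini://example.org/"       # made-up capsule, for illustration
    host = urlparse(url).hostname

    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # real clients do TOFU; skipped here
    ctx.verify_mode = ssl.CERT_NONE

    with socket.create_connection((host, 1965)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))  # the whole request
            # Naive read; a real client reads until the first CRLF.
            header = tls.recv(2048).split(b"\r\n", 1)[0]
            print(header.decode("utf-8"))  # e.g. "20 text/gemini"

The only genuinely new pieces are the one-line request and the status code;
the URL parsing, TLS, and media type are all borrowed, which is the point.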
Cheers,
makeworld