[spec] IRIs, IDNs, and all that international jazz

Omar Polo op at omarpolo.com
Sat Dec 26 00:32:00 GMT 2020


bie <bie at 202x.moe> writes:
>
> You're kind of correct in the sense that if we just treat the request as
> arbitrary bytes and not as an IRI (no validation, no handling at all),
> it's simple, but I don't think that's the right way to look at this
> issue. Instead, it's about the complexity of proper URI handling vs
> proper IRI handling. Not to mention that IRIs can still have
> percent-encoded characters!

Sorry for the late reply, but I took some time to fix up my server and
now here I am :)

Originally, when I wrote my server, I used a really simple routine that
extracted the path from the URL (plus some minor checking) and that's
it.  This wasn't good, of course.

In the last two days I took the time to first write a proper URL
parser[0], and then extend it to support IRIs[1].  Turns out, once
you have a URL parser (not hard to do at all), you almost have a
complete IRI parser.  As Sean wrote, you basically have to replace the
unreserved rule to allow other UTF-8 characters and you're done.  And
even if you're uncomfortable doing this, the RFC lists the valid ranges,
so adding a couple of checks isn't the end of the world (if you want to
be 100% compliant, whatever that means).
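
To make that concrete, here is a minimal sketch of what the extended
rule could look like.  This is not the code from gmid: the function
names are made up for illustration, and it assumes the input has
already been decoded from UTF-8 into Unicode code points.  The ranges
are the "ucschar" ones from RFC 3987.

```c
#include <stdint.h>

/* RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~" */
int
is_unreserved(uint32_t cp)
{
	return (cp >= 'A' && cp <= 'Z') ||
	    (cp >= 'a' && cp <= 'z') ||
	    (cp >= '0' && cp <= '9') ||
	    cp == '-' || cp == '.' || cp == '_' || cp == '~';
}

/* RFC 3987 "ucschar": the non-ASCII code points allowed in an IRI */
int
is_ucschar(uint32_t cp)
{
	if ((cp >= 0xA0 && cp <= 0xD7FF) ||
	    (cp >= 0xF900 && cp <= 0xFDCF) ||
	    (cp >= 0xFDF0 && cp <= 0xFFEF))
		return 1;

	/* planes 1 to 13: %xN0000-NFFFD, i.e. skip xFFFE and xFFFF */
	if (cp >= 0x10000 && cp <= 0xDFFFD)
		return (cp & 0xFFFF) <= 0xFFFD;

	/* plane 14 starts at E1000 */
	return cp >= 0xE1000 && cp <= 0xEFFFD;
}

/* RFC 3987 "iunreserved": the URI rule plus ucschar */
int
is_iunreserved(uint32_t cp)
{
	return is_unreserved(cp) || is_ucschar(cp);
}
```

A URI parser that calls something like is_unreserved() in its path and
query rules would only need to call is_iunreserved() instead (after
decoding the UTF-8 sequence) to accept IRIs.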

(And all of this comes from someone who had never, ever, implemented an
IRI/URI parser before, who read RFC 3986 for the first time while
writing the code, and who has successfully -- I believe -- implemented
a full IRI parser in less than 500 lines of C, with comments and
everything, without using anything other than the standard library.
Heck, the parser doesn't even allocate memory.)

> After thinking about this for a while, the biggest issue for me is that
> this is a breaking change. Breaking in the sense that it breaks *every
> single compliant server we already have*! If gemini, which has been
> surprisingly good at resisting breaking spec changes, accepts this, I
> don't see any reason to believe that it won't happen again and again,
> for equally silly reasons.
>
> bie

I don't buy this argument.  It's not like tomorrow we won't be able to
browse gemini unless we update clients/servers.  Valid URIs are also
valid IRIs, so it's not an armageddon.  The whole thing started (IIRC)
because the spec says "UTF8 URI".  Furthermore, the spec isn't finalised
yet (see for instance the change regarding full URLs vs relative ones in
the requests).

If you wrote your server for yourself, you probably won't need to change
anything: from what you wrote, I assume you're serving only files whose
names are ASCII only, so unless you want to host things with funny
names, you're probably good.

Anyway, sorry for the long reply, I didn't want to drag this discussion
out too much, really.  Let's see what will be decided :)

Cheers!

[0]: https://github.com/omar-polo/gmid/commit/33d32d1fd66a577f22f3f33f238e8dac44ec9995
[1]: https://github.com/omar-polo/gmid/commit/df6ca41da36c3f617cbbf3302ab120721ebfcfd2

