Some reading on IRIs and IDNs
Sean Conner
sean at conman.org
Wed Dec 9 05:26:51 GMT 2020
It was thus said that the Great Jason McBrayer once stated:
>
> Now, what does that require of client authors and server authors?
>
> What is the *absolute minimum* we can require of client and server
> authors and have things work?
As I've stated, I've created an IRI parser per RFC-3987 [1] and it was a
very minimal change to my original URL parser per RFC-3986 [2]. Basically,
it allows any UTF-8 character past codepoint 128 to be used as-is in the IRI.
Languages that have URL parsers may or may not support UTF-8 data. So IRI
parsing may or may not be an issue (aside from Unicode normalization) on a
per-language basis.
I've also started down the punycode rabbit hole. As Stephane has stated,
DNS *can* support UTF-8, but such support isn't wide, nor is it a standard.
Punycode was developed to encode Unicode as ASCII in a most Byzantine way.
It does have an RFC (RFC-3492) and said RFC does contain code for encoding
and decoding punycode (but it's in C, and the API is ... not what I would
define but it can be worked with). IDN support, from my experience over the
past two days, is *harder* than IRI, although the concern was mostly the
other way. I haven't actually *gotten* to the part of converting a domain
name to punycode but in general, to convert a domain name:
	for each label [3]:
	  if label has non-ASCII characters
	    convert to punycode, prepend "xn--" to result
so a domain name like "納豆.english.sådär.مصر" is converted thusly:
	納豆 -> 99zt52a -> xn--99zt52a
	english -> (no conversion required)
	sådär -> sdr-rlad -> xn--sdr-rlad
	مصر -> wgbh1c -> xn--wgbh1c
to become "xn--99zt52a.english.xn--sdr-rlad.xn--wgbh1c" (and that last
segment is giving my editor fits because it's right-to-left). The example
is extreme but it's just there to serve as an example of how to go about it.
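The per-label loop above can be sketched in Python. This is only an
illustration, not the code I'm working on: the function name
domain_to_ascii is made up for the example, it leans on the "punycode"
codec that ships with Python's standard library, and it assumes the labels
need no Unicode normalization or case mapping first (a real IDN
implementation would have to deal with both).

```python
def domain_to_ascii(domain: str) -> str:
    """Convert each non-ASCII label of a domain name to its "xn--" form.

    Hypothetical helper for illustration; it skips the normalization and
    validity checks that full IDN processing requires.
    """
    converted = []
    for label in domain.split("."):
        if label.isascii():
            # Plain ASCII labels pass through untouched.
            converted.append(label)
        else:
            # The stdlib "punycode" codec yields the bare RFC 3492
            # encoding; the "xn--" prefix has to be added by hand.
            puny = label.encode("punycode").decode("ascii")
            converted.append("xn--" + puny)
    return ".".join(converted)


if __name__ == "__main__":
    print(domain_to_ascii("s\u00e5d\u00e4r.english"))
```

Going label by label keeps the ASCII parts of a name readable and only
touches the labels that actually need encoding.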
So given my experiences so far, I would say the easiest way to deal with
all this is to make it a client issue. Hold off on IDN support for now (see
below for some more questions about it), but allow UTF-8 in the path and
query of links in text/gemini, and have the client percent-encode it before
making the request. A client, given a link like:
	=> gemini://gemini.bortzmeyer.org:8965/café?foo=bar Order from the Café
should be able to parse it with the UTF-8 characters, but convert it to:
	gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar
before making the request. At the very least, tools could be developed to
encode links in text/gemini before publishing them if no one wants the spec
to change at all.
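A rough sketch of that client-side step, in Python, might look like the
following. The function name encode_for_request and the exact sets of
characters left unescaped are my assumptions for the example, not anything
from the Gemini spec. It uses urllib.parse from the standard library,
leaves the host alone (so no IDN handling), and keeps "%" as a safe
character so links that are already percent-encoded are not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is in the path and query. These sets are an
# assumption for this sketch: the RFC 3986 sub-delims plus ":" and "@",
# with "%" kept so existing escapes survive.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"


def encode_for_request(link: str) -> str:
    """Percent-encode non-ASCII characters in the path and query.

    Hypothetical helper: the scheme, host, and port are passed through
    unchanged.
    """
    parts = urlsplit(link)
    # quote() encodes characters as UTF-8 bytes by default.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )


if __name__ == "__main__":
    print(encode_for_request(
        "gemini://gemini.bortzmeyer.org:8965/caf\u00e9?foo=bar"
    ))
```

The same function could sit in a publishing tool instead of a client, if
the preference is to encode links before they are served.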
I feel that would be the easiest, least breaking thing to do now. Making
IDN (punycode) mandatory might require a bit more discussion as I'm not sure
of the language support. I'm not even sure what name should be in a
certificate for an IDN---the full UTF-8 version, or the punycode version, or
both? What's currently done in HTTP land about this? (answering this will
at least point in a direction, even if we don't want to go that direction).
-spc
[1] https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua
[2] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua
[3] The domain name "gemini.conman.org" has three labels, "gemini",
"conman" and "org". The term "label" is DNS lingo.