Some reading on IRIs and IDNs

Sean Conner sean at conman.org
Wed Dec 9 05:26:51 GMT 2020


It was thus said that the Great Jason McBrayer once stated:
>   
> Now, what does that require of client authors and server authors?
> 
> What is the *absolute minimum* we can require of client and server
> authors and have things work?

  As I've stated, I've created an IRI parser per RFC-3987 [1] and it was a
very minimal change to my original URL parser per RFC-3986 [2].  Basically,
it allows any UTF-8 character past codepoint 128 to be used as-is in the IRI.
Languages that have URL parsers may or may not support UTF-8 data.  So IRI
parsing may or may not be an issue (aside from Unicode normalization) on a
per-language basis.

  I've also started down the punycode rabbit hole.  As Stephane has stated,
DNS *can* support UTF-8, but such support isn't widespread, nor is it a
standard.  Punycode was developed to encode Unicode as ASCII in a most
Byzantine way.
It does have an RFC (RFC-3492) and said RFC does contain code for encoding
and decoding punycode (but it's in C, and the API is ... not what I would
define but it can be worked with).  IDN support, from my experience over the
past two days, is *harder* than IRI, although the concern was mostly the
other way.  I haven't actually *gotten* to the part of converting a domain
name to punycode but in general, to convert a domain name:

	for each label [3]:
		if label has non-ASCII characters
			convert to punycode, prepend "xn--" to result

so a domain name like "納豆.english.sådär.مصر" is converted thusly:

	納豆 	-> 99zt52a -> xn--99zt52a
	english	-> (no conversion required)
	sådär 	-> sdr-rlad -> xn--sdr-rlad

	مصر 	-> wgbh1c -> xn--wgbh1c

to become "xn--99zt52a.english.xn--sdr-rlad.xn--wgbh1c" (and that last
segment is giving my editor fits because it's right-to-left).  The example
is extreme but it's just there to serve as an example of how to go about it.
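
The per-label loop above can be sketched in Python with the "punycode"
codec from the standard library.  This is only an illustration, not the
code from RFC-3492: the function names are made up for the example, and it
assumes the name has already been lowercased and normalized, since it does
none of the Unicode preparation a real IDN implementation needs.

```python
def to_ascii_label(label: str) -> str:
    """Return an ASCII label unchanged, otherwise its "xn--" form."""
    if label.isascii():
        return label
    # The stdlib "punycode" codec only does the RFC 3492 encoding;
    # it performs no case folding and no Unicode normalization.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_domain(domain: str) -> str:
    """Convert each dot-separated label of a domain name in turn."""
    return ".".join(to_ascii_label(label) for label in domain.split("."))


print(to_ascii_domain("納豆.english.sådär.مصر"))
# prints: xn--99zt52a.english.xn--sdr-rlad.xn--wgbh1c
```

Python also ships an "idna" codec that works on whole domain names, but
going label by label keeps the sketch close to the steps listed above.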

  So given my experiences so far, I would say the easiest way to deal with
all this is to make it a client issue.  Hold off on IDN support for now (see
below for some more questions about it), but UTF-8 in the path and query
should be allowed in text/gemini and percent-encoded by the client before
it makes a request.  A client, given a link like:

=> gemini://gemini.bortzmeyer.org:8965/café?foo=bar Order from the Café

should be able to parse it with the UTF-8 characters, but convert it to:

	gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar

before making the request.  At the very least, tools could be developed to
encode links in text/gemini before publishing them if no one wants the spec
to change at all.  
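
As a rough sketch of such a conversion, here is one way to do it in Python
with urllib.parse from the standard library.  The names encode_link and
encode_link_line are hypothetical, the host is deliberately left untouched
(no IDN handling), and "%" is treated as safe on the assumption that any
escapes already in the link are valid and should not be encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in a path; "%" is included so that
# existing percent-escapes are passed through unchanged.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"


def encode_link(link: str) -> str:
    """Percent-encode non-ASCII (as UTF-8) in the path and query only."""
    parts = urlsplit(link)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # host and port are not modified here
        quote(parts.path, safe=PATH_SAFE),
        quote(parts.query, safe=QUERY_SAFE),
        quote(parts.fragment, safe=QUERY_SAFE),
    ))


def encode_link_line(line: str) -> str:
    """Encode the URL of a text/gemini link line; pass other lines through."""
    if not line.startswith("=>"):
        return line
    fields = line[2:].split(maxsplit=1)  # URL, then optional label
    if not fields:
        return line
    return "=> " + " ".join([encode_link(fields[0])] + fields[1:])


print(encode_link("gemini://gemini.bortzmeyer.org:8965/café?foo=bar"))
# prints: gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar
```

The same encode_link() could sit in a client just before the request is
sent, or encode_link_line() could be run over a file before publishing.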

I feel that would be the easiest, least breaking thing to do now.  Making
IDN (punycode) mandatory might require a bit more discussion as I'm not sure
of the language support.  I'm not even sure what name should be in a
certificate for an IDN---the full UTF-8 version, or the punycode version, or
both?  What's currently done in HTTP land about this? (answering this will
at least point in a direction, even if we don't want to go that direction).

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[3]	The domain name "gemini.conman.org" has three labels, "gemini",
	"conman" and "org".  The term "label" is DNS lingo.

