[spec] IRIs, IDNs, and all that international jazz

Sean Conner sean at conman.org
Wed Dec 23 22:21:58 GMT 2020


It was thus said that the Great bie once stated:
> 
> Should have specified the language (C), too. I'm not going to be pulling
> in perl, and writing a full-fledged IRI parser from scratch in C sounds
> profoundly uncomfortable.

  So what library or code are you using now to parse URIs?

  When I wrote my IRI parser [1] I took my existing URL parser [2], and just
  changed the unreserved rule:

	ASCII:	ALPHA / DIGIT / '-' / '.' / '_' / '~'
	UTF-8:	ALPHA / DIGIT / '-' / '.' / '_' / '~' / utf8

where 'utf8' is any character 128 or higher.  I didn't bother with
restricting the private UCS set to the query because sometimes I think RFC
authors are more concerned with theory [3] than with practice, and
complicate things as a result.
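
  For what it's worth, here is a minimal sketch of how that relaxed rule
might look in C.  It is not a translation of my LPeg code; the names
is_iunreserved() and span_iunreserved() are made up for illustration, and
it assumes the input is already valid UTF-8, treating every byte of 128 or
higher as acceptable without validating the sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte is allowed by the relaxed
 * "unreserved" rule (ALPHA / DIGIT / '-' / '.' / '_' / '~' / utf8).
 * Assumption: any byte >= 0x80 is part of a UTF-8 sequence and is
 * accepted as-is; no UTF-8 validation is done here. */
static bool is_iunreserved(unsigned char c)
{
  if (c >= 0x80)             return true;   /* utf8  */
  if (c >= 'A' && c <= 'Z')  return true;   /* ALPHA */
  if (c >= 'a' && c <= 'z')  return true;   /* ALPHA */
  if (c >= '0' && c <= '9')  return true;   /* DIGIT */
  return c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: number of leading bytes of s that match the
 * relaxed unreserved rule. */
static size_t span_iunreserved(const char *s)
{
  size_t n = 0;
  while (s[n] != '\0' && is_iunreserved((unsigned char)s[n]))
    n++;
  return n;
}
```

  The only difference from the plain ASCII version is the first test in
is_iunreserved(); an existing hand-written URI parser that classifies bytes
this way would need a change in just that one spot.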

  Now the conversion of a domain name to punycode on the other hand ... I
left that to libidn.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[3]	A charitable way of saying "smoking crack."  I mean, RFC-822
	(written in 1982) allows:

		"Look!  I'm smoking some good stuff" (no really its good) @
		berkeley (in California) . edu

	as a valid email address!  (spaces and all) No, really, look it up!

