[spec] IRIs, IDNs, and all that international jazz
Sean Conner
sean at conman.org
Wed Dec 23 22:21:58 GMT 2020
It was thus said that the Great bie once stated:
>
> Should have specified the language (C), too. I'm not going to be pulling
> in perl, and writing a full-fledged IRI parser from scratch in C sounds
> profoundly uncomfortable.
So what library or code are you using now to parse URIs?
When I wrote my IRI parser [1] I took my existing URL parser [2], and just
changed the unreserved rule:
ASCII: ALPHA / DIGIT / '-' / '.' / '_' / '~'
UTF-8: ALPHA / DIGIT / '-' / '.' / '_' / '~' / utf8
where 'utf8' is any character 128 or higher. I didn't bother with
restricting the private-use characters to the query (as RFC 3987 does)
because sometimes I think RFC authors are more concerned with theory [3]
than with practice and complicate things.
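Since the question was about C, here is a rough sketch of what that one-rule
change could look like there. It is not taken from the LPeg parsers linked
below; the function names are made up for illustration, and it assumes
'utf8' is tested per byte (anything 0x80 or above) with no check that the
bytes form well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if byte c is allowed by the relaxed
   "unreserved" rule, 0 otherwise. Explicit ranges are used instead of
   isalpha()/isdigit() so the result does not depend on the locale. */
static int is_iunreserved(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;                       /* ALPHA */
    if (c >= '0' && c <= '9')
        return 1;                       /* DIGIT */
    if (c == '-' || c == '.' || c == '_' || c == '~')
        return 1;                       /* the four punctuation marks */
    return c >= 0x80;                   /* 'utf8': any byte 128 or higher
                                           (assumption: not validated) */
}

/* Hypothetical helper: length in bytes of the leading run of
   unreserved bytes in the NUL-terminated string s. */
static size_t span_iunreserved(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0' && is_iunreserved((unsigned char)s[n]))
        n++;
    return n;
}
```

Dropping the last line of is_iunreserved() gives back the plain ASCII rule,
which is about all the difference there is between the two parsers at this
level.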
Now the conversion of a domain name to punycode on the other hand ... I
left that to libidn.
-spc
[1] https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua
[2] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua
[3] A charitable way of saying "smoking crack." I mean, RFC-822
(written in 1982) allows:
"Look! I'm smoking some good stuff" (no really its good) @
berkeley (in California) . edu
as a valid email address! (spaces and all) No, really, look it up!