Three possible uses for IRIs
Sean Conner
sean at conman.org
Tue Dec 8 08:36:20 GMT 2020
It was thus said that the Great colecmac at protonmail.com once stated:
>
> The fact that writing URLs for non-English languages is difficult sucks. But
> due to the complexity, and most of all the fact that this would be a breaking
> change, I don't see IRIs as an option.
I thought I might see what's involved with handling IRIs.
The actual differences between RFC-3986 (URI) and RFC-3987 (IRI), besides
one being a standard (URI) and one being a proposed standard (IRI), come
down to the characters that are accepted---the unreserved set
	unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
becomes
	iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
	ucschar     = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
	            / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
	            / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
	            / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
	            / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
	            / %xD0000-DFFFD / %xE1000-EFFFD
and the query portion changes from
	query = *( pchar / "/" / "?" )
	pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
to
	iquery   = *( ipchar / iprivate / "/" / "?" )
	ipchar   = iunreserved / pct-encoded / sub-delims / ":" / "@"
	iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
and that's it as far as the RFCs go (aside from the rule name changes). As
a quick proof-of-concept, I just accepted all non-control UTF-8 characters
as unreserved (including the private space) as that was the easiest thing to
do, and yes, it works (but does allow potentially bad IRIs through).
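To make the gap between the quick check and the RFC rules concrete, here is
a rough Python sketch (not the Lua parser from [1]). The range tables are
copied from the ucschar and iprivate rules above; the function names, and
my reading of "non-control" as "not in Unicode category Cc", are
assumptions made for illustration only.

```python
import unicodedata

# Code point ranges copied from the ucschar and iprivate rules of RFC 3987.
UCSCHAR = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch, ranges):
    """True if the single character ch falls inside any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_ucschar(ch):
    """Strict check: allowed anywhere iunreserved is allowed."""
    return _in_ranges(ch, UCSCHAR)


def is_iprivate(ch):
    """Strict check: allowed only in the query portion."""
    return _in_ranges(ch, IPRIVATE)


def is_loose(ch):
    """Hypothetical stand-in for the quick check: any non-ASCII character
    that is not a control character (Unicode category Cc)."""
    return ord(ch) >= 0x80 and unicodedata.category(ch) != "Cc"
```

With these definitions, "\u00e9" passes both the loose and the strict
check, while a private-use character such as "\ue000" or a noncharacter
such as "\ufdd0" passes only the loose one. Those are the sort of
potentially bad IRIs the quick version lets through.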
But the code to *build* a URL from the parsed representation [2] assumes
US-ASCII. Again, it would take just a few small changes to allow UTF-8
characters on input and escape them properly for a URL. That's something
I'll try working on tomorrow.
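As a sketch of what "escape them properly" could look like (in Python
rather than the Lua of [2], and only for the non-host parts of the URL):
encode each non-ASCII character as UTF-8 and percent-encode the resulting
bytes. The function name and the choice of characters to leave alone are
my assumptions, not anything taken from url-util.lua.

```python
from urllib.parse import quote

# Left untouched: the RFC 3986 reserved characters, plus "%" so that
# escapes already present in the input are not encoded a second time.
# quote() already leaves ALPHA / DIGIT / "-" / "." / "_" / "~" alone.
SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_iri_component(text):
    """Percent-encode everything outside the URI character set.

    Non-ASCII characters are converted to UTF-8 first, then each byte
    becomes %XX. Meant for path, query and fragment, not the host.
    """
    return quote(text, safe=SAFE, encoding="utf-8")


print(escape_iri_component("/caf\u00e9/\u65e5\u672c\u8a9e?q=1"))
# prints /caf%C3%A9/%E6%97%A5%E6%9C%AC%E8%AA%9E?q=1
```

Treating "%" as safe is a shortcut: a stray "%" that is not part of a
valid escape passes through unchanged, so real code would want to
validate existing escapes rather than trust them.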
That still leaves the question of punycode [3] and Unicode normalization
(ugh).
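For the host part, a minimal sketch of the two steps in Python: normalize
to NFC, then punycode each non-ASCII label and add the "xn--" prefix. The
function name is hypothetical, and this deliberately skips the rest of
IDNA (case mapping, prohibited characters, length limits), so treat it as
an illustration and not a conforming implementation.

```python
import unicodedata


def host_to_ascii(host):
    """Convert a Unicode host name to its ASCII (punycode) form.

    Assumption: NFC is the normalization form wanted here. Full IDNA
    processing does more than this.
    """
    labels = []
    for label in unicodedata.normalize("NFC", host).split("."):
        if label.isascii():
            labels.append(label)
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    return ".".join(labels)


# The same name typed two ways: precomposed "u with diaeresis", and "u"
# followed by a combining diaeresis. Normalization makes them agree.
print(host_to_ascii("b\u00fccher.example"))   # prints xn--bcher-kva.example
print(host_to_ascii("bu\u0308cher.example"))  # prints xn--bcher-kva.example
```

Without the normalization step the second spelling would punycode to a
different label, which is why the two problems end up tangled together.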
> P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and
> primarily use my native language English.
I am biased too, as a monolingual US mutt, but I do want to try this stuff
out.
-spc
[1] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua
[2] https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua
[3] RFC-3492, which includes C code to encode and decode punycode text,
which is valgrind clean (I checked).