Three possible uses for IRIs

Sean Conner sean at conman.org
Tue Dec 8 08:36:20 GMT 2020


It was thus said that the Great colecmac at protonmail.com once stated:
> 
> The fact that writing URLs for non-English languages is difficult sucks. But
> due to the complexity, and most of all the fact that this would be a breaking
> change, I don't see IRIs as an option.

  I thought I might see what's involved with handling IRIs.

  The actual differences between RFC-3986 (URI) and RFC-3987 (IRI), besides
one being a standard (URI) and one being a proposed standard (IRI), come
down to the characters that are accepted---the unreserved set

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

becomes

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

and the query portion changes from

   query         = *( pchar / "/" / "?" )
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"


to

   iquery         = *( ipchar / iprivate / "/" / "?" )
   ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@" 
   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

and that's it as far as the RFCs go (aside from the rule name changes).  As
a quick proof-of-concept, I changed my URL parser [1] to accept all
non-control UTF-8 characters as unreserved (including the private space), as
that was the easiest thing to do, and yes, it works (but it does allow
potentially bad IRIs through).
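
  For anyone wanting to be stricter than that proof-of-concept, here is a
rough sketch in Python (not the LPeg code itself) of a per-character check
built from the ucschar and iprivate ranges quoted above.  The function names
is_ucschar(), is_iprivate() and is_iunreserved() are made up for
illustration.

```python
import string

# ASCII part of iunreserved: ALPHA / DIGIT / "-" / "." / "_" / "~"
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# ucschar ranges from RFC 3987.  Planes 1 through 13 all follow the same
# xxxx0000-xxxxFFFD pattern, so they are generated rather than typed out.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR.append((0xE1000, 0xEFFFD))

# iprivate ranges from RFC 3987 (only allowed in the query component).
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch, ranges):
    """Return True if the single character ch falls in any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ucschar(ch):
    return _in_ranges(ch, UCSCHAR)


def is_iprivate(ch):
    return _in_ranges(ch, IPRIVATE)


def is_iunreserved(ch):
    return ch in ASCII_UNRESERVED or is_ucschar(ch)
```

  With this, a query parser would accept is_iunreserved(ch) or
is_iprivate(ch), while every other component would accept only
is_iunreserved(ch), which keeps private-use characters and the C1 control
range (U+0080 to U+009F) out of paths and host names.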

  But the code to *build* a URL from the parsed representation [2] assumes
US-ASCII.  Again, it would take just a few small changes to allow UTF-8
characters on input and escape them properly for a URL.  That's something
I'll try working on tomorrow.
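
  As a rough illustration of that escaping step (in Python rather than the
Lua of [2], and not the actual change), the sketch below percent-encodes the
UTF-8 bytes of every non-ASCII character in a path.  The name
iri_path_to_uri_path() is hypothetical, and the sketch assumes the path has
already been split off from the rest of the IRI.

```python
from urllib.parse import quote

# Characters left untouched, besides the RFC 3986 unreserved set that
# quote() never escapes: sub-delims, ":" and "@" (allowed in pchar), "/"
# to preserve path separators, and "%" so that escapes already present
# are not encoded a second time.  A stray "%" would also pass through.
SAFE = "!$&'()*+,;=:@/%"


def iri_path_to_uri_path(path):
    """Percent-encode everything outside SAFE, using UTF-8 for non-ASCII."""
    return quote(path, safe=SAFE, encoding="utf-8")


print(iri_path_to_uri_path("/caf\u00e9/menu"))  # /caf%C3%A9/menu
```

  Plain ASCII paths come out unchanged, so existing URLs should not be
affected by a conversion along these lines.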

  That still leaves the question of punycode [3] and Unicode normalization
(ugh).
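
  To give an idea of what those two steps look like, here is a small Python
sketch using only the standard library.  It normalizes a host name to NFC
and then converts it with Python's built-in "idna" codec, which implements
the older IDNA 2003 rules (RFC-3490) on top of punycode; whether that is the
right profile for Gemini is an open question, and normalize_host() is a
made-up name.

```python
import unicodedata


def normalize_host(host):
    """Normalize to NFC, then convert each label to its ASCII (xn--) form."""
    nfc = unicodedata.normalize("NFC", host)
    return nfc.encode("idna").decode("ascii")


# Precomposed and decomposed spellings end up as the same ASCII host.
composed = "b\u00fccher.example"     # u-umlaut as a single code point
decomposed = "bu\u0308cher.example"  # "u" followed by a combining diaeresis
print(normalize_host(composed))      # xn--bcher-kva.example
print(normalize_host(composed) == normalize_host(decomposed))
```

  The raw punycode step on its own is available as the "punycode" codec,
so "b\u00fccher".encode("punycode") gives b"bcher-kva", the label above
without its "xn--" prefix.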

> P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and
> primarily use my native language English.

  I am biased too, as a monolingual US mutt, but I do want to try this stuff
out.

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[2]	https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua

[3]	RFC-3492, which includes C code to encode and decode punycode text,
	which is valgrind clean (I checked).

