IDN with Gemini?
Sean Conner
sean at conman.org
Mon Dec 7 23:37:51 GMT 2020
It was thus said that the Great Côme Chilliet once stated:
> Hi,
>
> Some thoughts on answers on the topic of unicode links. (I will focus on
> unicode in path rather than in domain here).
>
> First, I wanted to point out that almost no one uses them on the French
> Web. Some used that as an argument against having unicode in URIs, but I
> think no one uses them because of the punycode and percent encoding
> weirdness.
>
> I read part of the RFC 3987 (IRI) and part of RFC 3986 (URI) and still do
> not understand what is the horrible added complexity you are talking
> about. Could people asserting IRI is a complex hell impossible to
> implement point to the exact problems with IRI?
I'm reading through RFC-3987, and sections 4 and 5 give me pause. Section
4 relates to bidirectional IRIs (right-to-left languages). This is mostly a
client issue (I think), since it concerns how such IRIs are displayed.
Section 5 is the scarier of the two---normalization and comparison, and
would most likely affect servers more than clients (again, I think). There are
two examples given:
http://www.example.org/résumé.html
http://www.example.org/résumé.html
The first uses a precomposed character and the second uses a combining
character. I'm looking at the Unicode normalization standard [1], and the
first thing that struck me was I had *not* thought of the order of multiple
combining characters. Oh, there's also Hangul and conjoining jamo. And
then ... well, I'll spare the horrors of that 32k document, but the upshot
is---yes, that's yet *another* library I have to track down (and update as
the Unicode standard is regularly updated).
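
To make the comparison problem concrete, here is a minimal sketch using
Python's standard unicodedata module. The helper name same_path is made up
for illustration, and the choice of NFC as the comparison form is an
assumption; RFC 3987 section 5 discusses the available options in more
detail. The strings are written with escapes so the two spellings stay
distinct after copy and paste.

```python
import unicodedata

# The two spellings of the example path, written with escapes.
precomposed = "/r\u00e9sum\u00e9.html"    # U+00E9, precomposed e-acute
combining = "/re\u0301sume\u0301.html"    # "e" followed by U+0301 combining acute

def same_path(a: str, b: str) -> bool:
    """Hypothetical helper: compare two paths after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Order of multiple combining marks: dot above (U+0307) and dot below (U+0323)
# attached to the same base letter in either order.
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

raw_equal = precomposed == combining                # False: code points differ
nfc_equal = same_path(precomposed, combining)       # True: same text after NFC
marks_equal = same_path(above_then_below, below_then_above)  # True after reordering

print(raw_equal, nfc_equal, marks_equal)
```

The raw comparison fails even though both strings render identically, so a
server that maps request paths to files byte for byte would treat them as
two different resources unless it normalizes first.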
Also, related question---what's the filename on the server?
The "horrible added complexity" is not RFC-3987 per se, but the "horrible
added complexity" of Unicode normalization that is required. Is that a
valid excuse? Perhaps not. But there *is* the issue that a lot of people
are having with Python 3 and filenames. If you hit a filename that isn't
UTF-8, the Python 3 script breaks badly. Yes, there *is* a link in my
mind between these two issues but I'm not sure I can verbalize it coherently
at this time. Perhaps "I will focus on unicode in the path" reminded me of
the Python 3 issue.
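
A small sketch of the kind of breakage meant here, under the assumption
that the filename arrives as raw bytes encoded in Latin-1 rather than UTF-8
(the byte string below is invented for illustration). Python's
surrogateescape error handler (PEP 383) lets the name round-trip, but the
resulting string still fails as soon as something encodes it strictly.

```python
# Hypothetical on-disk filename: "résumé.html" encoded as Latin-1, not UTF-8.
raw = b"r\xe9sum\xe9.html"

# A strict UTF-8 decode rejects the bytes outright.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogateescape smuggles each undecodable byte through as a lone surrogate,
# so the original bytes can be recovered later.
name = raw.decode("utf-8", errors="surrogateescape")
round_trip = name.encode("utf-8", errors="surrogateescape")

# The string is not valid text, though: a strict encode (for example when
# writing it to a UTF-8 log or response) raises an error.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(strict_decode_ok, round_trip == raw, strict_encode_ok)
```

The same question comes up for a server handling Unicode paths: the request
is text in some normalization form, while the filesystem holds whatever
bytes were written there.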
> Regarding the breaking change argument, I think it is a bit weak, testing
> shows there is no consistency in how different clients/servers handle
> unicode currently.
...
> (Note that these are real non-rhetorical questions, I’m not trying to deny
> that handling IRI would be hard, I’m trying to understand why)
Methinks you inadvertently answered your own question---Unicode is *not*
easy [1][2][3][4].
-spc
[1] https://www.unicode.org/reports/tr15/tr15-50.html
[2] https://www.unicode.org/reports/tr9/tr9-42.html
[3] https://www.unicode.org/reports/tr14/tr14-45.html
[4] Among others. The full current standard:
http://www.unicode.org/versions/Unicode13.0.0/