IDN with Gemini?

Côme Chilliet come at chilliet.eu
Mon Dec 7 17:35:26 GMT 2020


Hi,

Some thoughts on the answers about Unicode links. (I will focus on Unicode in the path rather than in the domain here.)

First, I wanted to point out that almost no one uses them on the French Web. Some used that as an argument against having Unicode in URIs, but I think no one uses them because of the punycode and percent-encoding weirdness.

I read part of RFC 3987 (IRI) and part of RFC 3986 (URI), and I still do not understand what the horrible added complexity you are talking about is.
Could the people asserting that IRI is a complex hell that is impossible to implement point to the exact problems with IRI?

Here is the life cycle of a link in a page:

1 - The author writes it
2 - The server saves it
3 - A client requests the page from the server
4 - The server sends it
5 - The client displays it
6 - The user clicks it
7 - The client resolves the hostname
8 - The client sends it as a request to the server
9 - The server fetches the associated page

I think we can safely assume that the author will not write percent encoding without help.

So, with bie's suggestion that clients and servers help by percent-encoding while the author/user only has to deal with Unicode, it means:
1 - Somewhere between step 1 and step 4, the server has to percent-encode the link
2 - Somewhere between step 4 and step 5, the client needs to decode it
3 - In step 8, either the client has stored the encoded link or it has to re-encode it; if someone copy/pastes the link, they have to re-encode it too
4 - In step 9, the server needs to decode it to get the real target path

If we just use the UTF-8 path all along, points 1 through 3 are not needed. Point 4 still is, because some links will still be percent-encoded and the server needs to understand them.
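
To make the comparison concrete, here is a minimal sketch in Python of the round trip described above. It assumes the standard library's urllib.parse is an acceptable stand-in for whatever a real client or server would use; the path "/雑念/" is borrowed from bie's example and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

path = "/雑念/"  # what the author actually wrote

# Point 1: the server percent-encodes the link before sending it.
# quote() encodes as UTF-8 and keeps "/" unescaped by default.
encoded = quote(path)

# Point 2: the client decodes it so the user sees readable text.
displayed = unquote(encoded)

# Point 3: the client re-encodes it to build the request.
requested = quote(displayed)

# Point 4: the server decodes the request to find the real target path.
resolved = unquote(requested)

# With a UTF-8 path all along, only point 4 remains, and the same
# unquote() call leaves a raw UTF-8 path untouched.
resolved_from_raw = unquote(path)

print(encoded)                        # /%E9%9B%91%E5%BF%B5/
print(resolved == resolved_from_raw)  # True
```

Both variants end with the same decoded path on the server; the difference is only in how many conversions happen on the way.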

> Petite Abeille <petite.abeille at gmail.com>:
> No, but the internet plumbing is de facto US-ASCII. 

If this is true, why bother with responses in UTF-8?

Regarding the breaking-change argument, I think it is a bit weak: testing shows there is no consistency in how different clients and servers currently handle Unicode.

> bie:
> I actually implemented this in my personal gemini server
> today, and it was a trivial change (especially when compared to what I'd
> have to do to properly validate IRIs...), allowing me to write "=> 雑念/
> 雑念" and have it sent to the client as "=> %e9%9b%91%e5%bf%b5/ 雑念".

If you are all this comfortable with links that look like «%e9%9b%91%e5%bf%b5», let’s go the whole way and percent-encode ASCII as well.
Let’s see how long it takes before you change your mind after using this kind of stuff on a daily basis. And at least this would put all languages on the same footing.
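
For illustration, here is a rough Python sketch of both ideas: rewriting a gemtext link line the way bie describes, and percent-encoding ASCII as well. This is not bie's actual code; encode_link_line and encode_everything are hypothetical helpers written for this example, and the list of "safe" characters is an assumption about what a server would want to leave alone.

```python
from urllib.parse import quote

def encode_link_line(line: str) -> str:
    """Percent-encode the URL part of a gemtext link line ("=> URL label")."""
    if not line.startswith("=>"):
        return line
    parts = line[2:].split(maxsplit=1)
    if not parts:
        return line
    # Keep URI delimiters and existing "%" escapes; encode everything non-ASCII.
    url = quote(parts[0], safe="/:?#[]@!$&'()*+,;=%")
    label = parts[1] if len(parts) > 1 else ""
    return f"=> {url} {label}".rstrip()

def encode_everything(text: str) -> str:
    """Percent-encode every byte of the UTF-8 form, ASCII included."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))

print(encode_link_line("=> 雑念/ 雑念"))  # => %E9%9B%91%E5%BF%B5/ 雑念
print(encode_everything("docs/"))        # %64%6F%63%73%2F
```

Python emits uppercase hex digits, which are equivalent to the lowercase ones quoted above. The second output is what an English-language path would look like if it were treated the same way.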
	
> colecmac at protonmail.com:
> Supporting IRIs is *not* simple. For example, in Python it requires a
> third-party library[1], and in Go I wasn't even able to find one. This
> means that in many programming languages, no one would be able to even
> begin writing a Gemini client before writing a library that parses and
> conforms to the complex specification that is IRIs.

On the server I wrote in PHP, getting a request in UTF-8 worked without me doing anything for it. Not accepting IRIs would actually require me to *add* code, it seems. (Again, I might have missed a whole lot of IRI edge cases.)
Does it mean that in these languages they are actively checking for non-ASCII characters? Or are they using a string format which is not compatible with UTF-8?
They need to speak UTF-8 for the response anyway.
I get that *validating* an IRI might be hard, but is it the job of the server to validate it? I just use whatever input is thrown at me and suppose it is valid.
(Note that these are real, non-rhetorical questions. I’m not trying to deny that handling IRIs would be hard, I’m trying to understand why.)
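
As a point of comparison, here is a minimal sketch of the "accept whatever comes in" approach in Python, using only the standard library. It does not parse or validate IRIs in the RFC 3987 sense, so it does not contradict the point about needing a third-party library for that; request_to_path is a hypothetical helper and example.org is a placeholder host.

```python
from urllib.parse import unquote, urlsplit

def request_to_path(raw_request: bytes) -> str:
    """Turn a Gemini request line into a decoded path, without IRI validation."""
    # A Gemini request is a URL followed by CRLF; decode it as UTF-8, not ASCII.
    url = raw_request.decode("utf-8").rstrip("\r\n")
    # urlsplit() only splits on delimiters, so a non-ASCII path passes through.
    path = urlsplit(url).path
    # unquote() decodes percent-encoded links and leaves raw UTF-8 untouched.
    return unquote(path)

raw = "gemini://example.org/雑念/\r\n".encode("utf-8")
escaped = b"gemini://example.org/%E9%9B%91%E5%BF%B5/\r\n"
print(request_to_path(raw) == request_to_path(escaped))  # True
```

Under these assumptions both request forms map to the same path, so the extra work would be in rejecting the raw UTF-8 form, not in accepting it.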

(On a more general note, I guess everyone understood that English is not my mother tongue; sorry if I’m being rude or something like that, I’m not trying to be. I just really believe using UTF-8 here would be better, but I understand there are complex technical questions involved.)

MCMic



