Unicode vs. the World

Katarina Eriksson gmym at coopdot.com
Fri Dec 18 06:13:24 GMT 2020


Help! I'm getting pulled in!

On IRC, I wrote:

[2020-12-14T21:26:44Z] <CoopDot> I'm staying out of debating IDN/IRI on the
ML. What I've had to say has already been said more than once. My position
have even shifted a bit since the threads started

Some discussion happened and then I wrote something that got quoted here:


Petite Abeille <petite.abeille at gmail.com> wrote:

For example:

[2020-12-14T22:12:14.914Z] <remyabel> I lurk this channel and the mailing
lists and keep seeing people trying to extend gemini or make it web-like,
there's just no point in arguing against it
[2020-12-14T22:12:28.578Z] <CoopDot> I used to be in the US-ASCII only camp
but now it's more "do the bare mininum to not forbid UTF-8 'URLs' in the
spec and make strong recommendations in best-practices.gmi"

^Those are the "cannot be arsed" camp: things are the way they are, and
cannot be bothered to changed anything, technically speaking... we are
done. The "not-my-problem" camp.


I'm assuming including me here was intentional. I truly can't tell if that
is an accurate description of my position.

"I used to be in the US-ASCII only camp" refers to me no longer thinking
requiring everything to be encoded to pass as US-ASCII is the best idea.
This is me moving away from the status quo towards a possible compromise.
Or am I missing where we're going?

"Do [...] not forbid UTF-8 'URLs' in the spec". Not forbidding is almost
like allowing. We should try not to paint ourselves into a corner or
bet on the wrong horse. 🐴

"Make strong recommendations in best-practices.gmi" because we have to
address it somewhere.

Earlier in the same email, Petite Abeille <petite.abeille at gmail.com> wrote:

It boils down to this:

 => gemini://🐰.mozz.us/🐇.gmi 🥕Hoppity hop🥕

What do do with such a construct? Possible? Not possible? Allowed? Not
allowed? First class citizen? Afterthought? How do deal with it, if at all?


[...]

=> gemini://🐰.mozz.us/🐇.gmi 🥕Hoppity hop🥕
vs.
=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi 🥕Hoppity hop🥕

As it stands, the first variant cannot be handled by gemini -neither in
text/gemini, nor in the protocol itself- with further technical gotchas
such as address resolution and what not along the way.

It must be converted to the second variant, the US-ASCII one.


Let's examine the situation: 🚲

The capsule author writes this link line in their text editor:

=> gemini://🐰.mozz.us/🐇.gmi 🥕Hoppity hop🥕

The text editor may or may not change the syntax highlighting to indicate an
error with the URL. When saving the file, the text editor has an
opportunity to "correct" the error by itself.

Let's say the text editor is oblivious and the capsule author doesn't run
the file through a linter. The file is ready to be served.

A visitor requests the file. The server has an opportunity to scan the file
before serving, but that would in most cases be a complete waste of
resources, so it doesn't.

The client parses the file. It has a choice to render the link line as a
link or as text. (It could also break at the first sight of a bunny, but
let's assume it doesn't.) The link is only a problem if the visitor
follows it.

At this point, it doesn't matter if the visitor follows a link or writes
the URL in the address bar. The client has a choice to translate or not
translate the URL before making the request.

Domain name resolution is outside the scope of the Gemini specification,
so we don't know if it can handle UTF-8 or not. If the visitor's network
administrator has set up name resolution to accept UTF-8, they should
probably also accept the punycoded version for compatibility.

Let's assume "always punycode" is a safe option. The client then has a
choice: be proactive and do the translation, or ignore it and let it fail
if it will. I say both options are valid and the Gemini specification
should at most refer to other specifications on this. (The third option,
to just refuse to connect, is bad.)
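
For the "proactive" option, here is a minimal sketch of what the client-side
translation could look like. The function name to_ascii_host is made up for
illustration, and the sketch only applies the raw Punycode encoding to each
non-ASCII label; it skips the mapping and validation steps that a full IDNA
implementation would perform.

```python
def to_ascii_host(host: str) -> str:
    """Convert each non-ASCII label of a hostname to its xn-- form.

    Hypothetical helper: uses only the RFC 3492 encoding from Python's
    "punycode" codec, with no IDNA mapping or validation.
    """
    labels = []
    for label in host.split("."):
        if label.isascii():
            # ASCII labels pass through; hostnames are case-insensitive.
            labels.append(label.lower())
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_ascii_host("🐰.mozz.us"))  # xn--4o8h.mozz.us
```

A client that prefers the "let it fail" option would simply skip this step
and hand the hostname to the resolver unchanged.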

Moving on: (We will go back later.)

We have the IP address and the request has reached the server. Let's assume
this is over the regular internet and a punycoded domain is a must.

The server compares "xn--4o8h.mozz.us" with whatever virtual hosts the
server administrator has set up in the configuration file. Is it
unreasonable for the administrator to expect the server software to match
"🐰.mozz.us" in the configuration file to "xn--4o8h.mozz.us" coming in
over the wire?

How about the other way around? It's a local network, ASCII
non-conforming bunnies hop into the server, and the administrator has only
specified the punycode in the configuration file. Is it unreasonable to
expect it to match?
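
One way a server could make both directions match is to reduce the
configured name and the requested name to the same form before comparing.
This is only a sketch under that assumption; canonical_host and find_vhost
are hypothetical names, not taken from any existing Gemini server, and the
conversion again uses the bare Punycode encoding without IDNA validation.

```python
from typing import List, Optional


def canonical_host(host: str) -> str:
    """Reduce a hostname to lowercase ASCII so both spellings compare equal."""
    out = []
    for label in host.rstrip(".").split("."):
        if not label.isascii():
            # Bare RFC 3492 encoding; no IDNA mapping or validation.
            label = "xn--" + label.encode("punycode").decode("ascii")
        out.append(label.lower())
    return ".".join(out)


def find_vhost(requested: str, configured: List[str]) -> Optional[str]:
    """Return the configured virtual host matching the request, if any."""
    wanted = canonical_host(requested)
    for name in configured:
        if canonical_host(name) == wanted:
            return name
    return None


# Bunny in the config, punycode on the wire:
print(find_vhost("xn--4o8h.mozz.us", ["🐰.mozz.us"]))
# Punycode in the config, bunny on the wire:
print(find_vhost("🐰.mozz.us", ["xn--4o8h.mozz.us"]))
```

In both cases the lookup returns the name exactly as the administrator
wrote it in the configuration.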

Reasonable or not, let's assume the virtual host is set up properly and go
back in time to the client making the request. What do we do about the path?

Should the client "help" the visitor by %-encoding non-ASCII bytes or send
it as is and hope for the best?

Should the client %-encode reserved characters the visitor writes in the
address bar or let them fail?
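
If the client does decide to "help", the encoding step itself is small. A
sketch using Python's standard library, assuming the path is UTF-8 text;
encode_path is a made-up name, and keeping "%" untouched is one possible
answer to the question above, chosen here so that input which is already
%-encoded is not encoded twice.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode a URL path typed or clicked by the visitor.

    "/" stays as the segment separator and "%" is left alone, so an
    already-encoded path passes through unchanged. Non-ASCII characters
    are encoded as their UTF-8 bytes.
    """
    return quote(path, safe="/%")


print(encode_path("/🐇.gmi"))            # /%F0%9F%90%87.gmi
print(encode_path("/%F0%9F%90%87.gmi"))  # /%F0%9F%90%87.gmi
```

The trade-off is that a literal "%" in a file name would have to be typed as
"%25" by the visitor.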

Anyway, the request reaches the server. "%20" becomes space and "%2b"
becomes plus. I see no reason why it would be hard to also convert
"%F0%9F%90%87" into bytes, so I will assume it isn't and wait for server
software programmers to tell me how wrong I am.

So now we have a string of bytes that we can use to fetch the bunny file.
Wait. What happened with the case where the bunny isn't %-encoded? Why
can't servers just blindly accept non-ASCII bytes as is? Is it a library
thing? Anyway, I really should test this in a bunch of languages but I'm
writing this on my phone on my way to work, so instead I present you this
pseudo code:

```
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87".url_decode()
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87"
"\xF0\x9F\x90\x87" == "🐇"
```

If these 3 lines are all true for the server software, I see no reason to
%-encode those non-ASCII bytes in the client or anywhere else. Surely I
have missed something obvious somewhere. Can anyone help me?
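
For what it's worth, the three checks can be written out against Python's
standard library as a quick test. This says something about how
urllib.parse behaves, not about any particular Gemini server or about
libraries in other languages.

```python
from urllib.parse import unquote_to_bytes

encoded = "%F0%9F%90%87"    # the %-encoded bunny
raw = b"\xF0\x9F\x90\x87"   # the same bunny as raw UTF-8 bytes

checks = [
    # Decoding raw bytes leaves them alone, so both inputs end up equal.
    unquote_to_bytes(encoded) == unquote_to_bytes(raw),
    # Decoding the %-encoded form yields exactly the raw bytes.
    unquote_to_bytes(encoded) == raw,
    # Those bytes are the UTF-8 encoding of the bunny.
    raw.decode("utf-8") == "🐇",
]
print(checks)  # [True, True, True]
```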

Maybe I just need coffee... ☕

-- 
Katarina
(Please regard these ramblings as non-rhetorical)