[spec] IRIs, IDNs, and all that international jazz

Solderpunk <solderpunk at posteo.net>
Tue Dec 22 15:13:06 GMT 2020


Hi folks,

Okay, I'm finally getting involved in this discussion.  Sorry it took
me a while, and thanks for your patience.  Here's a characteristically
long email detailing how my thinking on this front has evolved in
just the past few days, starting a new thread with the [spec] topic tag.

My a priori thoughts when it became clear that this discussion was
turning into a major issue, but before I had delved into any details,
were something like this:

"Good support for arbitrary languages in Gemini is *important* and
worth putting up with a little bit of pain for.  This is the reason
the `lang` parameter was defined for the text/gemini media type,
because a text encoding alone is not sufficient for a client to know
to do what native speakers of some languages expect (like render
text right to left).  As weird and foreign as this stuff might seem
to a lot of people, only one (English) of the ten most widely spoken
languages in the world (and this doesn't change whether you count
only native speakers or all speakers) can be properly represented
in ASCII, so bailing on unicode support when it seems too difficult is
very hard to justify, and we should try to do the right thing.
That said, there obviously has to be an upper limit on complexity.
Hopefully we can strike a good balance..."

At this point, I'll also add that it was obviously my intention from
the very early days that internationalised URLs "just work" in Gemini.
The clue to this is that the spec defines Gemini requests in terms
of "UTF-8 encoded URLs".  Now that I'm a little wiser about these
things I realise that URIs (and hence URLs) by definition contain only
characters which are encoded identically in UTF-8 and ASCII, so that
"UTF-8 encoded URL", while not a contradiction of any sort, is not
a particularly powerful concept and does nothing to achieve i18n.
But I was certainly naively hoping that it did.  In my ideal world,
something like an IRI would absolutely work in Gemini with a minimum
of fuss.
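
To make that concrete: the i18n heavy lifting happens before you ever
have a URL, in the percent-encoding step.  A quick Python sketch (the
filename is made up):

>>> from urllib.parse import quote, unquote
>>> quote("/räksmörgås.gmi")      # IRI path -> URL path
'/r%C3%A4ksm%C3%B6rg%C3%A5s.gmi'
>>> unquote('/r%C3%A4ksm%C3%B6rg%C3%A5s.gmi')
'/räksmörgås.gmi'

The encoded form is pure ASCII, whose bytes are identical in UTF-8 -
which is all that "UTF-8 encoded URL" ever actually guaranteed.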

Anyhow, the other night I read RFC 3987.  Not word for word, mind you,
but more than a casual skim.  At which point my thoughts became:

"Why on Earth is everybody on the ML banging on about punycode this
and normalisation that?  None of that would be relevant for Gemini.
That complexity is only required to transform IRIs into URIs, which
is a workaround for legacy software, document formats and protocols
which can't handle IRIs directly.  Gemini isn't legacy - if we did a
`s/URL/IRL/g` on the spec, we could just pass around UTF-8 encoded
IRLs without any of this hassle and things would just work.  The spec
already [somewhat mistakenly: see above] makes it clear that UTF-8 is
to be expected in requests.  This is a trivial change, not breaking
at all, let's just do it.

Of course, conversion of IDNs to punycode for the sake of DNS lookups
would still be required because we can't change the reality of deployed
DNS infrastructure, but it's insane to think this is the responsibility
of every individual client author; it's up to operating systems and
standard libraries to abstract this away.  Surely they already do this?
Let's check...

Python 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.getaddrinfo("räksmörgås.josefsson.org", 1965)
[(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2001:9b1:8633::102', 1965, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('2001:9b1:8633::102', 1965, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('2001:9b1:8633::102', 1965, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('178.174.241.102', 1965)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('178.174.241.102', 1965)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('178.174.241.102', 1965))]

Yep, great, wonderful, Python does the punycode stuff for me invisibly,
this adds no extra complexity at all!"
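
(For reference, the punycoding Python does invisibly there can also be
done explicitly, using the standard library's IDNA codec:

>>> "räksmörgås.josefsson.org".encode("idna")
b'xn--rksmrgs-5wao1o.josefsson.org'
>>> b'xn--rksmrgs-5wao1o.josefsson.org'.decode("idna")
'räksmörgås.josefsson.org'

so even where name lookups don't handle it transparently, the building
block may be close at hand.)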

At this point I was, in my private thoughts, a pretty hardcore IRI
advocate, and didn't understand why anybody wouldn't be.

Then I did a little more experimenting and realised that DNS lookups
in Go don't transparently handle the punycoding like Python does, and I
was quite disappointed in Go for that.  Then I started reading through
all the mailing list posts, and realised that people weren't even upset
so much about punycoding IDNs as they were about processing IRIs to
e.g. absolutise relative IRIs or add queries.  This was considered to
require complex third party libraries in most languages.  I was kind
of baffled by this because doing this kind of operation with IRIs is
not substantially different from doing it with URIs (as Sean has shown
by actually implementing it) and I couldn't believe that something
so trivial wouldn't be well handled by standard libraries in 2020
(and, actually, based on some people's posts to the ML it seems like
it often is).  At this point my attitude became:

"Wow, the uptake of these standardised i18n tools in major programming
languages is nothing short of embarrassing.  I would be in favour
of defining Gemini as using IRLs not URLs, but when e.g. clients
written in Go fail to "just work" with these, we do not blame the
client authors and ask them to move mountains to work around the
deficiencies of their standard libraries, but blame the language
implementers.  Over time, surely, the existing DNS and URI libraries
will all be updated to follow the new standards, and those "broken"
clients will suddenly become "working" clients without their authors
even having to do anything.  It's unfortunate that there will be a
transitional period where the Gemini spec is somewhat "aspirational"
and some clients necessarily fall short due to the failings of others,
but that's better than leaving things as is and having Gemini be
forever broken with regards to internationalisation."
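
As a concrete data point on the absolutisation question: Python's
ordinary URL tools turn out to operate quite happily on strings
containing non-ASCII characters.  A sketch (the gemini:// registration
is a common workaround for the standard library's scheme whitelist;
example.org and the filename are made up):

>>> import urllib.parse
>>> urllib.parse.uses_relative.append("gemini")
>>> urllib.parse.uses_netloc.append("gemini")
>>> urllib.parse.urljoin("gemini://example.org/mat/", "räksmörgås.gmi")
'gemini://example.org/mat/räksmörgås.gmi'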

Then I followed the mailing list threads yet deeper, and reached
the point where Jason pointed out that RFC 3987 is only a proposed
standard, that it has effectively been abandoned by the IETF,
and that now the W3C has its own alternative standardisation of
"URL" under active development, which is "extremely WWW-centric"
(I'm taking Jason's word for this, I haven't actually looked into the
details of this yet).  This completely undermines my attitude above,
because it makes it much less likely that standard libraries will
ever be uniformly upgraded to handle IRIs correctly, and it means
we can't take the simple moral highground of saying that the Gemini
spec is based on IETF standards and it's not our fault if standard
libraries still need to lift their game to reflect those standards.

Now I honestly don't know what to think.  It has always been a core
tenet of Gemini's design that it is made by joining together mature,
widely-implemented IETF standards in simple ways, so that no heavy
lifting is required to build Gemini software in almost any language
on almost any platform because all the parts are "radically familiar".
I'm very reluctant to move away from that ideal; it's one of our core
strengths.  But I also think localisation is important and, within
reason, I buy the argument that there's a moral obligation to at least
seriously try to fix this, and the fact that other technology stacks
like the web have not is no excuse for us to do the same when we have
the opportunity to make a fresh start.  But these two principles are in
hard conflict.  There apparently *are* no mature, widely-implemented
IETF standards to handle non-ASCII URLs.  This sucks, and I really wish
it were otherwise, but we, the Gemini community, are, realistically,
absolutely powerless to change this, no matter how much we might
like to.

But, *something* has to be decided.  All we can really do is be
pragmatic: consider how much pain is required to get some support
for internationalised addressing into Gemini, and consider who has
to bear that pain.  Ideally, we try to minimise the total amount
of pain, and preferentially inflict more pain on software authors
than on content authors (who are not necessarily developers or even
"power users"), and more pain on server authors than on client authors
(it's of more benefit to more people for it to be easy to roll your
own client than for it to be easy to roll your own server).

The options, then, would appear to be:

1. Nothing changes in the spec (except we remove the language
about "UTF-8 encoded URLs" because this is, frankly, a recipe for
misunderstanding).  Gemini runs entirely on URLs using only a subset
of ASCII.  Clients and servers are permitted to be highly "dumb" in
this regard, and no existing software breaks.  Ultimate responsibility
for internationalised links falls to content authors, who are obligated
to fully punycode and percent-encode all their links so they are
valid URIs, and if they do this wrong their links don't work and
it's nobody's fault but their own, and if they don't understand what
any of that even *means* they are forced to use ASCII URLs instead.
Client authors who want to be i18n friendly can visually present these
links as IRIs if they're up to it, and accept IRIs in the address
bar (or equivalent) and encode them before doing name lookups or
sending requests.  This voluntary extra complexity requires being
able to do punycoding and percent-encoding in both the forward and
backward directions.
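
For the sake of illustration, here is the transformation option 1 asks
content authors to perform by hand, produced with Python's standard
tools (räksmörgås.example and the filename are made up):

>>> from urllib.parse import quote
>>> host = "räksmörgås.example".encode("idna").decode("ascii")
>>> "gemini://" + host + quote("/smörgåsbord.gmi")
'gemini://xn--rksmrgs-5wao1o.example/sm%C3%B6rg%C3%A5sbord.gmi'

The second form is what has to appear in the link line; an
i18n-friendly client then has to reverse it for display.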

2. We stick to ASCII-only URLs in Gemini requests, but allow IRIs in
text/gemini and require all clients to be able to suitably encode
IRIs before doing name lookups or sending requests, and to accept
IRIs in the address bar.  Content authors just write their content
in their ordinary editor in an ordinary human-readable way without
knowing what punycode or percent encoding are.  All client authors
need to be able to do punycoding and percent-encoding in the forward
direction only.  If no standard library support for this is available,
these operations need to be done from scratch.
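
A minimal sketch of that forward transformation in Python (iri_to_uri
is a hypothetical helper, and it ignores userinfo, IP literals and
already-encoded input):

import urllib.parse

def iri_to_uri(iri):
    # Punycode the host and percent-encode the rest, yielding an
    # all-ASCII URI ready to be sent as a request.
    parts = urllib.parse.urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host + (":%d" % parts.port if parts.port else "")
    path = urllib.parse.quote(parts.path, safe="/%")
    query = urllib.parse.quote(parts.query, safe="=&%")
    return urllib.parse.urlunsplit((parts.scheme, netloc, path, query, ""))

print(iri_to_uri("gemini://räksmörgås.example/smörgåsbord.gmi"))
# gemini://xn--rksmrgs-5wao1o.example/sm%C3%B6rg%C3%A5sbord.gmi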

3. We treat RFC 3987 as a first-class entity in our world, even if
the IETF has abandoned it.  IRIs are used everywhere, in text/gemini
documents and in requests.  Nobody ever has to do percent encoding
in any direction (beyond what is already required for standard URIs).
The forward punycoding requirement remains as per 2. above.  However,
instead of having to do forward percent encoding, clients now need to
be able to do things like absolutise relative IRIs.  If no standard
library support for this is available, this needs to be done from
scratch - although, note that if standard library support for percent
encoding forward and backwards is present, then the standard library
support for relativising ASCII URLs, which we are basically already
assuming is present everywhere, is sufficient to build this up, so
this is not anywhere near as scary as it seems.  There's an additional
wrinkle here in that unicode normalisation needs to be consistent
between e.g. the client and server's idea of the domain name.
This could, I think, be made entirely the server's responsibility,
by requiring servers to normalise requests in a particular way.
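
The normalisation wrinkle, concretely: two strings which render
identically can differ at the byte level until normalised.  In Python:

>>> import unicodedata
>>> a = "räksmörgås"                     # ä, ö, å as single code points (NFC)
>>> b = unicodedata.normalize("NFD", a)  # base letters + combining marks
>>> a == b, a.encode() == b.encode()
(False, False)
>>> unicodedata.normalize("NFC", b) == a
True

A server which normalised every incoming request to NFC (say) would
treat both spellings as the same resource.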

Obviously, option 1. is preferable from the point of view of a spec
author or a software implementer, but it has to be acknowledged
that it throws international content authors under the bus (it's
true there are such authors on the ML who are happily doing exactly
what this option requires, but we need to acknowledge that people
who can converse in technical English about protocol design on a
mailing list are not a representative sample!).  From the point of
view of international content authors, 2. and 3. are equivalent.
It's true that this problem could be minimised by the availability
of servers which transform text/gemini content on the fly, and it's
true that historically I've been happiest dumping extra complexity
on server authors, but I'm not sure this is ideal - users might move
their content between hosts and suddenly have their links break,
which will seem mysterious to them.

Regarding options 2. and 3., from a strictly conceptual/aesthetic
perspective, 3. is clearly preferable.  It's much nicer not to have to
map back and forth between a user's perspective of what addresses look
like and a machine's perspective, but to use the same representation
for both.  The less client-side munging of what's in a link line,
the better.  And following an *absolute* IRI is actually easier under
option 3. than under option 2, because it preserves the beautifully
simple idea that to follow a link, you just send the corresponding
server exactly what you find in the document, not some transformation
of it or some subpart of it.  A text/gemini link line is, in fact, a
ready-to-use request with a label on it!

But we need to consider the implementation burden.  Both 2. and
3. require exactly the same punycoding before DNS lookup (and I still
hope this will become more and more transparently handled by standard
libraries over time), so it comes down to what's more widely supported
and what's easiest to implement in the absence of support: percent
encoding an IRI to a URI so it can be parsed, possibly absolutised
and then sent over the wire as a purely ASCII request, or parsing
and possibly absolutising an IRI as-is before sending it as UTF-8?

It seems to have been a big point of concern on the ML that IRI parsing
is rarely supported in standard libraries and difficult to implement
from scratch, and that this totally sinks something like option 3.
But it seems to be the case that IRIs can in fact be processed with
standard tools in Python and Go, and sort-of-kinda in Java.  Of course,
that's not everywhere, but the capability doesn't exactly seem rare.
And in any environment where option 2. is easy, it seems to me that
3. could be achieved roughly as easily just by transforming an IRI
to a URI, parsing that and doing absolutisation with the standard URI
tools that we assume exist everywhere, and then translating back to an
absolute IRI in the end before sending the request.  The basic idea is
that transformation from IRI to URI happens as a last resort, only when
necessary, and the transformation is reversed as early as possible.
The extent and kind of transformation required is directly proportional
to how stubbornly ASCII-only the environment is.  There might be some
environments (seemingly Python could be one) where transformation
*never* needs to happen, and that seems better than an approach where
transformation *always* needs to happen.  So option 3. actually seems
within the realm of possibility to me, although I wouldn't want to
put my full weight behind it until some actual testing has taken place.
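
As a sketch of that "transform as a last resort, reverse as early as
possible" idea (absolutise_iri is a hypothetical helper, the filenames
are made up, and the final unquote is naive in that it also undoes any
escapes which were in the link deliberately):

import urllib.parse

urllib.parse.uses_relative.append("gemini")
urllib.parse.uses_netloc.append("gemini")

RESERVED = ":/?#[]@!$&'()*+,;=%"  # RFC 3986 delimiters, plus % itself

def absolutise_iri(base_iri, link):
    # Drop into URI territory only for the join itself...
    base_uri = urllib.parse.quote(base_iri, safe=RESERVED)
    link_uri = urllib.parse.quote(link, safe=RESERVED)
    joined = urllib.parse.urljoin(base_uri, link_uri)
    # ...and come straight back to an IRI afterwards.
    return urllib.parse.unquote(joined)

print(absolutise_iri("gemini://example.org/mat/", "kräftskiva.gmi"))
# gemini://example.org/mat/kräftskiva.gmi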

It's true that this would be a breaking change, although of a different
kind from other breaking changes I've pushed back against in the past.
It's not as if Geminispace would suddenly become impossible to access,
or would split into two totally incompatible subspaces based on the
old and new protocol versions.  Any currently extant Gemini document
which included ASCII-only links would remain perfectly accessible
by old and new clients alike.  So, it's a relatively soft break.
Given the importance of first-class internationalisation support,
it might be worthwhile.

Feedback welcome, especially if I've overlooked anything, which is
certainly possible.  What I'd be most interested in hearing, at this
point, is client authors letting me know whether the standard library
in the language their client is implemented in can straightforwardly:

1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
   technically not URLs at all, you know what I mean) in paths and/or
   domains?
2. Transform back and forth between URIs and IRIs?
3. Do DNS lookups of IDNs without them being punycoded first?  You can
   test this with räksmörgås.josefsson.org.

Getting good data on all three of these questions for a wide range
of languages is necessary to make a well-informed decision here.
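
For concreteness, here is the whole checklist as a Python snippet
(roughly what I ran above - translate it into your client's language
and see what breaks; example.org and the filenames are made up):

import socket
import urllib.parse

# 1. Parse and relativise with non-ASCII present (the gemini://
#    registration works around the stdlib's scheme whitelist):
urllib.parse.uses_relative.append("gemini")
urllib.parse.uses_netloc.append("gemini")
print(urllib.parse.urljoin("gemini://example.org/mat/", "räksmörgås.gmi"))

# 2. URI <-> IRI: Python has no one-shot converter, but quote/unquote
#    and the idna codec are the building blocks:
print(urllib.parse.quote("/räksmörgås.gmi"))
print(urllib.parse.unquote("/r%C3%A4ksm%C3%B6rg%C3%A5s.gmi"))

# 3. DNS lookup of an IDN without punycoding it first:
print(socket.getaddrinfo("räksmörgås.josefsson.org", 1965))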

Cheers,
Solderpunk

