Unicode vs. the World
Sean Conner
sean at conman.org
Wed Dec 16 23:41:27 GMT 2020
It was thus said that the Great Björn Wärmedal once stated:
>
> > but unreasonable for it to have to urlencode the path (a common encoding
> > for which libraries are ubiquitous)?
>
> Because — as I tried to point out — there is no reasonably simple
> heuristic for determining whether a URL is already percent encoded or not.
> And percent encoding a URL that is already percent encoded exchanges all %
> characters with %25. Attempting to punycode a domain name that is already
> punycoded, however, changes nothing at all. No heuristics are needed, the
> client can just punycode everything.
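To illustrate the quoted point, here is a minimal sketch in Python (an assumption for illustration; nobody in this thread named a language). It uses the standard library's urllib.parse.quote and the built-in "idna" codec. The sample path and host name are made up.

```python
from urllib.parse import quote

# Percent-encoding is not idempotent: a second pass re-encodes the "%".
once = quote("/files/a b.txt")    # "/files/a%20b.txt"
twice = quote(once)               # "/files/a%2520b.txt"

def to_punycode(host: str) -> str:
    """Convert a host name to its ASCII (punycode) form via the idna codec."""
    return host.encode("idna").decode("ascii")

# Punycode conversion leaves an already-converted host unchanged.
host_once = to_punycode("björn.example")   # hypothetical host name
host_twice = to_punycode(host_once)

print(once, twice)
print(host_once, host_twice)
```

The second quote() call cannot tell that its input was already encoded, which is the missing heuristic the quoted text refers to.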
I can't say for certain what most clients do, but I'm under the impression
that some (the majority?) use some existing library to parse links. The
specification states that relative links are allowed in text/gemini:
=> ../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt Some 𝒻𝒶𝓃𝒸𝓎 stuff here
but a full URI needs to be sent to the server, so some processing of the
link is required (specifically, section 5.2 of RFC-3986). And existing
libraries help here. The library I'm currently using will parse the above
link into the following structure:
{
path = "../𝒻𝒶𝓃𝒸𝓎.txt"
}
Note how the text has been translated and any percent encoding has been
decoded. Next, the base URL of the page:
gemini://example.com/files/others/
has previously been parsed (because it was needed to retrieve the page
currently being viewed):
{
path = "/files/others/",
port = 1965.000000,
host = "example.com",
scheme = "gemini",
}
The two are then merged into a single reference:
{
path = "/files/𝒻𝒶𝓃𝒸𝓎.txt",
port = 1965.000000,
host = "example.com",
scheme = "gemini",
}
Then, to make a request, this new link is converted into a URI:
gemini://example.com/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt
As you can see, that process has re-encoded the path with percent-encoding.
I would expect that some (the majority?) of clients are doing something
similar to this---decoding the percent-encoding, merging the references,
then re-applying percent-encoding (except for the host, which needs to be
converted to punycode).
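As an illustration only, the decode, merge, re-encode round trip might be sketched in Python as follows. This is not the library mentioned above; the helper names (parse, remove_dot_segments, merge, to_uri) are made up for the example. It handles only relative-path and absolute-path references, ignores queries and fragments, and uses a simplified version of the dot-segment removal in RFC 3986 section 5.2.4.

```python
from urllib.parse import urlsplit, quote, unquote

DEFAULT_PORT = 1965  # Gemini's default port

def parse(url: str) -> dict:
    """Split a URL or relative reference, decoding the path."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": unquote(parts.path),  # percent-encoding is decoded here
    }

def remove_dot_segments(path: str) -> str:
    """Simplified take on RFC 3986 section 5.2.4 for absolute paths."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            if len(out) > 1:      # never pop the leading empty segment
                out.pop()
        elif seg != ".":
            out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")            # result names a directory
    return "/".join(out)

def merge(base: dict, ref: dict) -> dict:
    """Resolve a relative reference against an already-parsed base."""
    merged = dict(base)
    if ref["path"].startswith("/"):
        path = ref["path"]
    else:
        # Drop everything after the last "/" of the base path.
        path = base["path"].rsplit("/", 1)[0] + "/" + ref["path"]
    merged["path"] = remove_dot_segments(path)
    return merged

def to_uri(link: dict) -> str:
    """Re-encode: punycode for the host, percent-encoding for the path."""
    host = link["host"].encode("idna").decode("ascii")
    port = "" if link["port"] in (None, DEFAULT_PORT) else f":{link['port']}"
    return f"{link['scheme']}://{host}{port}{quote(link['path'])}"

base = parse("gemini://example.com/files/others/")
ref = parse("../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83"
            "%F0%9D%92%B8%F0%9D%93%8E.txt")
print(to_uri(merge(base, ref)))
```

One caveat with decoding before merging: a %2F inside a path segment becomes indistinguishable from a real "/" separator, so a more careful client might split the path into segments first and decode each one separately.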
It would be instructive to know how clients are handling this---do they
decode percent-encoded data, merge the base link to the relative link and
re-encode? Or something different?
-spc