[tech] doubts (was Re: [spec] [rfc] SEDR 300 VOLUME I)
Sean Conner
sean at conman.org
Tue Dec 29 23:19:14 GMT 2020
It was thus said that the Great Petite Abeille once stated:
>
> We do not appear to understand the two basic building blocks around which
> the whole of Gemini is constructed: URL and UTF-8. †
>
> Not to mention how they relate to each other.
>
> Exhibit № 1: the path segment discussion
>
> Neither Stephane, who has spent his entire adult life versed in RFC
> literature, nor Sean, a technical master if there is one, nor even
> Google's own Go library, get it right. In 2020.
>
> Path segments have been around since time immemorial.
Nope. The first RFC to specify the format for URLs is RFC-1630 (June
1994), and it has no path segments (as currently defined). It's not until
four years later, with RFC-2396 (August 1998), that we get the first
recognizable path segment (with path parameter).
> They are not optional. It's not a "nice to have". Not understanding them
> betrays a fundamental misunderstanding of what an URL is.
First off, SETTLE DOWN!
Second, just because a feature exists doesn't mean it's actually used.
Here's a task---find ONE example out on the web where %2F has semantic
meaning *other* than a path separator, *in the path segment* of a URL (other
than your own stuff). I've been using the web since the early 90s, and I
don't think this has actually been done. I don't think I've even seen path
parameters (using the ';' sub-delimiter) used! [1]
I suspect no one got "it right" because it's just not a thing (either
encoded slashes, or path parameters) that has actually been needed. Go's
take is probably the best you can find, since it both decodes the path and
keeps the raw path available.
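
To illustrate why keeping both forms matters, here is a minimal Python sketch (not Go's implementation). The function name path_views and the host example.org are made up for the example. It shows that decoding the whole path at once loses the difference between an encoded slash and a real separator, while splitting first and decoding each segment keeps it.

```python
from urllib.parse import urlsplit, unquote

def path_views(url):
    """Return (raw path, naively decoded path, per-segment decoded list)."""
    raw = urlsplit(url).path                 # still percent-encoded
    decoded = unquote(raw)                   # %2F becomes an ordinary '/'
    # Split on literal '/' first, then decode each segment on its own.
    segments = [unquote(s) for s in raw.split("/")[1:]]
    return raw, decoded, segments

raw, decoded, segments = path_views("gemini://example.org/a%2Fb/c")
print(raw)       # /a%2Fb/c
print(decoded)   # /a/b/c
print(segments)  # ['a/b', 'c']
```

The naively decoded form reads as three segments, the per-segment form as two; a server that only ever sees the decoded string cannot tell them apart.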
Even the modern concept of a URL, from the WhatWG [2], is getting away
from this. As defined there, the "path percent-encode set" is:
U+0000 .. U+001F
U+007F .. U+10FFFF
U+0020 SPACE
U+0022 "
U+0023 #
U+0025 %
U+003C <
U+003E >
U+003F ?
U+0060 `
U+007B {
U+007D }
Nothing about U+002F ('/'). I'm not saying we need to switch from
RFC-3986 (or RFC-3987) and use the WhatWG definition of a URL, but the
WhatWG is very pragmatic and is looking at *what is actually being used,*
not *what can be used.*
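
As a rough sketch of what that set means in practice, the Python below percent-encodes a path using exactly the code points listed above. The list is taken from this message as written, not checked against the current WHATWG text, and the names in_path_set and encode_path are invented for the example.

```python
# Printable ASCII members of the set, as listed in the message above.
PATH_ENCODE = set(' "#%<>?`{}')

def in_path_set(ch):
    """True if ch falls in the listed path percent-encode set."""
    cp = ord(ch)
    return cp <= 0x1F or cp >= 0x7F or ch in PATH_ENCODE

def encode_path(path):
    """Percent-encode the UTF-8 bytes of every character in the set."""
    out = []
    for ch in path:
        if in_path_set(ch):
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

print(encode_path("/a b/caf\u00e9?x"))  # /a%20b/caf%C3%A9%3Fx
```

Note that '/' passes through untouched, so nothing in this scheme ever produces %2F from a path. Because the list as quoted includes '%', an existing %2F would come out as %252F; a real implementation would want to confirm that detail against the specification.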
Here's another task---create a Gemini server where %2F *is required* to
retrieve the resource. Then see how many clients can actually query it, and
then convince all the client authors to fix their code.
Good luck. We're counting on you.
> Exhibit № 2: the URI vs. IRI saga
[ snip ]
> On the other hand, there is Solene, who single-handedly, without blinking,
> demonstrates what the essence of an IRI is, with ½ a dozen lines of old
> fashioned C code. ‡
Yes, I can believe it. If you aren't really doing any heavy text
processing, then even parsing "byte-by-byte" can work well with UTF-8. I
wrote my own blogging engine [3] assuming only US-ASCII, and it works just
fine with UTF-8. Hell, even my own very simplistic URL parser I wrote back
in the mid-90s parsed an IRI just fine, but that's down to a lax parser.
Compare that with my own URL parser [4], which is based on the BNF of
RFC-3986 and won't deal with an IRI because of the stricter requirements.
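
The "for free" behaviour comes from a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so ASCII delimiters such as ':' and '/' never appear inside an encoded character. The sketch below is a deliberately lax byte-wise splitter in Python, not any of the parsers mentioned here; lax_split and the sample URL are made up for illustration.

```python
def lax_split(url_bytes):
    """Split raw bytes into (scheme, host, path) on ASCII delimiters only."""
    scheme, sep, rest = url_bytes.partition(b"://")
    if not sep:
        raise ValueError("no scheme separator found")
    host, slash, path = rest.partition(b"/")
    return scheme, host, slash + path

iri = "gemini://example.org/caf\u00e9/na\u00efve".encode("utf-8")
scheme, host, path = lax_split(iri)
print(scheme, host, path.decode("utf-8"))
```

The parser never looks at the non-ASCII bytes, so they pass through intact. A parser that validates each character against the RFC-3986 grammar would reject the same input, since those bytes are outside the allowed set unless percent-encoded.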
I'm not intentionally belittling Solene, I'm just pointing out that
supporting UTF-8, depending on the processing, just sometimes happens "for
free". I also recall there being an issue with Solene's client not
supporting port numbers, so it's not always perfect engineering.
A third task for you---try writing code to properly wrap *this* page:
gemini://gemini.conman.org/test/torture/0049
You can assume a monospace font and a width of 80 character cells (aka, a
typical terminal width).
Good luck. We're counting on you.
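
To give a sense of why wrapping is harder than counting characters, here is a minimal Python sketch that wraps by estimated terminal cells instead of code points. It makes no claim about what the test page above contains. The helpers cell_width, text_width and wrap are hypothetical, and the width rule (zero for combining and format characters, two for East Asian wide or fullwidth, one otherwise) is a simplification that ignores grapheme clusters, emoji sequences and bidirectional text.

```python
import unicodedata

def cell_width(ch):
    """Estimate how many terminal cells one code point occupies."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def text_width(s):
    return sum(cell_width(c) for c in s)

def wrap(text, width=80):
    """Greedy word wrap measured in cells; overlong words are not split."""
    lines, cur, cur_w = [], [], 0
    for word in text.split():
        w = text_width(word)
        if cur and cur_w + 1 + w > width:
            lines.append(" ".join(cur))
            cur, cur_w = [word], w
        else:
            cur_w += w + (1 if cur else 0)
            cur.append(word)
    if cur:
        lines.append(" ".join(cur))
    return lines
```

With this estimate, "e" followed by U+0301 counts as one cell and each CJK ideograph counts as two, so a line that fits by character count may still overflow the terminal.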
-spc
[1] Path parameters were originally defined for ftp: URLs in RFC-1630,
and only applied after the path.
[2] https://url.spec.whatwg.org/#percent-encoded-bytes
[3] https://github.com/spc476/mod_blog
[4] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua