[tech] [eli5] URI = IRI = ASCII = UTF-8 = Unicode
Stephane Bortzmeyer
stephane at sources.org
Sun Jan 3 13:55:22 GMT 2021
On Wed, Dec 30, 2020 at 12:25:31AM +0100,
Petite Abeille <petite.abeille at gmail.com> wrote
a message of 14 lines which said:
> > URI's use UTF-8 encoded octets only by popular convention and not by
> > any hard rule. You can stick any kind of binary data into a URI as
> > long as you percent-encode the non-ASCII bytes.
>
> Yes, indeed. Any random binary will do, e.g. the query portion could
> contain any weird binary data one sees fit to put there.
>
> Not so much in other parts of the URI though, UTF-8 rules there.
This is not true. As Michael said, URIs are bytes, not characters. The
encoding of those bytes is anyone's guess.
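To illustrate, here is a small Python sketch (the path "/caf%E9" is a
made-up example, not taken from any real capsule). The same
percent-encoded octet is perfectly readable as Latin-1 but is not valid
UTF-8, and nothing in the URI itself says which reading was intended:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical path, for illustration only: "caf" followed by the octet 0xE9.
path = "/caf%E9"

# Percent-decoding yields bytes, not characters.
raw = unquote_to_bytes(path)
print(raw)                    # b'/caf\xe9'

# Read as Latin-1, the octet 0xE9 is "é".
print(raw.decode("latin-1"))  # /café

# Read as UTF-8, the very same octets are invalid.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
print(is_utf8)                # False
```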
Two details:
* paths (not just queries) can contain any binary garbage, as long as
it is percent-encoded, but there are special rules for hostnames.
* the RFC (RFC 3986, section 2.5) has provisions for "a new URI
scheme" which may apply to us. We can decide here that URIs of scheme
"gemini" MUST be entirely in UTF-8.
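If we did adopt such a rule, a client or server could check it along
these lines. This is only a sketch in Python under my own assumptions:
the function names are hypothetical, only the path and query are
checked, and hostnames are left out since they follow their own rules.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def percent_decoded_is_utf8(component: str) -> bool:
    """Return True if the percent-decoded octets form valid UTF-8."""
    try:
        unquote_to_bytes(component).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def gemini_uri_is_utf8(uri: str) -> bool:
    """Hypothetical check for a 'gemini URIs MUST be UTF-8' rule.

    Returns False for any scheme other than "gemini". The hostname is
    deliberately not examined here.
    """
    parts = urlsplit(uri)
    if parts.scheme != "gemini":
        return False
    return all(percent_decoded_is_utf8(c) for c in (parts.path, parts.query))


# Example URIs are invented for illustration.
print(gemini_uri_is_utf8("gemini://example.org/caf%C3%A9"))  # True
print(gemini_uri_is_utf8("gemini://example.org/caf%E9"))     # False
```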