Unicode vs. the World

Sean Conner sean at conman.org
Thu Dec 17 09:48:43 GMT 2020


It was thus said that the Great Björn Wärmedal once stated:
> How does a client handle a link like the following:
> => essays/why-spaces-are-%20-in-URLs.gmi
> 
> The assumption here is that the author has not percent encoded
> themselves -- this is the actual filename, %20 and all.
> 
> How can the client tell if it's percent encoded or not? If you start
> by decoding it you distort the filename. If you just assume it isn't
> percent encoded and go ahead and do that you will handle this link
> correctly but break any links that are already percent encoded. I've
> only done this in Python, using the urllib.parse library. I can tell
> it to encode or decode, but it will do what I tell it to without
> exception. It's up to me to build logic that avoids breaking the edge
> cases.
> 
> We can decide to *always* percent encode links in gemtext (as the spec
> states now) or to *never* do it, but I don't see how we can reasonably
> have both. And never doing it means we can never link to a file with
> spaces in the URL, and will have to percent decode anything we copy
> paste from a web browser's address bar. There will be extra work for
> authors either way.
> 
> Consider another hypothetical case:
> => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me!
> 
> How would you solve that?
> 
> However much I *want* to have IRIs and IDNs in gemtext and leave the
> work to clients and servers, I don't have a solution for that as an
> implementer.

  I don't have a solution either, and while trying to nail down every
possible corner case is admirable, sometimes, you just have to say, "don't
do that!" (or in other words, document or warn about the corner case).
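
  To make the ambiguity concrete, here is a minimal Python sketch using
urllib.parse (the library Björn mentions).  The variable names are only
illustrative, and this is not a proposed fix; it just shows that both
readings of the link are self-consistent, so the string alone cannot tell a
client which one the author meant.

```python
from urllib.parse import quote, unquote

raw = "essays/why-spaces-are-%20-in-URLs.gmi"

# Reading 1: the author already percent-encoded the link, so decode it
# to find the filename.
as_if_encoded = unquote(raw)

# Reading 2: this is the literal filename (%20 and all), so encode it
# to build the URL.
as_if_literal = quote(raw)

print(as_if_encoded)  # essays/why-spaces-are- -in-URLs.gmi
print(as_if_literal)  # essays/why-spaces-are-%2520-in-URLs.gmi

# Re-encoding reading 1 gives back the original text, so nothing in the
# string itself says which reading is the right one.
print(quote(as_if_encoded) == raw)  # True
```

  Whichever rule a client picks, one of the two filenames becomes
unreachable unless the author follows the same rule.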

  It's already the case on Unix systems, where a filename can technically
contain any character other than '/' (because it's the path separator) and NUL
(marks the end of the string), but I doubt you'll find any filenames with
control characters [1] or even "problematic characters because of the shell"
like "&", "?", or "*" in them.  People just kind of learn what they can and
can't use for filenames over time.  In fact, that might be an interesting
thing for Lupa [2] or GUS to report on---characters found in filenames [3].
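
  A report like that could be sketched roughly as follows.  This is only
an illustration: the function name, the SAFE set, and the example URLs are
all hypothetical, and it says nothing about how Lupa or GUS actually store
or process what they crawl.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit

# Hypothetical "boring" set: ASCII letters, digits, and a few marks.
SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-._"
)

def unusual_characters(urls):
    """Count characters outside SAFE in the path segments of each URL."""
    counts = Counter()
    for url in urls:
        # Split on '/' before decoding so an encoded %2F stays inside
        # its own segment.
        for segment in urlsplit(url).path.split("/"):
            decoded = unquote(segment)
            counts.update(ch for ch in decoded if ch not in SAFE)
    return counts

# Made-up URLs, purely for illustration.
sample = [
    "gemini://example.org/a%20b/c&d.gmi",
    "gemini://example.org/plain.gmi",
]
print(unusual_characters(sample))
```

  For the two made-up URLs above, only the space and the "&" are counted.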

  I'm not sure how apropos this is, but years ago, when I was at university
studying Computer Science, I was writing a program (for a friend, not course
related) where I wanted to log errors so they would later be seen (as the
program would run unattended, and any messages to the display would not be
seen).  I could log to a file, but the disk could fill up.  Okay, if that
happened, I could log to the printer, but there might not be a printer (or
it could be turned off---this was back when printers were hooked directly to a
computer).  I asked one of my instructors (who worked at IBM, and was on the
team for one of the first Fortran compilers for IBM) what I should do.  His
advice was (and as sad as this is, it's pretty true), if you don't know how
to handle an error, don't bother looking for it.

  -spc

> Cheers,
> ew0k

[1]	Unless it's for pranking someone, not that I would know that.

[2]	Stéphane's new research crawler for Gemini.

[3]	This reminds me, I have a new feature on my own server that allows
	one to dive into a ZIP file:

		gemini://gemini.conman.org/test/UCSD-Pascal-source.zip/

		vs.

		gemini://gemini.conman.org/test/UCSD-Pascal-source.zip

	Right now it's not much of an issue since the filenames for the
	"proof-of-concept" file are just plain ASCII, but in the general
	case, I suppose I should support conversion of filenames to UTF-8,
	but that might be a hard case as well, as character encodings aren't
	readily recorded in ZIP files.
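
	As a rough illustration of how little the format records, here is a
	Python sketch using the standard zipfile module.  As far as I know,
	the only signal a ZIP entry carries is bit 11 (0x800) of its general
	purpose flags, which marks the name as UTF-8; without it, the
	encoding has to be guessed.  The helper names and the demo archive
	are hypothetical, and this is not how my server does it.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

def name_encodings(zip_bytes):
    """Return (name, declared_utf8) pairs for each archive member."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            (info.filename, bool(info.flag_bits & UTF8_FLAG))
            for info in archive.infolist()
        ]

def build_demo_archive():
    """Build a tiny in-memory archive, purely for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("README.TXT", "plain ASCII name")
        archive.writestr("Björn.txt", "non-ASCII name")
    return buffer.getvalue()

for name, declared in name_encodings(build_demo_archive()):
    print(name, "UTF-8 flag set" if declared else "encoding not declared")
```

	Python's own writer only sets the flag when a name isn't plain
	ASCII, so the first entry here comes back undeclared.  Archives made
	by older tools may leave it unset even for non-ASCII names, which is
	the hard case.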

