Text reflow woes (or: I want bullets back!)
Sean Conner
sean at conman.org
Sat Jan 18 23:36:37 GMT 2020
It was thus said that the Great Brian Evans once stated:
> Aaron Janse writes:
> > Hmmm. It does seem, though, that *allowing* ANSI colors would require
> > non-terminal clients to strip ANSI colors, which would be a PITA,
> > especially considering that ANSI is a hot mess (I built an ANSI parser
> > a while ago [1])
>
> Currently Bombadillo has a few different modes. The normal mode removes
> ansi escape codes. As I am parsing a document if I read an `\033` character I
> just toggle an escape code boolean and then consume until I read an A-Za-z
> character (and consume that char as well). It works very quickly and handles
> removing them quite well. I do the same thing for the color mode for any
> escape codes that do not end in `m`. That said, it may not work as well for
> people not parsing by writing characters into a buffer char by char.
Having written an ECMA-48 parser (ECMA-48 defines the terminal control
codes everybody calls ANSI escape codes, even though they aren't defined by
ANSI), I can say you'll probably catch 99% of the control codes used that
way. But the actual definition is (RFC-5234 ABNF):
    CSI      = %d27 '['
             / %d155        ; ISO-8859-1 or similar
             / %d194 %d155  ; UTF-8 encoding
    param    = %d48-63      ; chars '0' through '?'
    meta     = %d32-47      ; chars ' ' through '/'
    cmd      = %d64-126     ; chars '@' through '~'
    sequence = CSI *param *meta cmd
There are other ECMA-48 sequences that could prove dangerous if not
filtered for. I do have Lua code to parse these [1][2] and use them in my
current gopher client to filter them out (and yes, I have come across sites
that embed ECMA-48 control codes).
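
As a rough sketch of how the grammar above could be turned into a filter, here is a Python version working on raw bytes (Python rather than the Lua/LPeg code referenced in [1][2]). The names `strip_csi`, `CSI_UTF8` and `CSI_LATIN1` are made up for this illustration and do not come from any code mentioned in this message. It assumes the caller knows whether the input is UTF-8 or a single-byte encoding, because a bare byte 155 inside UTF-8 text can be part of an ordinary multi-byte character. It only handles CSI sequences, not the other ECMA-48 sequences.

```python
import re

# Everything after the CSI introducer: *param *meta cmd
#   param = bytes 48-63  ('0' through '?')
#   meta  = bytes 32-47  (' ' through '/')
#   cmd   = bytes 64-126 ('@' through '~')
_TAIL = rb"[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"

# CSI is ESC '[' or byte 155 in ISO-8859-1 and similar encodings.
CSI_LATIN1 = re.compile(rb"(?:\x1b\[|\x9b)" + _TAIL)

# CSI is ESC '[' or the byte pair 194 155 when the data is UTF-8.
CSI_UTF8 = re.compile(rb"(?:\x1b\[|\xc2\x9b)" + _TAIL)


def strip_csi(data: bytes, utf8: bool = True) -> bytes:
    """Remove complete CSI sequences from data (hypothetical helper)."""
    pattern = CSI_UTF8 if utf8 else CSI_LATIN1
    return pattern.sub(b"", data)
```

For example, `strip_csi(b"\x1b[1;31mred\x1b[0m")` leaves `b"red"`, and a sequence ending in a non-letter such as `b"\x1b[5~"` is removed as well, which a "consume until A-Za-z" loop would not finish on.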
> 2. Do a simple find and replace on the whole document for '\033' and replace
> it with "ESC". While this will still leave the codes displaying to the viewer
> they will not actually render, thus you do not need to worry about line
> movement, screen clears, etc.
You might want to replace the following codepoints to render control codes
harmless:
    0 - 31   ; C0 set, except interpret the range from 7-13 inclusive
    127      ; DEL
    128-159  ; C1 set
I say codepoints because in UTF-8, the C1 set is represented by the
sequences
    194 128 through 194 159
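
A minimal sketch of that replacement, in Python and working on already-decoded text so that the C1 set is matched by codepoint rather than by byte. The function name `neutralize_controls` and the visible `<1B>`-style placeholder are assumptions made for this example; any harmless replacement would do.

```python
def neutralize_controls(text: str) -> str:
    """Replace control codepoints with a visible placeholder.

    Codepoints 7-13 (BEL, BS, HT, LF, VT, FF, CR) are passed through
    so the client can still interpret them.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        is_c0 = cp <= 31 and not (7 <= cp <= 13)  # C0 set, minus 7-13
        is_del = cp == 127                        # DEL
        is_c1 = 128 <= cp <= 159                  # C1 set
        if is_c0 or is_del or is_c1:
            out.append(f"<{cp:02X}>")             # placeholder format is an assumption
        else:
            out.append(ch)
    return "".join(out)
```

With this, `neutralize_controls("a\x1b[0mb")` gives `"a<1B>[0mb"`: the rest of the sequence is still displayed, but the terminal never sees an actual ESC.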
-spc
[1] https://github.com/spc476/LPeg-Parsers/blob/master/iso/control.lua
This handles encodings in ISO-8859-1 and similar. I have a UTF-8
one that is separate. This one just returns the escape sequence as
a unit with no further parsing of the actual sequence.
[2] https://github.com/spc476/LPeg-Parsers/blob/master/iso/ctrl.lua
This does a more complete parse of the escape sequence, to include
its name (if any). Again, this is for ISO-8859-1 and similar
encodings. I have another version for UTF-8.