Documents with mixed languages

Michael Lazar lazar.michael22 at gmail.com
Sun Dec 12 14:46:15 GMT 2021


On Sun, Dec 12, 2021 at 5:48 AM Stephane Bortzmeyer
<stephane at sources.org> wrote:
>
> On Sat, Dec 11, 2021 at 01:06:25PM -0500,
>  Michael Lazar <lazar.michael22 at gmail.com> wrote
>  a message of 75 lines which said:
>
> > Your best bet, if you're serious about this, is to go ahead and
> > implement your proposal in your client. If others find it useful
> > they will start using it too
>
> I'm not sure it is something to recommend since it can lead to "de
> facto" standards and to "best viewed with client XYZ > 7", which was
> one of the reasons we ran away from the Web.

Speak for yourself. De-facto standards are social proof that a subset
of the community actually wants and will use a feature. That is much
more convincing to me than a loud minority arguing for (or against)
something based on principle alone.

> > This doesn't allow for mixed languages inside of a single
> > line/paragraph though.
>
> Unicode has a solution, but its use is discouraged
> <http://unicode.org/faq/languagetagging.html>. "Most other users who
> need to tag text with the language identity should be using standard
> markup mechanisms, such as those provided by HTML, XML, or other rich
> text mechanisms."

This is super interesting! I wonder what "deprecated" means for
Unicode; surely they wouldn't release a backwards-incompatible
version.
Their implementation guidelines systematically refute the argument for
language tags.

Requirements for Language Tagging

The requirement for language information embedded in plain text data
is often overstated. Many commonplace operations such as collation
seldom require this extra information. In collation, for example,
foreign language text is generally collated as if it were not in a
foreign language. (See Unicode Technical Standard #10, “Unicode
Collation Algorithm,” for more information.) For example, an index in
an English book would not sort the Slovak word “chlieb” after “czar,”
where it would be collated in Slovak, nor would an English atlas put
the Swedish city of Örebro after Zanzibar, where it would appear in
Swedish.

Text to speech is also an area where the case for embedded language
information is overstated. Although language information may be useful
in performing text-to-speech operations, modern software for doing
acceptable text-to-speech must be so sophisticated in performing
grammatical analysis of text that the extra work in determining the
language is not significant in practice.

Language information can be useful in certain operations, such as
spell-checking or hyphenating a mixed-language document. It is also
useful in choosing the default font for a run of unstyled text; for
example, the ellipsis character may have a very different appearance
in Japanese fonts than in European fonts. Modern font and layout
technologies produce different results based on language information.
For example, the angle of the acute accent may be different for French
and Polish.

Language Tags and Han Unification

A common misunderstanding about Unicode Han unification is the
mistaken belief that Han characters cannot be rendered properly
without language information. This idea might lead an implementer to
conclude that language information must always be added to plain text
using the tags. However, this implication is incorrect. The goal and
methods of Han unification were to ensure that the text remained
legible. Although font, size, width, and other format specifications
need to be added to produce precisely the same appearance on the
source and target machines, plain text remains legible in the absence
of these specifications. There should never be any confusion in
Unicode, because the distinctions between the unified characters are
all within the range of stylistic variations that exist in each
country. No unification in Unicode should make it impossible for a
reader to identify a character if it appears in a different font.
Where precise font information is important, it is best conveyed in a
rich text format.
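
For anyone curious what the in-band mechanism looks like in practice,
here is a rough Python sketch. As far as I understand it, a tag is
U+E0001 LANGUAGE TAG followed by invisible "tag characters" that mirror
printable ASCII at an offset of U+E0000. The function names are my own
and purely illustrative; this is not taken from any client or library.

```python
# Illustrative sketch of Unicode language-tag characters (Tags block).
LANGUAGE_TAG = "\U000E0001"           # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000                    # tag characters mirror ASCII at this offset
TAG_BLOCK = range(0xE0000, 0xE0080)   # the whole Tags block, U+E0000..U+E007F


def encode_language_tag(lang):
    """Return the invisible tag sequence for an ASCII language code like 'fr'."""
    if not lang or not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def decode_language_tags(text):
    """List the language codes embedded in text, in order of appearance."""
    found = []
    current = None  # code being collected, or None when outside a tag
    for ch in text:
        cp = ord(ch)
        if ch == LANGUAGE_TAG:
            if current is not None:
                found.append(current)
            current = ""
        elif current is not None and 0xE0020 <= cp <= 0xE007E:
            current += chr(cp - TAG_BASE)
        elif current is not None:
            # Any other character ends the tag sequence.
            found.append(current)
            current = None
    if current is not None:
        found.append(current)
    return found


def strip_tag_characters(text):
    """Drop every character in the Tags block, leaving the visible text."""
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)


mixed = encode_language_tag("fr") + "pain " + encode_language_tag("en") + "bread"
print(decode_language_tags(mixed))
print(strip_tag_characters(mixed))
```

A client that doesn't care about the tags can just strip them, and the
text stays legible, which fits the point made in the guidelines above.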

