(Simple?) Unicode Question

Sun Aug 30 13:30:48 EDT 2009

On Sun, 30 Aug 2009 02:36:49 +0000, Steven D'Aprano wrote:

>>> So long as your terminal has a sensible encoding, and you have a good
>>> quality font, you should be able to print any string you can create.
>> 
>> UTF-8 isn't a particularly sensible encoding for terminals.
> 
> Did I mention UTF-8?
> 
> Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

I don't think I've ever seen a terminal (whether an emulator running on a
PC or a hardware terminal) which supports anything like the entire Unicode
repertoire, along with right-to-left writing, complex scripts, etc. Even
support for double-width characters is uncommon.

If your terminal can't handle anything outside of ISO-8859-1, there isn't
any advantage to using UTF-8, and there are some disadvantages; e.g. a
typical Unix tty driver will delete only the last *byte* from the input
buffer when you press backspace, which can leave a truncated multi-byte
sequence behind (Linux 2.6.* has the IUTF8 flag, but this is non-standard).
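
To make the backspace problem concrete, here is a minimal sketch, assuming
Python 3 and modelling the tty input buffer as a plain bytes object. The
function names are made up for illustration; this is not what any real tty
driver runs, just the same logic in miniature.

```python
# Sketch only: a tty input buffer modelled as a bytes object.
# All function names here are hypothetical, chosen for illustration.

def erase_last_byte(buf: bytes) -> bytes:
    """What a byte-oriented line discipline does on backspace."""
    return buf[:-1]

def erase_last_char_utf8(buf: bytes) -> bytes:
    """Erase one whole UTF-8 sequence: step back over continuation
    bytes (bit pattern 10xxxxxx) to the lead byte, then cut there."""
    i = len(buf) - 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:max(i, 0)]

def is_valid_utf8(buf: bytes) -> bool:
    """True if buf decodes cleanly as UTF-8."""
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

word = "caf\u00e9"                      # "cafe" with e-acute
utf8_line = word.encode("utf-8")        # e-acute takes two bytes here
latin1_line = word.encode("iso-8859-1") # e-acute takes one byte here

print(erase_last_byte(utf8_line))                 # half a character remains
print(is_valid_utf8(erase_last_byte(utf8_line)))  # buffer is now malformed
print(erase_last_char_utf8(utf8_line))            # UTF-8-aware erase
print(erase_last_byte(latin1_line))               # unibyte: no problem
```

With a unibyte encoding the byte-oriented erase is already correct; with
UTF-8 the driver has to know about the encoding to do the right thing.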

Historically, terminal I/O has tended to revolve around unibyte encodings,
with everything except the endpoints being encoding-agnostic. Anything
which falls outside of that is a dog's breakfast; it's no coincidence
that the word for "messed-up text" (arising from an encoding mismatch)
was borrowed from Japanese (mojibake).

Life is simpler if you can use a unibyte encoding. Apart from anything
else, the failure modes tend to be harmless. E.g. with an encoding
mismatch, you get the wrong glyph rather than two glyphs where you
expected one. On a 7-bit channel (one which strips the top bit), you get
the wrong printable character rather than a control character (this is why
ISO-8859-* reserves \x80-\x9F as control codes rather than using them as
printable characters).
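
A rough illustration of those failure modes, assuming Python 3 and its
standard codecs. strip_high_bit is a made-up helper standing in for a
7-bit channel, and cp1252 is used only as an example of an encoding which
does put printable characters in \x80-\x9F.

```python
# Sketch only: comparing failure modes of unibyte and multibyte encodings.

def strip_high_bit(data: bytes) -> bytes:
    """Hypothetical stand-in for a channel that drops the eighth bit."""
    return bytes(b & 0x7F for b in data)

# 1. Unibyte mismatch: one character in, one (wrong) character out.
#    Byte 0xA4 is the currency sign in ISO-8859-1, the euro in ISO-8859-15.
wrong_glyph = "\u00a4".encode("iso-8859-1").decode("iso-8859-15")
print(len(wrong_glyph), repr(wrong_glyph))

# 2. UTF-8 read as a unibyte encoding: one character becomes two.
two_glyphs = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(len(two_glyphs), repr(two_glyphs))

# 3. 7-bit channel, ISO-8859-1: e-acute (0xE9) degrades to a printable "i".
print(strip_high_bit("\u00e9".encode("iso-8859-1")))

# 4. 7-bit channel, cp1252: left double quote (0x93) degrades to 0x13,
#    which is the DC3 (XOFF) control character.
print(strip_high_bit("\u201c".encode("cp1252")))
```

Cases 1 and 3 are cosmetic. Cases 2 and 4 are the kind that break column
counts or flow control.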

>> And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
>> together and stuff them into a TTF file; that doesn't make them "a
>> font", though.
> 
> I never mentioned "Unicode font" either. In any case, there's no reason 
> why a skillful designer can't make a single font which covers the entire 
> Unicode range in a consistent style.

Consistency between unrelated scripts is neither realistic nor
desirable.

E.g. Latin fonts tend to use uniform stroke widths unless they're
specifically designed to look like handwriting, whereas Han fonts tend to
prefer variable-width strokes which reflect the direction.

>> The main advantage of using Unicode internally is that you can associate
>> encodings with the specific points where data needs to be converted
>> to/from bytes, rather than having to carry the encoding details around
>> the program.
> 
> Surely the main advantage of Unicode is that it gives you a full and 
> consistent range of characters not limited to the 128 characters provided 
> by ASCII?

Nothing stops you from using other encodings, or from using multiple
encodings. But using multiple encodings means keeping track of the
encodings. This isn't impossible, and it may produce better results (e.g.
no information loss from Han unification), but it can be a lot more work.
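
As a minimal sketch of "encodings only at the conversion points", assuming
Python 3: the function below (shout is a made-up name, and upper-casing is
just a placeholder for real processing) decodes on the way in, works purely
on str, and encodes on the way out. Nothing in the middle needs to know
which encodings are in use.

```python
# Sketch only: encoding details live at the edges, not in the middle.

def shout(data: bytes, in_encoding: str, out_encoding: str) -> bytes:
    """Hypothetical filter: decode at input, process text, encode at output."""
    text = data.decode(in_encoding)      # bytes -> str at the input boundary
    result = text.upper()                # encoding-agnostic processing
    return result.encode(out_encoding)   # str -> bytes at the output boundary

latin1_input = "na\u00efve".encode("iso-8859-1")   # "naive" with i-diaeresis
utf8_output = shout(latin1_input, "iso-8859-1", "utf-8")
print(utf8_output.decode("utf-8"))
```

The multiple-encoding alternative would have to pass an encoding name
alongside every byte string that moves through the program.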