[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Mon Jun 10 19:24:36 CEST 2013

Mathias Panzenböck writes:

 > > Some Japanese people still refuse to use Unicode because of the
 > > Unihan controversy. Briefly: Characters like 刃 (U+5203) are
 > > drawn differently in Japanese and Chinese, but Unicode considers
 > > them the same character (to get the Chinese variation, you have
 > > to use a Chinese font). This is a problem—but Shift-JIS has the
 > > exact same problem.
 > 
 > That's what I meant, but I thought Shift-JIS doesn't have this
 > problem? I don't work with such encodings, I just read about those
 > problems.

It depends on the font format and font selection algorithm.  20 years
ago Shift JIS was less likely to have the problem because a legacy-
format font used Shift JIS directly as an index into the glyph table,
and nobody who wasn't Japanese used Shift JIS, so you could bet on a
Japanese font.

10 years ago, Type 1 CID fonts and TrueType fonts became popular.  These
index indirectly: they translate character codes to glyph indexes, then
look the glyphs up.  Configuration was often done poorly (for
example, many Chinese fonts claim to be able to represent Japanese,
which is true but ugly), and rendering engines often made poor
choices, even though you could almost always make an accurate guess as
to which language was being rendered from the character encoding.
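
To make the "guess the language from the encoding" point concrete, here
is a rough Python 3 sketch.  The codec-to-language table and the helper
name decode_with_language_hint are my own illustrative assumptions, not
anything a real rendering engine exposes.  It only shows that the legacy
codec name carries a language hint, while the decoded code point is the
same U+5203 in every case:

```python
# Hypothetical mapping from legacy codec name to a likely language tag.
# This table is an assumption made for illustration only.
LEGACY_CODEC_LANGUAGE = {
    "shift_jis": "ja",
    "euc_jp": "ja",
    "gb2312": "zh-Hans",
    "big5": "zh-Hant",
}


def decode_with_language_hint(data, codec):
    """Decode bytes and return (text, language guess or None)."""
    text = data.decode(codec)
    return text, LEGACY_CODEC_LANGUAGE.get(codec)


blade = "\u5203"  # the character from the quoted example

for codec in LEGACY_CODEC_LANGUAGE:
    raw = blade.encode(codec)
    text, lang = decode_with_language_hint(raw, codec)
    # The bytes differ per codec, but the decoded code point does not;
    # only the codec name tells us which language was probably meant.
    print(codec, raw.hex(), "U+%04X" % ord(text), lang)

# With a Unicode encoding there is no such hint to fall back on.
text, lang = decode_with_language_hint(blade.encode("utf-8"), "utf-8")
print("utf-8", "U+%04X" % ord(text), lang)  # lang is None here
```

Once the text is Unicode, any language information has to come from
somewhere else (markup, locale, user configuration), which is where
font selection tends to go wrong.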

Today rendering is slowly improving, but a problem remains.  Because
Japanese and Chinese prefer different styles in drawing the glyphs,
some fonts are more appropriate for Japanese than for Chinese and vice
versa, yet systems are rarely configured to make the fine distinctions
automatically for multilingual users.

I have heard that the same problem occurs in very nice fonts for Latin
characters.  Some languages consider umlauts and other diacritics to
be part of the character, others consider them to be additions.  The
former languages tend to prefer fonts with less space between the base
character and the diacritical mark than the latter.  (So I have heard,
but I've also heard it's B.S. ;-)

Anyway, this is way OT.  If you want to know more about Asian
character encodings and related topics like fonts, Ken Lunde's _CJKV
Information Processing_ is the bible.