[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Mon Jun 10 20:50:33 CEST 2013

From: Mathias Panzenböck <grosser.meister.morti at gmx.net>

Sent: Monday, June 10, 2013 9:03 AM

> On 06/09/2013 09:22 AM, Andrew Barnert wrote:
>>  Some Japanese people still refuse to use Unicode because of the Unihan 
>> controversy. Briefly: Characters like 刃 (U+5203) are drawn differently in 
>> Japanese and Chinese, but Unicode considers them the same character (to get the 
>> Chinese variation, you have to use a Chinese font). This is a problem—but 
>> Shift-JIS has the exact same problem.
> 
> That's what I meant, but I thought Shift-JIS doesn't have this problem? 
> I don't work with such encodings; I just read about those problems.

Just like Unicode, Shift-JIS only has one character for this kanji, and you have to use out-of-band meta-textual information to determine whether to display the Chinese or Japanese version.

Of course in Unicode, it's a script tag or file metadata or user preference setting that controls which font is used; in Shift-JIS, the fact that nobody uses Shift-JIS for Chinese is generally all the information you need. But, either way, if you want to write "I could tell my pen-pal was a Chinese spy because she wrote 刃 instead of 刃", you can't.
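
A quick way to see this from Python 3 is sketched below. It is only a minimal illustration: it assumes the stock `shift_jis` and `gb18030` codecs, and it uses GB18030 purely as a stand-in for "some Chinese encoding", which nobody in this thread named. The variable names are made up for the example.

```python
import unicodedata

# The kanji under discussion, written as an escape so the source
# file's own encoding does not matter.
REN = "\u5203"

# Encode the same character with a Japanese and a Chinese legacy codec.
# (gb18030 is an assumed stand-in for a Chinese encoding.)
sjis_bytes = REN.encode("shift_jis")
gb_bytes = REN.encode("gb18030")

# Decode each byte sequence back to text with its own codec.
from_japanese = sjis_bytes.decode("shift_jis")
from_chinese = gb_bytes.decode("gb18030")

# Show the code point and its Unicode name, then compare the two results.
print(f"U+{ord(REN):04X}", unicodedata.name(REN))
print(from_japanese == from_chinese)
```

Both decodes compare equal, so nothing in the string itself records which variant the writer meant. That choice has to come from the font or from metadata outside the text.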

> See also "More Information" here:
> http://support.microsoft.com/kb/170559
> ...which isn't where I read about this initially. I can't find where I first read about it.

Note that the article's "products this article applies to" list contains only "Microsoft Platform Software Development Kit-January 2000 Edition". The problem was mostly fixed in Unicode 2.0, but Windows ME and 2000 had only partial support for 2.0. While they could display SIP characters, their codepage maps weren't updated to make use of them. So, the Shift-JIS (and Big-5, etc.) mappings were ambiguous: two different Shift-JIS characters mapped to the same Unicode character. Microsoft fixed that in XP and 2003 by upgrading to Unicode 3.0 and implementing the correct mappings.

If you still need to support Windows 2000 or 9x/ME or CE 3.0, or apps built for them, the problem still occasionally shows up today. Classic Mac OS and Palm OS had smaller problems, but nobody cares about those platforms anymore anyway. Pretty much every other platform either ignored Unicode until well after 2.0, or went for UCS-4 or UTF-8 from the start, making the Unicode 2.0 upgrade much easier.