[Python-ideas] Py3 unicode impositions

Mon Feb 13 06:42:00 CET 2012

Paul Moore writes:

 > > But you obviously do know the convention -- use UTF-8.
 > 
 > No. I know that a lot of Unix people advocate UTF-8, and I gather it's
 > rapidly becoming standard in the Unix world. But I work on Windows,
 > and UTF-8 is not the standard there. I have no idea if UTF-8 is
 > accepted cross-platform,

It is.  All of Microsoft's programs (and I suppose most third-party
software, too) that I know of will happily import UTF-8-encoded text,
and produce it as well.  Most Microsoft-specific file formats (eg,
Word) use UTF-16 internally, but they can't be read by most
text-oriented programs, so in practice they're application/octet-stream.

The problem is the one you point out: files you receive from third
parties are still fairly likely to be in a non-Unicode encoding.
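
A minimal sketch of coping with such files in Python 3: try UTF-8
first, and fall back to a legacy code page only if that fails.  The
helper name decode_text and the cp1252 fallback are assumptions for
illustration, not anything prescribed in this thread; pick the
fallback that matches your own correspondents.

```python
def decode_text(data, fallback="cp1252"):
    """Decode bytes as UTF-8; on failure, use a legacy code page.

    Returns (text, encoding_used).  The fallback is a guess, so
    undecodable bytes are replaced rather than raising.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace"), fallback


utf8_result = decode_text("café".encode("utf-8"))
legacy_result = decode_text("café".encode("latin-1"))
```

Trying UTF-8 first is reasonable because non-ASCII text in a legacy
8-bit encoding is rarely valid UTF-8 by accident.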

 > Fair comment. My point here is that I *am* dealing with "legacy" data
 > in your sense. And I do so on a day to day basis. UTF-8 is very, very
 > rare in my world (Windows). Latin-1 (or something close) is common.
 > 
 > There is no cross-platform standard yet. And probably won't be until
 > Windows moves to UTF-8 as the standard encoding. Which ain't happening
 > soon.

True.  But for personal use, and for communicating with people you
have some influence over, you can use/recommend UTF-8 safely as far as
I know.  I occasionally get asked by Japanese people why files I send in
UTF-8 are broken; it invariably turns out that they sent me a file in
Shift JIS that contained a non-JIS (!) character and my software
translated it to REPLACEMENT CHARACTER before sending as UTF-8.
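
The failure mode can be sketched in Python 3.  The sample text and the
variable names are made up; the assumption is that the file was
written by a Windows program using cp932 (Microsoft's superset of
Shift JIS) and read by software that decodes it as plain Shift JIS
with replacement enabled.

```python
# Hypothetical input: cp932 bytes containing CIRCLED DIGIT ONE, a vendor
# extension that is not part of the JIS X 0208 repertoire.
received = "項目①".encode("cp932")

# A strict Shift JIS decode rejects the extension character.
try:
    received.decode("shift_jis")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes U+FFFD REPLACEMENT CHARACTER instead.
decoded = received.decode("shift_jis", errors="replace")

# Re-encoding as UTF-8 preserves the damage: U+FFFD becomes EF BF BD.
resent = decoded.encode("utf-8")
```

The UTF-8 output is well-formed; the information was already lost at
the decoding step.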

 > I think people are much more aware of the issues, but cross-platform
 > handling remains a hard problem. I don't wish to make assumptions, but
 > your insistence that UTF-8 is a viable solution suggests to me that
 > you don't know much about the handling of Unicode on Windows. I wish I
 > had that luxury...

I don't understand what you mean by that.  Windows doesn't make
handling any non-Unicode encodings easy, in my experience, except for
the local code page.  So, OK, if you're in a monolingual Windows
environment (eg, the typical Japanese office), everybody uses a common
legacy encoding for file exchange (including URLs and MIME filename=
:-(, in particular Shift JIS), and only that encoding works well (ie,
without the assistance of senior tech support personnel).  Handling
Unicode, though, isn't really an issue; all of Microsoft's programs
happily deal with UTF-8 and UTF-16 (in its several varieties).
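
For reading such files from Python, a byte order mark is usually
enough to tell the Unicode varieties apart.  This is a sketch only:
sniff_bom is a hypothetical helper, it ignores UTF-32, and it assumes
BOM-less input is UTF-8, which is exactly the assumption under
discussion here.

```python
import codecs


def sniff_bom(data):
    """Guess a codec name from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # codec reads the BOM to pick endianness
    return "utf-8"           # assumption: no BOM means UTF-8


sample = "hi".encode("utf-16")          # encoder writes a BOM first
text = sample.decode(sniff_bom(sample))
```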

 > And that's even without all this foreign UTF-8 I get from the Unix
 > guys :-) Apart from the blasted UTF-16, all of it's "ASCII most of
 > the time".

Indeed.  Do you really see UTF-16 in files that you process with
Python?