[Python-ideas] Py3 unicode impositions
Stephen J. Turnbull
stephen at xemacs.org
Mon Feb 13 06:42:00 CET 2012
Paul Moore writes:
> > But you obviously do know the convention -- use UTF-8.
>
> No. I know that a lot of Unix people advocate UTF-8, and I gather it's
> rapidly becoming standard in the Unix world. But I work on Windows,
> and UTF-8 is not the standard there. I have no idea if UTF-8 is
> accepted cross-platform,
It is. All of Microsoft's programs (and I suppose most third-party
software, too) that I know of will happily import UTF-8-encoded text,
and produce it as well. Most Microsoft-specific file formats (eg,
Word) use UTF-16 internally, but they can't be read by most
text-oriented programs, so in practice they're application/octet-stream.
The problem is the one you point out: files you receive from third
parties are still fairly likely to be in a non-Unicode encoding.
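One common way to cope with such files is to try UTF-8 strictly and fall back to a legacy encoding only when that fails. The sketch below is an illustration, not a recommendation from this thread: the helper name decode_with_fallback and the choice of cp1252 as the fallback are assumptions, and the right fallback depends on where the file came from.

```python
def decode_with_fallback(data, fallback="cp1252"):
    """Return (text, encoding_used) for a byte string of unknown encoding.

    Hypothetical helper: UTF-8 is tried first because non-UTF-8 legacy
    text containing non-ASCII bytes rarely happens to be valid UTF-8.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The fallback is an assumption about the sender's environment;
        # errors="replace" keeps the function from raising on stray bytes.
        return data.decode(fallback, errors="replace"), fallback


text, used = decode_with_fallback("café".encode("latin-1"))
```

A successful strict UTF-8 decode is strong evidence for UTF-8, but the fallback branch is only a guess: it cannot tell one single-byte code page from another.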
> Fair comment. My point here is that I *am* dealing with "legacy" data
> in your sense. And I do so on a day to day basis. UTF-8 is very, very
> rare in my world (Windows). Latin-1 (or something close) is common.
>
> There is no cross-platform standard yet. And probably won't be until
> Windows moves to UTF-8 as the standard encoding. Which ain't happening
> soon.
True. But for personal use, and for communicating with people you
have some influence over, you can use/recommend UTF-8 safely as far as I
know. I occasionally get asked by Japanese people why files I send in
UTF-8 are broken; it invariably turns out that they sent me a file in
Shift JIS that contained a non-JIS (!) character and my software
translated it to REPLACEMENT CHARACTER before sending as UTF-8.
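A minimal sketch of how that kind of breakage can arise, in Python 3. The byte values are an illustrative assumption (a vendor-extension code point appended to ordinary Shift JIS text), not the contents of any actual file, and my mail software is not Python; this only reproduces the effect.

```python
# Ordinary JIS X 0208 text, followed by the two bytes 0x87 0x40, which
# Microsoft's cp932 assigns to a vendor-extension character but which
# Python's strict "shift_jis" codec does not recognize.
raw = "日本語".encode("shift_jis") + b"\x87\x40"

# Decoding as plain Shift JIS with errors="replace" substitutes
# U+FFFD REPLACEMENT CHARACTER for what it cannot decode.
text = raw.decode("shift_jis", errors="replace")

# Re-encoding as UTF-8 then faithfully preserves the replacement
# character, so the recipient sees "broken" text.
utf8 = text.encode("utf-8")

# The same bytes decode without any replacement under cp932.
text_cp932 = raw.decode("cp932")
```

The information is lost at the decode step, not at the UTF-8 step; the UTF-8 output is a correct encoding of already-damaged text.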
> I think people are much more aware of the issues, but cross-platform
> handling remains a hard problem. I don't wish to make assumptions, but
> your insistence that UTF-8 is a viable solution suggests to me that
> you don't know much about the handling of Unicode on Windows. I wish I
> had that luxury...
I don't understand what you mean by that. Windows doesn't make
handling any non-Unicode encodings easy, in my experience, except for
the local code page. So, OK, if you're in a monolingual Windows
environment (eg, the typical Japanese office), everybody uses a common
legacy encoding for file exchange (including URLs and MIME filename=
:-(, in particular Shift JIS), and only that encoding works well (ie,
without the assistance of senior tech support personnel). Handling
Unicode, though, isn't really an issue; all of Microsoft's programs
happily deal with UTF-8 and UTF-16 (in its several varieties).
> And that's even without all this foreign UTF-8 I get from the Unix
> guys :-) Apart from the blasted UTF-16, all of it's "ASCII most of
> the time".
Indeed. Do you really see UTF-16 in files that you process with
Python?
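If you want to check, looking for a byte order mark at the start of the data is a cheap first test. The sketch below is an assumption-laden illustration: sniff_bom is a hypothetical helper, and files written without a BOM (which is legal for UTF-16BE/LE and usual for UTF-8) will not be detected by it.

```python
import codecs


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None.

    Hypothetical helper. UTF-32 is checked before UTF-16 because the
    UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
    """
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


sample = "héllo".encode("utf-16")  # Python's "utf-16" codec writes a BOM
codec = sniff_bom(sample)
decoded = sample.decode(codec) if codec else None
```

The returned names ("utf-16", "utf-32", "utf-8-sig") are the Python codecs that consume the BOM while decoding, so the decoded text does not start with U+FEFF.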