[Python-ideas] Py3 unicode impositions

Tue Feb 14 10:36:54 CET 2012

Paul Moore writes:

 > Basically, in my experience, Windows users are not likely to produce
 > UTF-8 formatted files unless they make specific efforts to do so.

Agreed.  All I meant was that if you make the effort to do so, your
Windows-based correspondents will be able to read it, and vice versa.

 > As I say:
 > - I know what to do
 > - It can be a lot of work
 > - Frankly, the damage is minor (these are usually personal or low-risk scripts)
 > - The temptation to say "stuff it" and get on with my life is high
 > - It frustrates me that Python by default tempts me to *not* do the right thing

Please don't blame it on Python.  Python tempts you because it offers
the choice to do it right.  There is no way that Python can do it
right *for* you; not even all the resources Microsoft or Apple can
bring to bear have managed that (you can't get 100% even within an
all-Windows or all-Mac shop, let alone cross-platform).  Not yet; it
requires your help.
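
As a rough illustration of what "doing it right" can look like in
Python 3 (a sketch only; the helper names write_text_utf8,
read_text_utf8 and roundtrip are invented here), name the encoding
explicitly instead of relying on the platform default:

```python
import os
import tempfile


def write_text_utf8(path, text):
    # Name the encoding explicitly; the default depends on the OS and locale.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def read_text_utf8(path):
    # Decode with the same explicit encoding the writer used.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()


def roundtrip(text):
    # Write text to a temporary file and read it back (hypothetical helper).
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_text_utf8(path, text)
        return read_text_utf8(path)
    finally:
        os.remove(path)
```

With the encoding spelled out on both sides, the result should not
depend on whether the script runs on Windows or Unix.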

Thanks for caring!<wink/>

 > Maybe it's different in Japan, where character sets are more of a
 > common knowledge issue?

Mojibake is common knowledge in Japan; what to do about it requires a
specialized technical background.

 > But if I tried to say to one of my colleagues that the spooled
 > output of a SQL query they sent me (from a database with one
 > encoding, through a client with no real encoding handling beyond
 > global OS-level defaults) didn't use UTF-8, I'd get a blank look at
 > best.

Again, this is not the direction I have in mind.  I'm thinking more in
terms of the RightThinkingAmongUs using UTF-8 as much as possible, and
whether the recipients will be able to read it (AFAICT/IME they can).
You certainly shouldn't presume that your correspondents "should"
"already" be using UTF-8.  That would be seriously rude on Windows,
where, as you point out, one has to do something rather contorted to
produce UTF-8 in most applications.

 > What I was trying to say was that typical Windows environments (where
 > people don't interact often with Unix utilities, or if they do it's
 > with ASCII characters almost exclusively) hide the details of Unicode
 > from the end user to the extent that they don't know what's going on
 > under the hood, and don't need to care.

Ah.  If you're in a monolingual environment, yes, it works that way.
But it works just as well on Unix if you set LANG appropriately in
your environment.

 > Much like Python 2, I guess :-)

No, Python 2 is better and worse.  Many protocols use magic numbers
that look like ASCII-encoded English (e.g., HTML tags).  Python 2 is
quite happy to process those magic numbers and the intervening content
(as long as each stretch of non-ASCII is treated as an atomic unit),
regardless of whether the actual encoding matches local convention.  (This
is why the WSGI guys love Python 2 -- it can be multilingual without
knowing the encoding!)  On the other hand, the Windows environment
will be more seamless (and allow useful processing of the "intervening
content") as long as you stick to the local convention for encoding.
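
A minimal sketch of that style of processing, written with Python 3
bytes (which behave much like Python 2 str here).  It assumes an
ASCII-compatible encoding in which the bytes for "<" and ">" never
occur inside a multibyte character (true of UTF-8 and EUC-JP, not of
ISO-2022-JP).  The names TAG and upcase_tags are invented for
illustration.

```python
import re

# Match anything that looks like an ASCII-delimited tag, as raw bytes.
TAG = re.compile(rb"(<[^>]*>)")


def upcase_tags(data: bytes) -> bytes:
    # With a capturing group, split() returns the tags at odd indices
    # and the intervening content at even indices.
    parts = TAG.split(data)
    # bytes.upper() only changes ASCII letters; the content between
    # tags is passed through untouched, whatever its encoding is.
    return b"".join(p.upper() if i % 2 else p for i, p in enumerate(parts))
```

The function never decodes anything, so it can handle the tags
without knowing the encoding, but for the same reason it cannot do
anything useful with the text between them.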