[Python-ideas] Py3 unicode impositions

Sun Feb 12 13:54:13 CET 2012

On 12 February 2012 05:03, Steven D'Aprano <steve at pearwood.info> wrote:
> Paul Moore wrote:
>
>> My concern about Unicode in Python 3 is that the principle is, you
>> specify the right encoding. But often, I don't *know* the encoding ;-(
>> Text files, like changelogs as a good example, generally have no
>> marker specifying the encoding, and they can have all sorts (depending
>> on where the package came from). Worse, I am on Windows and changelogs
>> usually come from Unix developers - so I'm not familiar with the
>> common conventions ("well, of course it's in UTF-8, that's what
>> everyone uses"...)
>
>
> <raises eyebrow>
>
> But you obviously do know the convention -- use UTF-8.

No. I know that a lot of Unix people advocate UTF-8, and I gather it's
rapidly becoming standard in the Unix world. But I work on Windows,
and UTF-8 is not the standard there. I have no idea if UTF-8 is
accepted cross-platform, or if it has just grown up that way because most
ChangeLog files are written on Unix and Unix users don't worry about
what's convenient on Windows (no criticism there, just acknowledgement
of a fact). And I have seen ChangeLog files with non-UTF-8 encodings
of names in them. I have no idea if that's a bug or just a preference
- and anyway, "be permissive in what you accept" applies...

Get beyond ChangeLog files and it's anybody's guess. My PC has text
files from many, many places (some created on my PC, some created by
others on various flavours and ages of Unix, and some downloaded from
who-knows-where on the internet). Not one of them comes with an
encoding declaration. Of course every file is encoded in some way. But
it's incredibly naive to assume the user knows that encoding. Hey, I
still have to dump out the content of files to check the line ending
convention when working in languages other than Python - universal
newlines saves me needing to care about that, why is it so disastrous
to consider having something similar for encodings?
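
As a rough illustration of what "universal newlines for encodings"
could look like, here is a minimal sketch. The helper name
decode_permissive and its default candidate list are assumptions made up
for this example, not an existing API. It tries strict decoders in turn
and falls back to Latin-1, which accepts any byte sequence, so the worst
case is mojibake rather than an exception.

```python
def decode_permissive(data, candidates=("utf-8", "cp1252")):
    """Decode bytes, trying each candidate encoding strictly, in order.

    Hypothetical helper: the name and the default candidate list are
    illustrative assumptions, not part of the standard library.
    Returns a (text, encoding_used) tuple.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this decode never
    # fails; the result may be mojibake, but it is still usable text.
    return data.decode("latin-1"), "latin-1"


name = "M\u00fcller"  # "Müller", written with an escape to stay ASCII-safe
utf8_bytes = name.encode("utf-8")
legacy_bytes = name.encode("cp1252")

print(decode_permissive(utf8_bytes))
print(decode_permissive(legacy_bytes))
```

The order of candidates matters: UTF-8 goes first because it is strict
and rarely accepts text that was written in another encoding, whereas
the single-byte code pages accept almost anything.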

>> In Python 2, I can ignore the issue. Sure, I can end up with mojibake,
>> but for my uses, that's not a disaster. Mostly-readable works. But in
>> Python 3, I get an error and can't process the file.
>>
>> I can just use latin-1, or surrogateescape. But that doesn't come
>> naturally to me yet. Maybe it will in time... Or maybe there's a
>> better solution I don't know about yet.
>
> So why don't you use UTF-8?

Decoding errors.
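
To make that concrete, a small sketch (the sample bytes are invented for
illustration): a name written in Latin-1 is not valid UTF-8, so a strict
decode raises, while latin-1 and errors='surrogateescape' both get
through.

```python
# A changelog line with a name encoded as Latin-1 (sample data made up
# for this example).
raw = b"Fixed by J\xf6rg"

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # The byte 0xf6 can never appear in valid UTF-8.
    strict_ok = False

# Latin-1 always succeeds, and here it happens to be the right guess.
as_latin1 = raw.decode("latin-1")

# surrogateescape keeps the undecodable byte as a lone surrogate.
escaped = raw.decode("utf-8", errors="surrogateescape")

# The escape is lossless: encoding with the same handler restores the
# original bytes exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(as_latin1), ascii(escaped), round_tripped == raw)
```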

> As far as those who actually don't know the convention, isn't it better to
> teach them the convention "use UTF-8, unless dealing with legacy data"
> rather than to avoid dealing with the issue by using
> errors='surrogateescape'?

Fair comment. My point here is that I *am* dealing with "legacy" data
in your sense. And I do so on a day to day basis. UTF-8 is very, very
rare in my world (Windows). Latin-1 (or something close) is common.

There is no cross-platform standard yet. And probably won't be until
Windows moves to UTF-8 as the standard encoding. Which ain't happening
soon.

> I'd hate for "surrogateescape" to become the One Obvious Way for dealing
> with unknown encodings, because this is 2012 and people should be more savvy
> about non-ASCII characters by now. I suppose it's marginally better than
> just throwing them away with errors='ignore', but still.

I think people are much more aware of the issues, but cross-platform
handling remains a hard problem. I don't wish to make assumptions, but
your insistence that UTF-8 is a viable solution suggests to me that
you don't know much about the handling of Unicode on Windows. I wish I
had that luxury...

> I recently bought a book from Amazon UK. It was £12 not \udcc2\udca312.

£12 in what encoding? :-)
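
A short sketch of why the question matters: the same pound sign becomes
different bytes under each of the encodings in play here. The list of
encodings is just the set mentioned in this thread.

```python
pound = "\u00a3"  # POUND SIGN

encodings = ["utf-8", "latin-1", "cp1252", "cp850", "utf-16-le"]
encoded = {name: pound.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, data)

# The escaped form quoted above is what the UTF-8 bytes turn into when
# they are decoded as ASCII with the surrogateescape error handler.
mangled = pound.encode("utf-8").decode("ascii", errors="surrogateescape")
print(ascii(mangled))
```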

> This isn't entirely a rhetorical question. I'm not on Windows, so perhaps
> there's a problem I'm unaware of.

I think that's the key here. Even excluding places that don't use the
Roman alphabet, Windows encoding handling is complex. CP1252, CP850,
Latin-1, Latin-9 (Euro zone), UTF-16, BOMs. All are in use on my PC
to some extent. And that's even without all this foreign UTF-8 I get
from the Unix guys :-) Apart from the blasted UTF-16, all of it's
"ASCII most of the time".
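
A quick sketch of what "ASCII most of the time" means in practice (the
sample text is made up): pure-ASCII content produces identical bytes
under all the single-byte code pages and UTF-8, which is why the
encoding only starts to matter at the first accented name. UTF-16 is the
odd one out.

```python
import codecs

ascii_text = "2012-02-12: fixed the build"
ascii_bytes = ascii_text.encode("ascii")

ascii_compatible = ["cp1252", "cp850", "latin-1", "iso8859-15", "utf-8"]
same_bytes = all(
    ascii_text.encode(name) == ascii_bytes for name in ascii_compatible
)

# The generic "utf-16" codec prepends a byte order mark and uses two
# bytes per character, so even pure-ASCII text looks different.
utf16_bytes = ascii_text.encode("utf-16")
starts_with_bom = utf16_bytes.startswith(
    (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
)

print(same_bytes, starts_with_bom, len(ascii_bytes), len(utf16_bytes))
```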

Paul.