[Python-ideas] Py3 unicode impositions

Steven D'Aprano steve at pearwood.info
Sun Feb 12 06:03:01 CET 2012


Paul Moore wrote:

> My concern about Unicode in Python 3 is that the principle is, you
> specify the right encoding. But often, I don't *know* the encoding ;-(
> Text files, like changelogs as a good example, generally have no
> marker specifying the encoding, and they can have all sorts (depending
> on where the package came from). Worse, I am on Windows and changelogs
> usually come from Unix developers - so I'm not familiar with the
> common conventions ("well, of course it's in UTF-8, that's what
> everyone uses"...)

<raises eyebrow>

But you obviously do know the convention -- use UTF-8.


> In Python 2, I can ignore the issue. Sure, I can end up with mojibake,
> but for my uses, that's not a disaster. Mostly-readable works. But in
> Python 3, I get an error and can't process the file.
> 
> I can just use latin-1, or surrogateescape. But that doesn't come
> naturally to me yet. Maybe it will in time... Or maybe there's a
> better solution I don't know about yet.

So why don't you use UTF-8?

As for those who genuinely don't know the convention, isn't it better to
teach them "use UTF-8, unless dealing with legacy data" rather than to
avoid dealing with the issue by using errors='surrogateescape'?
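
A minimal sketch of that convention, under stated assumptions: the helper
name decode_text and the choice of latin-1 as the legacy fallback are
hypothetical, picked only for illustration. It tries UTF-8 first and falls
back only when the bytes are not valid UTF-8.

```python
def decode_text(data, legacy_encoding="latin-1"):
    """Decode bytes as UTF-8, else as a legacy encoding (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume legacy data. latin-1 never raises, but
        # it yields mojibake if the real encoding was something else.
        return data.decode(legacy_encoding)


utf8_bytes = "£12".encode("utf-8")      # b'\xc2\xa312'
legacy_bytes = "£12".encode("latin-1")  # b'\xa312', which is invalid as UTF-8

print(decode_text(utf8_bytes) == "£12")    # True
print(decode_text(legacy_bytes) == "£12")  # True
```

The fallback is still a guess, so it suits the "mostly-readable is fine"
use case better than anything that needs exact text.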

I'd hate for "surrogateescape" to become the One Obvious Way for dealing with 
unknown encodings, because this is 2012 and people should be more savvy about 
non-ASCII characters by now. I suppose it's marginally better than just 
throwing them away with errors='ignore', but still.

I recently bought a book from Amazon UK. It was £12, not \udcc2\udca312.
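
For reference, a small sketch of where those escapes come from. It assumes
ASCII as the wrongly chosen codec (any codec that rejects the two bytes
would behave the same way), and the variable names are illustrative only.
The UTF-8 bytes for "£12" are decoded three ways:

```python
raw = "£12".encode("utf-8")  # b'\xc2\xa312'

right = raw.decode("utf-8")
escaped = raw.decode("ascii", errors="surrogateescape")
dropped = raw.decode("ascii", errors="ignore")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(right))    # '\xa312'
print(ascii(escaped))  # '\udcc2\udca312'
print(ascii(dropped))  # '12'

# surrogateescape preserves the original bytes; ignore loses them for good.
print(escaped.encode("ascii", errors="surrogateescape") == raw)  # True
```

So surrogateescape keeps the data recoverable, which errors='ignore' does
not, but the string a person actually sees is still wrong.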

My question about UTF-8 isn't entirely rhetorical. I'm not on Windows, so
perhaps there's a problem I'm unaware of.


-- 
Steven


