[Python-ideas] Py3 unicode impositions
Steven D'Aprano
steve at pearwood.info
Sun Feb 12 06:03:01 CET 2012
Paul Moore wrote:
> My concern about Unicode in Python 3 is that the principle is, you
> specify the right encoding. But often, I don't *know* the encoding ;-(
> Text files, like changelogs as a good example, generally have no
> marker specifying the encoding, and they can have all sorts (depending
> on where the package came from). Worse, I am on Windows and changelogs
> usually come from Unix developers - so I'm not familiar with the
> common conventions ("well, of course it's in UTF-8, that's what
> everyone uses"...)
<raises eyebrow>
But you obviously do know the convention -- use UTF-8.
> In Python 2, I can ignore the issue. Sure, I can end up with mojibake,
> but for my uses, that's not a disaster. Mostly-readable works. But in
> Python 3, I get an error and can't process the file.
>
> I can just use latin-1, or surrogateescape. But that doesn't come
> naturally to me yet. Maybe it will in time... Or maybe there's a
> better solution I don't know about yet.
So why don't you use UTF-8?
As for those who genuinely don't know the convention, isn't it better to
teach them the convention "use UTF-8, unless dealing with legacy data" rather
than to avoid dealing with the issue by using errors='surrogateescape'?
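A minimal sketch of that convention, for illustration only: try UTF-8 first,
and treat the data as legacy only if that fails. The helper name decode_text
and the choice of latin-1 as the legacy fallback are assumptions made for the
example, not part of any standard API.

```python
def decode_text(data, legacy_encoding="latin-1"):
    """Decode bytes as UTF-8, falling back to a legacy encoding.

    Hypothetical helper: the fallback encoding is an assumption and
    should be whatever the legacy data is actually known to use.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy data.
        return data.decode(legacy_encoding)
```

Note that latin-1 can decode any byte sequence, so the fallback never raises;
whether it gives the right characters depends on the data really being latin-1.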
I'd hate for "surrogateescape" to become the One Obvious Way for dealing with
unknown encodings, because this is 2012 and people should be more savvy about
non-ASCII characters by now. I suppose it's marginally better than just
throwing them away with errors='ignore', but still.
I recently bought a book from Amazon UK. It was £12, not \udcc2\udca312.
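To make the difference concrete, here is a small sketch (Python 3) of what
each choice does to the UTF-8 bytes for "£12". The variable names are only
for illustration.

```python
data = "£12".encode("utf-8")          # b'\xc2\xa312'

# Decoding with the right encoding recovers the text.
as_utf8 = data.decode("utf-8")        # '£12'

# surrogateescape smuggles each undecodable byte through as a lone surrogate.
escaped = data.decode("ascii", errors="surrogateescape")  # '\udcc2\udca312'

# ignore silently throws the undecodable bytes away.
ignored = data.decode("ascii", errors="ignore")           # '12'

# latin-1 never fails, but gives mojibake for UTF-8 input.
as_latin1 = data.decode("latin-1")    # 'Â£12'

# The escaped form does at least round-trip back to the original bytes.
round_trip = escaped.encode("ascii", errors="surrogateescape")
```

Only the first result is the text a reader would expect to see; the
surrogateescape result preserves the bytes but is not displayable as-is.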
My question about UTF-8 isn't entirely rhetorical. I'm not on Windows, so
perhaps there's a problem there that I'm unaware of.
--
Steven