Mailman 3 Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%] - Python-ideas

Feb. 14, 2012

      On Mon, Feb 13, 2012 at 12:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
...
Really, Python 3 forces programmers ...
to make the choice between the 4 possible options for
processing ASCII-compatible encodings:
...
1. Process them as binary data.
[Code smell from lying; lots of pain from mismatch with external libraries.]
...
2. Process them as "latin-1".
[Code smell from lying; non-ASCII often turns to gibberish.]
...
3. Process them as "ascii+surrogateescape". This is the *right*
answer if you plan solely to manipulate the text and then write it back
out in the same encoding as was originally received.
[Note that the original "encoding" may well be internally
inconsistent; I've often seen that in log files.]
...
You will get errors if you try to write a string with escaped
characters out to a non-ascii channel or an ascii channel
without surrogateescape enabled. ... (e.g. sys.stdout)

Is there any reason not to enable surrogateescape by default? At
least on the console/terminal?

I can see an argument for replace or xmlcharrefreplace or something more
complicated, but ... if I'm sending output to myself, I would rather
see it (possibly with a mark indicating where it was corrupted) than
to get my program aborted (strict) and *not* be told what data caused
the problem.
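
To make the trade-off concrete, here is a minimal sketch (not from the original thread) of what each error handler does when a string carrying escaped bytes is written to a text channel. The sample bytes, the in-memory stream standing in for a console, and the helper name write_with are all assumptions made for illustration.

```python
import io

# Hypothetical log-like line whose "encoding" is internally inconsistent:
# the first field is UTF-8, the second is Latin-1.
raw = "caf\u00e9 ".encode("utf-8") + "caf\u00e9\n".encode("latin-1")

# Option 3: decode as ASCII and smuggle each non-ASCII byte through
# as a lone surrogate code point.
text = raw.decode("ascii", errors="surrogateescape")

# Round trip: the same codec and handler restore the original bytes exactly.
assert text.encode("ascii", errors="surrogateescape") == raw


def write_with(handler, encoding="utf-8"):
    """Write `text` to an in-memory channel (a stand-in for a console)."""
    buf = io.BytesIO()
    # newline="" disables newline translation so the bytes are comparable
    # across platforms.
    stream = io.TextIOWrapper(buf, encoding=encoding, errors=handler, newline="")
    stream.write(text)
    stream.flush()
    return buf.getvalue()


# strict aborts the write and the data never reaches the channel.
try:
    write_with("strict")
except UnicodeEncodeError as exc:
    print("strict aborts:", exc)

# The other handlers let the output through in different forms.
for handler in ("surrogateescape", "replace", "backslashreplace",
                "xmlcharrefreplace"):
    print(handler, write_with(handler))

assert write_with("replace") == b"caf?? caf?\n"
assert write_with("backslashreplace") == b"caf\\udcc3\\udca9 caf\\udce9\n"
```

With surrogateescape the original bytes come back out unchanged, which may or may not be what a terminal can display; backslashreplace gives the kind of visible "mark indicating where it was corrupted" asked about above. In later Python versions (3.7 and up), sys.stdout.reconfigure(errors=...) changes the handler on an existing stream.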
...
4. Get a third party encoding guessing library and use that instead of
waving away the problem of ASCII-incompatible encodings.
And I do think this needs to stay 3rd-party; domain information
matters, and n-gram guessing should not be subject to stability
guarantees.

-jJ

Jim Jewett