[Python-ideas] Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%]

Tue Feb 14 21:20:23 CET 2012

On Mon, Feb 13, 2012 at 12:16 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> Really, Python 3 forces programmers ...
> to make the choice between the 4 possible options for
> processing ASCII-compatible encodings:

> 1. Process them as binary data.

[Code smell from lying; lots of pain from mismatch with external libraries.]

> 2. Process them as "latin-1".

[Code smell from lying; non-ASCII often turns to gibberish.]
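
[A minimal sketch of the gibberish, assuming the input is really UTF-8; the sample string is made up for illustration.]

```python
# Hypothetical input: the word "café" encoded as UTF-8.
raw = "café".encode("utf-8")      # b'caf\xc3\xa9'

# latin-1 maps every byte to a code point, so decoding never fails,
# but each byte of a multi-byte UTF-8 sequence becomes its own character.
text = raw.decode("latin-1")
print(text)                        # cafÃ©

# The original bytes do survive a round trip through latin-1.
assert text.encode("latin-1") == raw
```

[The data is preserved, but anything shown to a human or compared against real text is wrong.]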

> 3. Process them as "ascii+surrogateescape". This is the *right*
> answer if you plan solely to manipulate the text and then write it back
> out in the same encoding as was originally received.

[Note that the original "encoding" may well be internally
inconsistent; I've often seen that in log files.]
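
[A rough sketch of that round trip on an internally inconsistent line. The log line is invented for illustration; it mixes UTF-8 and Latin-1 bytes for the same character.]

```python
# Hypothetical log line: "é" appears once as UTF-8 (C3 A9) and once as Latin-1 (E9).
raw = b"user=caf\xc3\xa9 host=caf\xe9\n"

# Each non-ASCII byte is smuggled through as a lone surrogate (U+DC80..U+DCFF).
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))

# Manipulate only the ASCII parts of the text.
text = text.replace("user=", "USER=")

# Encoding with the same handler restores the original bytes exactly.
out = text.encode("ascii", errors="surrogateescape")
assert out == b"USER=caf\xc3\xa9 host=caf\xe9\n"
```

[No single "real" encoding would decode that line cleanly, yet the bytes come back out untouched.]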

> You will get errors if you try to write a string with escaped
> characters out to a non-ascii channel or an ascii channel
> without surrogateescape enabled. ... (e.g. sys.stdout)
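
[For concreteness, a small sketch of the failure being described. The sample bytes are made up; the encodings stand in for whatever channel the string is written to.]

```python
# Hypothetical input containing one non-ASCII byte.
text = b"caf\xe9".decode("ascii", errors="surrogateescape")   # 'caf\udce9'

failures = []
for encoding in ("ascii", "utf-8"):
    try:
        # Default error handler is "strict".
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        # The lone surrogate cannot be encoded without surrogateescape.
        failures.append(encoding)
        print(encoding, "failed:", exc.reason)

# With the handler enabled, the original byte comes back.
assert text.encode("ascii", errors="surrogateescape") == b"caf\xe9"
```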

Is there any reason not to enable surrogateescape by default?  At
least on the console/terminal?

I can see an argument for replace or xmlcharrefreplace or something more
complicated, but ... if I'm sending output to myself, I would rather
see it (possibly with a mark indicating where it was corrupted) than
have my program aborted (strict) and *not* be told what data caused
the problem.
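
A sketch of how the candidate handlers compare, assuming an ASCII-only output channel. The helper write_to_fake_terminal is hypothetical; it wraps an in-memory buffer with io.TextIOWrapper to stand in for a real console.

```python
import io

def write_to_fake_terminal(text, errors):
    """Write text to an in-memory ASCII 'terminal' and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# Hypothetical string carrying one escaped byte (0xE9).
text = b"caf\xe9".decode("ascii", errors="surrogateescape")

results = {}
for handler in ("strict", "replace", "xmlcharrefreplace",
                "backslashreplace", "surrogateescape"):
    try:
        results[handler] = write_to_fake_terminal(text, handler)
    except UnicodeEncodeError:
        # Nothing is written; the program would abort here.
        results[handler] = None

for handler, output in results.items():
    print(handler, "->", output)
```

Only strict produces no output at all. replace and backslashreplace leave a visible mark at the point of corruption, while surrogateescape passes the original byte through and leaves it to the terminal to display.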

> 4. Get a third party encoding guessing library and use that instead of
> waving away the problem of ASCII-incompatible encodings.

And I do think this needs to stay 3rd-party; domain information
matters, and n-gram guessing should not be subject to stability
guarantees.

-jJ