[Python-ideas] Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%]
Jim Jewett
jimjjewett at gmail.com
Tue Feb 14 21:20:23 CET 2012
On Mon, Feb 13, 2012 at 12:16 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Really, Python 3 forces programmers ...
> to make the choice between the 4 possible options for
> processing ASCII-compatible encodings:
> 1. Process them as binary data.
[Code smell from lying; lots of pain from mismatch with external libraries.]
> 2. Process them as "latin-1".
[Code smell from lying; non-ASCII often turns to gibberish.]
> 3. Process them as "ascii+surrogateescape". This is the *right*
> answer if you plan solely to manipulate the text and then write it back
> out in the same encoding as was originally received.
[Note that the original "encoding" may well be internally
inconsistent; I've often seen that in log files.]
> You will get errors if you try to write a string with escaped
> characters out to a non-ascii channel or an ascii channel
> without surrogateescape enabled. ... (e.g. sys.stdout)
Is there any reason not to enable surrogateescape by default? At
least on the console/terminal?
I can see an argument for replace or xmlcharrefreplace or something more
complicated, but ... if I'm sending output to myself, I would rather
see it (possibly with a mark indicating where it was corrupted) than
have my program aborted (strict) and *not* be told what data caused
the problem.
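To make the trade-off concrete, here is a minimal sketch of what each
error handler does with one escaped byte. The sample bytes and the
render() helper are made up for illustration, and an in-memory
TextIOWrapper stands in for an ASCII console; it is not a claim about
how sys.stdout is actually configured on any given platform.

```python
import io

# Mostly-ASCII input with one stray byte (0xE9) that is not valid ASCII.
raw = b"caf\xe9 au lait"

# Option 3: the undecodable byte is smuggled through as a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")

# Writing back with the same codec and handler restores the original bytes.
round_tripped = text.encode("ascii", errors="surrogateescape")

# The default handler (strict) refuses the escaped character outright.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True


def render(errors):
    """Hypothetical helper: write `text` to an ASCII stream using `errors`."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


as_replace = render("replace")             # bad byte shown as "?"
as_backslash = render("backslashreplace")  # bad byte shown as "\udce9"
as_escape = render("surrogateescape")      # original byte passed through
```

With "replace" the corruption is marked but the original value is lost;
"backslashreplace" marks it and keeps enough information to see which
byte was involved; "surrogateescape" emits the raw byte unchanged, which
may or may not be what a terminal can display.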
> 4. Get a third party encoding guessing library and use that instead of
> waving away the problem of ASCII-incompatible encodings.
And I do think this needs to stay 3rd-party; domain information
matters, and n-gram guessing should not be subject to stability
guarantees.
-jJ