Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%]

On Mon, Feb 13, 2012 at 12:16 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
> 1. Process them as binary data.

[Code smell from lying; lots of pain from mismatch with external libraries.]

> 2. Process them as "latin-1".

[Code smell from lying; non-ASCII often turns to gibberish.]
[Note that the original "encoding" may well be internally inconsistent; I've often seen that in log files.]
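
To make the "gibberish" and "internally inconsistent" points concrete, here is a minimal sketch. The sample line is made up for illustration (it is not from any real log); it just glues together the same word as written by a UTF-8 producer and a cp1252 producer.

```python
# Hypothetical log line: the same word written by two producers that
# disagree about the encoding.
utf8_part = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
cp1252_part = "caf\u00e9".encode("cp1252")   # b'caf\xe9'
log_line = utf8_part + b" | " + cp1252_part

# latin-1 maps every byte to a code point, so this decode cannot fail,
# but the UTF-8 half comes out as two wrong characters instead of one.
as_latin1 = log_line.decode("latin-1")
print(ascii(as_latin1))

# The bytes are still recoverable, which is the one thing latin-1 buys you.
assert as_latin1.encode("latin-1") == log_line

# A strict UTF-8 decode of the whole line fails on the cp1252 half.
try:
    log_line.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
```

So latin-1 keeps the program running and the bytes intact, but the text it hands you is only right for the part of the data that really was latin-1.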
Is there any reason not to enable surrogateescape by default? At least on the console/terminal? I can see an argument for replace or xmlcharrefreplace or something more complicated, but ... if I'm sending output to myself, I would rather see it (possibly with a mark indicating where it was corrupted) than have my program aborted (strict) and *not* be told what data caused the problem.
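
A rough sketch of the difference I mean, using only the standard codec error handlers. The byte string is an invented example, and the in-memory stream merely stands in for a console; this is not a claim about what the real console streams do by default.

```python
import io

# Hypothetical input: mostly ASCII, with one stray latin-1 byte.
raw = b"name=caf\xe9"

# surrogateescape smuggles the undecodable byte through as U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# strict: the write is aborted as soon as the lone surrogate is reached.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# backslashreplace: output survives, with a visible mark at the bad spot.
marked = text.encode("utf-8", errors="backslashreplace")
print(marked)

# surrogateescape on output: the original bytes come back unchanged.
restored = text.encode("utf-8", errors="surrogateescape")

# The same handler on a text stream (a BytesIO stands in for the terminal).
buf = io.BytesIO()
stream = io.TextIOWrapper(buf, encoding="utf-8", errors="surrogateescape")
stream.write(text)
stream.flush()
written = buf.getvalue()
```

With strict I learn nothing about the offending data; with either of the other two handlers I at least get to look at it.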
> 4. Get a third party encoding guessing library and use that instead of waving away the problem of ASCII-incompatible encodings.
And I do think this needs to stay 3rd-party; domain information matters, and n-gram guessing should not be subject to stability guarantees.

-jJ
participants (1): Jim Jewett