[Python-ideas] Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%]

Cameron Simpson cs at zip.com.au
Thu Feb 16 00:07:49 CET 2012


On 14Feb2012 10:17, Carl M. Johnson <cmjohnson.mailinglist at gmail.com> wrote:
| On Feb 14, 2012, at 10:04 AM, Jim Jewett wrote:
| > But is there a good reason not to change the default errorhandler to
| > errors="surrogateescape"?
| 
| It's a conflict in the Zen:
| 
| > Errors should never pass silently.
| > Unless explicitly silenced.
| 
| OK, so default to strict. But:

Yes.

| > Although practicality beats purity.
| 
| Hmm, so maybe do use surrogates. Then again:

No. Adding errors="surrogateescape" when needed is easy enough not to be
impractical.

(Also, it clearly flags in the code that we won't always get what we
expect/hope.)
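
A minimal sketch of that contrast, assuming input bytes that are not valid
UTF-8 (the sample value and the name `raw` are made up for illustration):

```python
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8 (hypothetical sample)

# The default handler is errors="strict": the stray byte raises.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Opting in is one extra keyword argument, visible at the call site.
text = raw.decode("utf-8", errors="surrogateescape")
```

Here the undecodable byte 0xE9 becomes the lone surrogate U+DCE9 in `text`,
so the result is not ordinary text and the call site says so.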

| > In the face of ambiguity, refuse the temptation to guess.
| 
| Grr, I'm not nearly Dutch enough to make sense of this logical conflict!

I'm not Dutch either (I can never remember which way P and V go in
semaphore operations, for example). However, the logic I would use is
very simple:

  I should know the encoding of these bytes.

  If I don't, and I merely have to suck them in and spit them back out again
  as bytes undamaged (such as when reading filesystem filenames, which can
  often be treated as opaque tokens), use errors="surrogateescape".

  Otherwise, arrange to know the encoding (or have enough authority to
  declare one, preferably UTF-8).

errors="surrogateescape" is for lossless but usually "blind"
decode/encode. The rest of the time it would be better to know what
you're doing.
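
As a rough sketch of what "lossless but blind" means in practice (the helper
name `roundtrip` and the sample filename are hypothetical, and UTF-8 is only
an assumed codec):

```python
def roundtrip(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode and re-encode without interpreting the undecodable bytes."""
    text = raw.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape")

# A filename-like token containing bytes that are invalid in UTF-8.
name = b"report-\xff\xfe.txt"
same = roundtrip(name) == name

# The intermediate str is only safe as an opaque token: a strict encode
# refuses the lone surrogates that stand in for the bad bytes.
text = name.decode("utf-8", errors="surrogateescape")
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

The bytes come back unchanged, but anything that treats `text` as real text
(strict encoding, for instance) will still trip over the escaped bytes.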

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

We don't just *borrow* words; on occasion, English has pursued other
languages down alleyways to beat them unconscious and rifle their pockets for
new vocabulary. - James D. Nicoll


