Unicode surrogateescape [was: Re: Python 3000 TIOBE -3%]

On Mon, Feb 13, 2012 at 12:12 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Paul Moore writes:
The preferred system encoding is indeed better than universal ASCII.

But is there a good reason not to change the default errorhandler to errors="surrogateescape"? errors="strict" is already well-documented, and the sort of people most eager to reject (rather than ignore) bad data also tend to be explicit about their use of defaults.

And if the barrier is only backwards-compatibility, is there any reason not to at least recommend a recipe of errors="surrogateescape" for cases where you expect ASCII, but want to round-trip other data just in case?

-jJ
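The recipe asked about here might look like the sketch below. The sample bytes and variable names are made up for illustration; only the errors="surrogateescape" handler itself comes from the message. It assumes input that is expected to be ASCII but contains one stray high byte.

```python
# Hypothetical input: mostly ASCII, with one stray Latin-1 byte (0xE9).
raw = b"caf\xe9 menu"

# With the default errors="strict", the bad byte is rejected outright.
strict_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    strict_error = exc

# With errors="surrogateescape", the undecodable byte 0xE9 is smuggled
# through as the lone surrogate U+DCE9 instead of raising.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("ascii", errors="surrogateescape")
```

The round trip only works if the same handler is used on the way out; the escaped characters are not meant to be displayed or interpreted.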

On Feb 14, 2012, at 10:04 AM, Jim Jewett wrote:
> But is there a good reason not to change the default errorhandler to errors="surrogateescape"?

It's a conflict in the Zen:

> Errors should never pass silently.
> Unless explicitly silenced.

OK, so default to strict. But:

> Although practicality beats purity.

Hmm, so maybe do use surrogates. Then again:

> In the face of ambiguity, refuse the temptation to guess.

Grr, I'm not nearly Dutch enough to make sense of this logical conflict!

On 14Feb2012 10:17, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
| On Feb 14, 2012, at 10:04 AM, Jim Jewett wrote:
| > But is there a good reason not to change the default errorhandler to
| > errors="surrogateescape"?
|
| It's a conflict in the Zen:
|
| > Errors should never pass silently.
| > Unless explicitly silenced.
|
| OK, so default to strict. But:

Yes.

| > Although practicality beats purity.
|
| Hmm, so maybe do use surrogates. Then again:

No. Adding errors="surrogateescape" when needed is easy enough not to be impractical. (Also, it clearly flags in the code that we won't always get what we expect/hope.)

| > In the face of ambiguity, refuse the temptation to guess.
|
| Grr, I'm not nearly Dutch enough to make sense of this logical conflict!

I'm not Dutch either (I can never remember which way P and V go in semaphore operations, for example). However, the logic I would use is very simple:

I should know the encoding of these bytes. If I don't, and I merely have to suck them in and spit them back out again as bytes undamaged (such as when reading filesystem filenames, which can often be treated as opaque tokens), use errors="surrogateescape".

Otherwise, arrange to know the encoding (or have enough fiat to declare one, preferably utf-8).

errors="surrogateescape" is for lossless but usually "blind" decode/encode. The rest of the time it would be better to know what you're doing.

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

We don't just *borrow* words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary.
- James D. Nicoli
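The "blind" decode/encode described in this reply can be sketched roughly as follows. The helper name passthrough and the sample filename are hypothetical, and utf-8 is used only because the message names it as the preferred encoding to declare.

```python
def passthrough(raw: bytes) -> bytes:
    """Decode bytes of unknown encoding, then re-encode them undamaged."""
    # Treat the data as an opaque token: undecodable bytes become lone
    # surrogates on the way in and are restored on the way out.
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical filename containing bytes that are not valid UTF-8.
name = b"report-\xff\xfe.txt"
restored = passthrough(name)

# The escaped text is not ordinary text: encoding it without the handler
# fails, which is the flag that it was only ever meant to pass through.
escaped = name.decode("utf-8", errors="surrogateescape")
leaked = False
try:
    escaped.encode("utf-8")
except UnicodeEncodeError:
    leaked = True
```

This fits the division of labour argued for above: the handler is written out explicitly where bytes only pass through, and everywhere else the code states a real encoding and keeps the strict default.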

participants (3)
- Cameron Simpson
- Carl M. Johnson
- Jim Jewett