[Python-Dev] PEP 383 update: utf8b is now the error handler

Tue May 5 19:45:45 CEST 2009

Stephen J. Turnbull wrote:
> MRAB writes:
> 
>  > > I don't think "people shouldn't be using non-ASCII-compatible
>  > > encodings for locale encodings" is a sufficient rationale for a hard
>  > > error here.  I mean, of course they *should* be using UTF-8.  Maybe
>  > > Python 3.1 should just go ahead and error on any other encoding on
>  > > POSIX platforms? <wink>
>  > > 
>  > I don't see why the error handler couldn't in principle be used with
>  > encodings other than UTF-8, although in that case all of the low
>  > surrogates should be open to use.
> 
> I should have been more clear here, I guess.  The error handler *can*,
> and in the PEP *will be* by default, used with all "sane" locale
> encodings on POSIX.
> 
>     It occurs to me that the PEP maybe should say that it is an error
>     to have your POSIX locale set to UTF-16 or something like that.
> 
> What "sane" means in this context is
> 
> 1.  ASCII NUL is the bytearray terminator, and can't be used as a byte
>     in a file name.  This rules out UTF-16, UTF-32, and widechar EUC
>     encodings, as well as some very rare ones.
> 
[snip]
It might be slightly OT, but strict UTF-8 is sometimes violated by
encoding U+0000 as the overlong 2-byte sequence 0xC0 0x80, so that 0x00
stays available as a terminator. I think I read that Microsoft sometimes
does this.
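
For anyone who wants to poke at this, here is a rough sketch of both
points. It assumes a current Python 3, where as far as I know the PEP 383
handler discussed here as utf8b is spelled "surrogateescape"; the variable
names are only for illustration.

```python
# Point 1 of "sane": UTF-16 puts NUL bytes inside ordinary ASCII text,
# so it cannot be used for NUL-terminated POSIX file names.
utf16_has_nul = b"\x00" in "a".encode("utf-16-le")

# The overlong two-byte form of U+0000 mentioned above.
overlong_nul = b"\xc0\x80"

# A strict UTF-8 decoder rejects the overlong sequence.
try:
    overlong_nul.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# With the error handler, each undecodable byte 0xNN is mapped to the
# lone low surrogate U+DCNN, and encoding restores the original bytes.
escaped = overlong_nul.decode("utf-8", errors="surrogateescape")
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

So the overlong pair is not decoded as U+0000; it comes through as two
escaped bytes and round-trips unchanged.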