[Python-Dev] PEP 383 update: utf8b is now the error handler

Stephen J. Turnbull stephen at xemacs.org
Tue May 5 19:31:28 CEST 2009


MRAB writes:

 > > I don't think "people shouldn't be using non-ASCII-compatible
 > > encodings for locale encodings" is a sufficient rationale for a hard
 > > error here.  I mean, of course they *should* be using UTF-8.  Maybe
 > > Python 3.1 should just go ahead and error on any other encoding on
 > > POSIX platforms? <wink>
 > > 
 > I don't see why the error handler couldn't in principle be used with
 > encodings other than UTF-8, although in that case all of the low
 > surrogates should be open to use.

I should have been clearer here, I guess.  The error handler *can*
be, and in the PEP by default *will* be, used with all "sane" locale
encodings on POSIX.

    It occurs to me that the PEP maybe should say that it is an error
    to have your POSIX locale set to UTF-16 or something like that.
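
For concreteness, here's a rough pure-Python model of the decode side
of the handler (the real implementation is in C; the registration
name below is my own placeholder):

    import codecs

    def utf8b_decode(exc):
        # Map each undecodable byte 0x80-0xFF to the lone low
        # surrogate U+DC00+byte, so the byte can be recovered on
        # re-encoding.  Bytes below 0x80 are refused, as in the PEP.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        out = []
        for b in exc.object[exc.start:exc.end]:
            if b < 0x80:
                raise exc       # never escape ASCII bytes
            out.append(chr(0xDC00 + b))
        return ''.join(out), exc.end

    codecs.register_error('utf8b-sketch', utf8b_decode)

    >>> b'\xff/tmp'.decode('utf-8', 'utf8b-sketch')
    '\udcff/tmp'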

What "sane" means in this context is

1.  ASCII NUL is the byte-string terminator, and can't be used as a
    byte in a file name.  This rules out UTF-16, UTF-32, and widechar
    EUC encodings, as well as some very rare ones.

2.  An ASCII character always translates to the Unicode character
    with the same code (i.e., "to itself"), and is never part of
    another sequence (an escape sequence, or the trailing byte of a
    multibyte character).  This rules out EBCDIC, ISO-2022-*, Shift
    JIS, and Big5, among the encodings I'm familiar with.  EBCDIC is
    out because an EBCDIC byte maps to the ASCII character with the
    same code only by accident.  The ISO-2022-* encodings are out
    because ASCII characters are used in escape sequences.  Shift JIS
    and Big5 are out because in those encodings a high-bit-set octet
    signals the start of a multibyte sequence, and some of the
    trailing bytes may be in the ASCII range.
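
Here's that rough check.  It's a heuristic sketch (the function name
is mine, and it only probes lone bytes and two-byte sequences), not a
proof; note that UTF-16 and UTF-32 already fail the first loop, so
condition 1 is covered incidentally:

    def is_ascii_transparent(encoding):
        # Condition 2, first half: every ASCII byte decodes to itself.
        for b in range(0x80):
            try:
                if bytes([b]).decode(encoding) != chr(b):
                    return False
            except UnicodeDecodeError:
                return False
        # Condition 2, second half: no high-bit-set lead byte may
        # swallow an ASCII byte as its trailer (the Shift JIS and
        # Big5 problem).
        for lead in range(0x80, 0x100):
            for trail in (0x40, 0x5C, 0x7E):   # sample ASCII bytes
                try:
                    decoded = bytes([lead, trail]).decode(encoding)
                except UnicodeDecodeError:
                    continue
                if len(decoded) == 1 and ord(decoded) > 0x7F:
                    return False   # the ASCII byte was swallowed
        return True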

What's left?  Well, UTF-8, all of the ISO-8859 sets, several national
standards (such as the KOI8 family for Cyrillic), IBM and Microsoft
"code pages", and the "packed" EUC encodings used for Japanese,
Chinese, and Korean.  These all share the property that ASCII is
ASCII, and all non-ASCII characters are encoded using only
high-bit-set octets.  In practice, these are invariably what you
encounter on Unix.
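
Feeding the sketch above the encodings just named bears this out
(reusing is_ascii_transparent from earlier):

    for enc in ('utf-8', 'iso-8859-15', 'koi8-r', 'cp1251', 'euc-jp'):
        assert is_ascii_transparent(enc)        # the "sane" group

    for enc in ('shift_jis', 'big5'):
        assert not is_ascii_transparent(enc)    # ASCII trail bytes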

So what's the problem?  Backward compatibility for Microsoft OSes,
which not only used MBCS national character sets, but also "cleverly"
packed more characters into the encoding by using ASCII octets as
trailing bytes.  I.e., the aforementioned "insane" Shift JIS (which
is mandated by the leading Japanese cellphone service provider even
today) and Big5 (the leading encoding for Chinese until very
recently).  These are very commonly found on archival media, and even
on USB keys and the like, which tend to be FAT-formatted.  This
doesn't prevent use of the Unicode APIs, but up to Windows 2000 most
Japanese vendors' OEM versions of Windows used the FAT format and
Shift JIS as the file system encoding, and I know of Japanese offices
where Windows 98 systems were in use as recently as early 2007.
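
The canonical demonstration is U+8868 (表), whose Shift JIS encoding
ends in 0x5C, the ASCII backslash:

    >>> '表'.encode('shift_jis')
    b'\x95\\'
    >>> b'\x95\\'.decode('shift_jis')
    '表'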

It's the removable media that are the problem: on Windows you just
use the Unicode APIs, but those aren't available on Unix, so you need
the byte-oriented APIs.
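
FWIW, the handler does round-trip such bytes when they show up under
a UTF-8 locale; a small demonstration, using the 'surrogateescape'
spelling that CPython uses for this handler:

    raw = b'\x95\\'                       # Shift JIS for U+8868
    name = raw.decode('utf-8', 'surrogateescape')
    # '\udc95\\': the 0x95 lead byte is smuggled as a lone surrogate,
    # but the ASCII trail byte 0x5C arrives as a literal backslash.
    assert name.encode('utf-8', 'surrogateescape') == raw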

Is this a real problem?  I don't know; I don't do Windows, I don't
do computing with my cellphone, and I don't need to get Japanese
filenames (which might be mixed with Russian ones!!) off of ancient
media or CIFS fileshares that use Shift JIS.  I suppose it's possible
that cellphones do everything in Shift JIS *except* writing filenames
to directories, where they use UTF-16 instead.

OTOH, it seems to me that an *optional* extension of the error
handler to ASCII bytes is technically feasible and would be nearly
trivial to add to the PEP.  The biggest cost would be adding the
errors argument to various functions (as Zooko requested) so that
surrogate-replace-extended could be specified if needed.
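
A sketch of what such a handler might look like on the decode side;
the name follows the discussion above, but the code is hypothetical:

    import codecs

    def surrogate_replace_extended(exc):
        # Hypothetical extension of 'utf8b': escape *every* byte of
        # a failed sequence, ASCII included, into U+DC00+byte, so
        # that e.g. Shift JIS trail bytes can round-trip too.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(0xDC00 + b) for b in bad), exc.end

    codecs.register_error('surrogate-replace-extended',
                          surrogate_replace_extended)

The encode side would need the matching inverse, mapping U+DC00
through U+DC7F back to raw bytes as well, before this round-trips.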

 > > Footnotes: 
 > > [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
 > >     once, in section 16.6, but the context is such that I take it to
 > >     refer to "half of the surrogate area".  Section 3.8 doesn't use
 > >     these, instead noting that "leading" and "trailing" are sometimes
 > >     used instead of "high" and "low".  Better to avoid the word "half"
 > >     in PEP 383, I think.
 > > 
 > "Leading" and "trailing" simply state the order, not the set ("high" or
 > "low"), so are not good terms to use.

But it's the order that's important.  If you've just finished reading
a character and encounter a trailing surrogate, then it was produced
by the 'utf8b' error handler; nothing else in a Python codec can do
that.  If you've just finished reading a character, are in a UTF-16
Python, and encounter a leading surrogate, then you immediately
gobble the following code unit, which must be a trailing surrogate,
and combine them to produce a character.  The remaining case is that
you encounter a valid character.  Anything else is an error, and
(assuming no bugs) no Python codec will produce anything else.
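
On a wide (UCS-4) Python, where each element of a string is a full
code point, that scan is a single pass; a sketch (the function name
is mine):

    def scan(s):
        # Classify each code point as described above.
        for ch in s:
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                yield 'escaped byte', cp - 0xDC00    # from 'utf8b'
            elif 0xD800 <= cp <= 0xDFFF:
                raise ValueError('stray surrogate U+%04X' % cp)
            else:
                yield 'character', ch

    >>> list(scan('\udc95/x'))
    [('escaped byte', 149), ('character', '/'), ('character', 'x')]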

 > >     This does imply that programs that take advantage of the error
 > >     handler specified in this PEP are on their own if they accept data
 > >     from any sources that are not known to be Unicode-conforming.
 > >     OTOH, as far as I can see if other sources are known to be Unicode
 > >     conformant, it's reasonably (but not perfectly) safe to combine
 > >     them with strings from this PEP (and of course use either 'utf8b'
 > >     or 'strict', as appropriate, when passing data out of Python).
 > > 
 > Should there be a function or method to check for conformance and
 > lone surrogates?

string.encode('utf-8', errors='strict') will do for now.
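
Wrapped up as a predicate (the name is mine):

    def is_conformant(s):
        # A string is free of 'utf8b' escapes (and any other lone
        # surrogates) iff it survives a strict UTF-8 encode.
        try:
            s.encode('utf-8', errors='strict')
            return True
        except UnicodeEncodeError:
            return False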


