[Python-Dev] PEP 383 update: utf8b is now the error handler

Michael Urman murman at gmail.com
Thu May 7 03:05:42 CEST 2009


On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Despite there being also an error handler called "surrogates".

Not that I have to be, but I'm not sold on the previous UTF-8 codec
behavior becoming an error handler of the name "surrogates" for two
reasons (I do respect the obvious PBP argument for the implementation,
and have no better name - "lenient"?).

First, unless there's a way to stack error handlers, there's no way to
access the old behavior combined with the "replace" handler. Second,
errors="surrogates" reads like surrogates should be an error, not an
additionally allowed pattern. Neither of these is a deal breaker or
hard to learn, but they are non-obvious. I think the utf8b behavior
makes a lot more sense with the name "surrogates", through the
mnemonic that errors become surrogates.
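
To make the two behaviors concrete, here is a minimal sketch. It assumes
the handler names that later shipped in Python 3.1, "surrogateescape" for
the utf8b behavior and "surrogatepass" for the old lenient codec behavior;
neither name appears in this thread, so read it as an illustration rather
than as what is being proposed here.

```python
# Sketch only: "surrogateescape" and "surrogatepass" are the names that
# later shipped in Python 3.1; this thread calls them utf8b and "surrogates".
raw = b"abc\xff"  # 0xFF can never appear in valid UTF-8

# utf8b idea: each undecodable byte becomes a lone low surrogate
# in the range U+DC80..U+DCFF ("errors become surrogates").
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "abc\udcff"

# Encoding with the same handler restores the original byte.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Old lenient behavior: surrogate code points are encoded like ordinary
# characters instead of being rejected.
lenient = "\ud800".encode("utf-8", errors="surrogatepass")
assert lenient == b"\xed\xa0\x80"

# With neither handler, a strict encode rejects the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Note that in the first case the handler produces surrogates from errors,
while in the second it merely tolerates surrogates that are already there.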

The stacking argument also applies to the new utf8b behavior, though
only on encode, since it handles all errors on decode. This may be YAGNI, but
for a non-UTF-8 encode, it may be useful to allow "xmlcharrefreplace"
handling for unavailable non-surrogate-escaped characters. But without
stacking that's unmaintainable, as we clearly don't want ${codec}b for
all current codecs.
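
For illustration, stacking can be approximated by hand with a custom
handler registered through codecs.register_error. The sketch below is
hypothetical: the handler name "escape_then_xmlcharref" is made up for
this example, and "surrogateescape" is again the name that later shipped
in Python 3.1. It relies on encode error handlers being allowed to return
bytes, which is the case in current Python 3 releases.

```python
import codecs


def escape_then_xmlcharref(exc):
    """Hypothetical stacked handler: undo byte escapes, else emit &#N;."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Escaped byte from an earlier decode: restore it verbatim.
            out.append(bytes([cp - 0xDC00]))
        else:
            # Any other unencodable character: XML character reference.
            out.append(("&#%d;" % cp).encode("ascii"))
    # Return the replacement bytes and the position to resume encoding at.
    return b"".join(out), exc.end


codecs.register_error("escape_then_xmlcharref", escape_then_xmlcharref)

# One escaped byte (0xFF) plus one character ASCII cannot represent.
text = b"name-\xff".decode("ascii", errors="surrogateescape") + " \u20ac"

# The escape handler alone fails on the euro sign ...
try:
    text.encode("ascii", errors="surrogateescape")
    plain_ok = True
except UnicodeEncodeError:
    plain_ok = False

# ... while the stacked handler deals with both cases.
result = text.encode("ascii", errors="escape_then_xmlcharref")
assert result == b"name-\xff &#8364;"
```

Writing one such combined handler per pairing is exactly the kind of
duplication that a general stacking mechanism would avoid.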

I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an
error handler (do we want both? YAGNI?). So what if it smells a little
inaccurate as a handler when used with codecs other than UTF-8? No big
deal. I could also see something like errors="roundtrip", which
explains the intention of the handler rather than the algorithm, but
is awkward on encode when it encounters unavailable Unicode
characters.

-- 
Michael Urman

