[Python-Dev] PEP 383 update: utf8b is now the error handler

Thu May 7 03:06:05 CEST 2009

Martin v. Löwis wrote:
>>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>>> design
>>> Where is that design documented, and how exactly does the name
>>> violate the design (chapter and verse, please).
>> Martin, I designed the whole Python codec machinery
> 
> Not true. PEP 293 was written and designed by Walter Dörwald.

Walter added the generic error handler callback mechanism and
we both worked on its design.

I designed and wrote the codec implementation back in 2000,
which included the whole idea of having codec error handlers in the
first place.

The original implementation only allowed per-codec
error handlers. Walter extended this to build general-purpose
handlers that could be used by many codecs. His original
motivation was to be able to do XML character reference
escaping.
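
To illustrate what a general-purpose handler looks like in practice, here
is a minimal sketch using the codecs.register_error() API from PEP 293.
The handler name "hexescape" and the function hex_escape are made up for
this example; "xmlcharrefreplace" is the built-in handler for XML
character references. The same registered handler works with any codec:

```python
import codecs


def hex_escape(exc):
    """Hypothetical handler: replace unencodable characters with <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "hexescape" is an illustrative name, not a standard handler.
codecs.register_error("hexescape", hex_escape)

# Built-in generic handler for XML character reference escaping.
xml_escaped = "Dörwald".encode("ascii", "xmlcharrefreplace")

# The same custom handler used with two different codecs.
custom_ascii = "Dörwald".encode("ascii", "hexescape")
custom_latin1 = "€5".encode("latin-1", "hexescape")
```

Because the handler is looked up by name at encode time, it is not tied
to any one codec.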

If you don't believe me, go look this up in the repository, the
mailing list archives and the trackers.

>> so even if
>> this is not explicitly written down somewhere, you can take my
>> word for it.
> 
> If the design was specified in writing somewhere, I would probably
> challenge it as obsolete. If it isn't described anywhere, I'll have
> to ignore it.

Ah, lovely attitude.

>> I want to avoid any such confusion with Python codecs and don't
>> understand why you are making a problem out of this.
> 
> Because utf8b (or, perhaps "UTF-8b") is the official name for this
> algorithm:
> 
> http://hyperreal.org/~est/utf-8b/

That's a codec implementing the escaping idea proposed by Markus
Kuhn, not an official reference. AFAIK, the term "UTF-8B" originated
from a "UTF-8 + binary" codec written for iconv:

    http://mail.nl.linux.org/linux-utf8/2006-04/msg00002.html

If it were the official name of an escape algorithm, as you are
suggesting, the inventor Markus Kuhn would probably have chosen
it, but he hasn't... the only reference to it is an email where it
is described as option D for ways of dealing with malformed
UTF-8 data in a decoder:

    http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Note that this escape method is not applicable to data that
you decode from UTF-8 and then e.g. encode as Latin-1. It only
works as a general-purpose method if you are decoding and encoding
using the same codec, since it is specifically designed to
ensure round-trip safety.
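
A small sketch of the round-trip point, assuming Python 3.1 or later,
where the PEP 383 handler is available under the name "surrogateescape"
(not the name being discussed in this thread). The byte string is an
invented example:

```python
# Invented example: a file name containing a byte that is invalid in UTF-8.
raw = b"file\xff.txt"

# Decoding escapes the undecodable byte as a lone surrogate (U+DCFF).
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")

# Encoding the escaped text with a different codec (strict mode) fails,
# because the lone surrogate has no Latin-1 representation.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The escaped character only has a meaning to an encoder that knows about
the escaping scheme, which is why the method is tied to the decode/encode
pair.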

Martin, please stop being silly and just change the name.

Or drop the idea of using an error handler altogether and just let
people use the utf-8b codec you referenced above to solve their
problems wherever and if needed.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
