[Python-ideas] Add "htmlcharrefreplace" error handler

Andrew Barnert abarnert at yahoo.com
Sat Jun 15 02:13:48 CEST 2013


From: Amaury Forgeot d'Arc <amauryfa at gmail.com>
Sent: Friday, June 14, 2013 7:31 AM


>2013/6/14 M.-A. Lemburg <mal at egenix.com>
>
>>>> By the way, why is it necessary to register?
>>>> Since an error handler is defined by its callback function,
>>>> we could allow functions for the "errors" parameter.
>>>
>>> For the same reason we register modules in sys.modules:
>>> to be able to reference them by name, rather than by object.
>>>
>>> Also note that codecs expect to get the error parameter as string
>>> to keep the API simple and to make short-cuts easy to implement
>>> in the code (esp. in the C implementations).

The simplicity argument is pretty clear. Everywhere the docs/docstrings/comments explain how errors strings work, they'd also have to explain that it can be a callable instead, and that callables don't have to be passed to PyCodec_LookupError/codecs.lookup_error but can (which will return the argument as-is), and …

Less seriously, it would make the analogy between the codec registry and the error handler registry weaker (therefore a bit more to learn), and it would make it a bit harder to distinguish in code between the pre-looked-up string-or-callable PyObject * and the post-looked-up callable PyObject * (something you don't even have to think about today).
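For anyone who hasn't used the registry, here is a minimal sketch of how name-based lookup works today. The handler and the name "htmlcharrefreplace_demo" are my own illustration, not the implementation proposed in this thread; the only real APIs used are codecs.register_error, codecs.lookup_error, and html.entities.codepoint2name.

```python
import codecs
from html.entities import codepoint2name


def html_charref_replace(exc):
    # Illustrative handler (an assumption, not the proposed stdlib one):
    # use a named HTML entity when one exists, else a numeric reference.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if cp in codepoint2name:
            pieces.append("&%s;" % codepoint2name[cp])
        else:
            pieces.append("&#%d;" % cp)
    # Return the replacement text and the position to resume encoding at.
    return "".join(pieces), exc.end


# Register under a made-up name so codecs can refer to it by string.
codecs.register_error("htmlcharrefreplace_demo", html_charref_replace)

encoded = "caf\u00e9 \u20ac".encode("ascii", errors="htmlcharrefreplace_demo")
looked_up = codecs.lookup_error("htmlcharrefreplace_demo")
print(encoded)    # b'caf&eacute; &euro;'
```

The codec only ever sees the string; the callable comes back out of the registry via lookup_error, which is the pre-looked-up/post-looked-up distinction mentioned above.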

But I'm not sure it really saved any effort in implementing codecs. Conceivably, someone could take advantage of the string value of the errors, but everything I can find in a quick skim of _codecsmodule.c and unicodeobject.c and everything I could find online does one of three things: (a) ignore it, (b) if (error) handler = PyCodec_LookupError(error), or (c) pass error along untouched to another function which does one of the above.

So really, almost all code both in the stdlib and out would be the same, except that the ones implemented in C would be parsing an "O" arg instead of a "z".
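In Python terms, pattern (b) with a string-or-callable argument might look roughly like the sketch below. resolve_error_handler is a hypothetical helper I made up for illustration; nothing like it exists in codecs today, and the real change would live in the C lookup path.

```python
import codecs


def resolve_error_handler(errors="strict"):
    # Hypothetical: accept either a registered name or a handler callable.
    if callable(errors):
        # Already a handler, so hand it back as-is.
        return errors
    # Otherwise fall back to today's name-based registry lookup.
    return codecs.lookup_error(errors)


def my_handler(exc):
    # Made-up handler used only to show the pass-through case.
    return "?", exc.end


by_name = resolve_error_handler("replace")
by_object = resolve_error_handler(my_handler)
```

Code that only passes errors along, or ignores it, would not need to change at all under this scheme.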

>>Here's the PEP: http://www.python.org/dev/peps/pep-0293/


The PEP doesn't actually explain the rationale for why it doesn't use a more complicated string-or-callable API like the one I described above.

Which is perfectly reasonable. Nobody asked for it until more than a decade later, and I'm not sure how good an idea it is. Borrowing a time machine to add code people will ask for years later is impressive; borrowing a time machine to add explanations for why they won't be able to have it when they ask years later would just be silly.


>import.c was once rewritten to accept PyObject everywhere,
>maybe unicode codecs could have a double API as well?
>Yes, it's a lot of work.

I don't think changing PyCodec*/_codecs/codecs is that much work. (M.-A. Lemburg can correct me if I'm wrong.)

The big problem is the fact that the API that every codec (including third-party codecs) must implement has to change, which means you end up needing two different codec interfaces, two different registries (or one dual-type registry), etc. And I think that parallel system might have to stick around until Py4k, or at least for quite a few 3.x versions.

Plus, you have to think through the API. Does Python or C-API code need to be able to distinguish old-style and new-style codecs? (If not, what happens when you pass an error by callable to what turns out to be an old-style codec? "TypeError" seems like the obvious answer, but then it's not really true that you can pass a callable as an error handler, unless you have some out-of-band knowledge about the codec you're going to be using.) Also: while nearly any third-party codec written in Python would just magically work as a new-API codec, "nearly" isn't good enough. And there's no way to test. Which means all such existing codecs have to be treated as old-API codecs, which sucks.
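For comparison, this is what passing a callable does today, at least in the CPython versions I know of: str.encode insists that errors be a string, so the call fails before any codec is consulted. The handler name is made up for the example.

```python
def question_mark(exc):
    # Made-up handler: replace the unencodable span with "?".
    return "?", exc.end


try:
    # errors must be a str today, so this is rejected up front.
    "caf\u00e9".encode("ascii", errors=question_mark)
except TypeError:
    outcome = "TypeError"
else:
    outcome = "accepted"
```

An old-style codec raising the same TypeError under a new API would be consistent with this, but it leaves the caller guessing which codecs accept callables.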


In other words, even though I don't think it would actually take much work, and I like the idea, I can't see any way of fleshing out the idea that wouldn't make me hate it.

Except for the obvious one: wait until py4k and just break the PyCodec* and codec-implementation interfaces.

