[Python-Dev] PEP 293, Codec Error Handling Callbacks

Martin v. Loewis martin@v.loewis.de
05 Aug 2002 23:06:25 +0200


Oren Tirosh <oren-py-d@hishome.net> writes:

> With the ability to embed functions inside a charmap big5 and other
> encodings could be converted to be charmap based, too :-)

This is precisely what PEP 293 does: allow to embed functions in any
codec.

> I just feel that there must be *some* simpler way. 

Why do you think so? It is not difficult.

> A patch with 87k of code scares the hell out of me.

Ah, so it is the size of the patch? Some of it could be moved to
Python perhaps, thus reducing the size of the patch (e.g. the registry
comes to mind)

If you look at the patch, you see that it precisely does what you
propose to do: add a callback to the charmap codec:

- it deletes charmap_decoding_error
- it adds state to feed the callback function
- it replaces the old call to charmap_decoding_error by

! 	    outpos = p-PyUnicode_AS_UNICODE(v);
! 	    startinpos = s-starts;
! 	    endinpos = startinpos+1;
! 	    if (unicode_decode_call_errorhandler(
! 		 errors, &errorHandler,
! 		 "charmap", "character maps to <undefined>",
! 		 starts, size, &startinpos, &endinpos, &exc, &s,
! 		 (PyObject **)&v, &outpos, &p)) {#

  (original code was)

! 	    if (charmap_decoding_error(&s, &p, errors, 
! 				       "character maps to <undefined>")) {

- likewise for encoding.

Now, apply the same change to all other codecs (as you propose to do
for big5), and you obtain the patch for PEP 293.

In doing so, you find that the modifications needed for each codec are
so similar that you add some supporting infrastructure, and correct
errors in the existing codecs that you spot, and so on. 

The diffstat is

 Include/codecs.h        |   37
 Include/pyerrors.h      |   67 +
 Lib/codecs.py           |    5
 Modules/_codecsmodule.c |   61 +
 Objects/stringobject.c  |    7
 Objects/unicodeobject.c | 1794 +++++++++++++-------!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 Python/codecs.c         |  399 ++++++++++
 Python/exceptions.c     |  603 ++++++++++++++++
 8 files changed, 1678 insertions(+), 236 deletions(-), 1059 modifications(!)

If you look at the large blocks of new code, you find that it is in

- charmap_encoding_error, which insists on implementing known error
  handling algorithms inline,

- the default error handlers, of which atleast
  PyCodec_XMLCharRefReplaceErrors should be pure-Python

- PyCodec_BackslashReplaceErrors, likewise,

- the UnicodeError exception methods (which could be omitted, IMO).

So, if you look at the patch, it isn't really that large.

Regards,
Martin