Oren Tirosh
With the ability to embed functions inside a charmap, big5 and other encodings could be converted to be charmap-based, too :-)
This is precisely what PEP 293 does: it allows functions to be embedded in any codec.
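For illustration, here is a minimal sketch of such a callback. It assumes the Python-level registration function that PEP 293 proposes, codecs.register_error, behaves as it does in current Python. The handler name hex_escape and the <U+XXXX> output format are invented for this example and are not part of the patch.

```python
import codecs


def hex_escape(exc):
    """Hypothetical error callback: replace unencodable text with <U+XXXX>."""
    # This sketch only deals with encoding errors; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the input string; exc.start/exc.end delimit the bad span.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume encoding.
    return replacement, exc.end


# Register the callback under a name usable as the "errors" argument.
codecs.register_error("hex_escape", hex_escape)

encoded = "caf\u00e9 \u20ac".encode("ascii", "hex_escape")
print(encoded)
```

The same handler name can be passed to any codec, since the callback is looked up by the errors argument and not tied to one encoding.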
I just feel that there must be *some* simpler way.
Why do you think so? It is not difficult.
A patch with 87k of code scares the hell out of me.
Ah, so it is the size of the patch? Some of it could perhaps be moved to Python, thus reducing the size of the patch (the registry comes to mind).

If you look at the patch, you see that it does precisely what you propose to do: add a callback to the charmap codec:

- it deletes charmap_decoding_error
- it adds state to feed the callback function
- it replaces the old call to charmap_decoding_error by

! outpos = p-PyUnicode_AS_UNICODE(v);
! startinpos = s-starts;
! endinpos = startinpos+1;
! if (unicode_decode_call_errorhandler(
!      errors, &errorHandler,
!      "charmap", "character maps to <undefined>",
!      starts, size, &startinpos, &endinpos, &exc, &s,
!      (PyObject **)&v, &outpos, &p)) {

  (original code was)

! if (charmap_decoding_error(&s, &p, errors,
!      "character maps to <undefined>")) {

- likewise for encoding.

Now, apply the same change to all other codecs (as you propose to do for big5), and you obtain the patch for PEP 293. In doing so, you find that the modifications needed for each codec are so similar that you add some supporting infrastructure, and correct errors in the existing codecs that you spot, and so on.

The diffstat is

 Include/codecs.h        |   37
 Include/pyerrors.h      |   67 +
 Lib/codecs.py           |    5
 Modules/_codecsmodule.c |   61 +
 Objects/stringobject.c  |    7
 Objects/unicodeobject.c | 1794 +++++++++++++-------!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 Python/codecs.c         |  399 ++++++++++
 Python/exceptions.c     |  603 ++++++++++++++++
 8 files changed, 1678 insertions(+), 236 deletions(-), 1059 modifications(!)

If you look at the large blocks of new code, you find that it is in

- charmap_encoding_error, which insists on implementing known error handling algorithms inline,
- the default error handlers, of which at least PyCodec_XMLCharRefReplaceErrors should be pure-Python,
- PyCodec_BackslashReplaceErrors, likewise,
- the UnicodeError exception methods (which could be omitted, IMO).

So, if you look at the patch, it isn't really that large.

Regards,
Martin