[I18n-sig] Extending definition of errors argument to codecs

Tom Emerson tree@basistech.com
Tue, 15 May 2001 15:28:05 -0400

I'd like to propose an extension to the Codec error reporting mechanism:

The 'errors' argument to encode/decode et al. would be much more
useful as a callable object. The current semantics of 'strict',
'ignore', and 'replace' are trivially implemented in this scheme,
while allowing a specific application to extend a codec with custom
error handling if necessary. Something along the lines of:

class CodecError:
    def __call__(self, bytes):
        # Subclasses return replacement text or raise an exception.
        raise NotImplementedError

class CodecError_Replace ( CodecError ):
    def __call__(self, bytes):
        return u'\uFFFD'

class CodecError_Strict ( CodecError ):
    def __call__(self, bytes):
        raise UnicodeError("cannot map byte range to Unicode")

Why would this be useful? I'm working with text that purports to be in Big
5, but in fact it is encoded with CP950. CP950 is identical to Big 5
except that it has a handful of extra codepoints in the 0xF9 VDA block
(taken from the Eten extension). When using the current Big 5 codec on
these files I sometimes blow up because of these extended
characters. I would love to be able to do something like:

class CodecError_CP950 ( CodecError_Strict ):
    def __call__(self, bytes):
        if bytes == '\xf9\xd6':
            return u'\u7881'
        # Anything else is still an error: defer to the strict handler.
        return CodecError_Strict.__call__(self, bytes)

This effectively allows me to expand upon the repertoire encoded by
the codec without modifying the tables and rebuilding (as I do now as
a workaround), generating new tables, or whatever else.
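
To make the idea concrete, here is a rough sketch of how a decoder might
consult a callable 'errors' argument. Everything in it is hypothetical:
toy_decode, TOY_TABLE, and the handler functions are names made up for
illustration, the one-entry table only stands in for a real Big 5
mapping, and no existing codec works this way. Plain functions are used
instead of classes, since any callable would do.

```python
# Hypothetical sketch: none of these names exist in the codecs module.

def strict(raw):
    """Mimic errors='strict': refuse any unmappable byte sequence."""
    raise UnicodeError("cannot map byte range to Unicode: %r" % (raw,))

def replace(raw):
    """Mimic errors='replace': substitute U+FFFD."""
    return u'\ufffd'

def ignore(raw):
    """Mimic errors='ignore': silently drop the sequence."""
    return u''

def cp950_fallback(raw):
    """Map one CP950 extension character, otherwise stay strict."""
    if raw == b'\xf9\xd6':
        return u'\u7881'
    return strict(raw)

# Toy table standing in for the codec's real mapping (assumption:
# a single illustrative two-byte entry is enough for the sketch).
TOY_TABLE = {b'\xa4\x40': u'\u4e00'}

def toy_decode(data, errors=strict):
    """Decode data with TOY_TABLE, handing unmappable pairs to errors."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i:i + 1]
        if lead < b'\x80':
            # Single-byte ASCII passes straight through.
            out.append(lead.decode('ascii'))
            i += 1
            continue
        pair = data[i:i + 2]
        if pair in TOY_TABLE:
            out.append(TOY_TABLE[pair])
        else:
            # The callable decides: return text, or raise.
            out.append(errors(pair))
        i += 2
    return u''.join(out)
```

With this, toy_decode(data, errors=cp950_fallback) accepts the extra
character while every other unmappable sequence still raises, which is
the behaviour I am after without touching the tables.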

Food for thought. The above design is off-the-cuff, but I think it is
close to my thoughts on the matter.

OK, flame away.


Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"