[I18n-sig] Extending definition of errors argument to codecs

M.-A. Lemburg mal@lemburg.com
Tue, 15 May 2001 22:12:22 +0200


Tom Emerson wrote:
> 
> I'd like to propose an extension to the Codec error reporting mechanism:
> 
> The 'errors' argument to encode/decode et al. would be much more
> useful as a callable object. The current semantics of 'strict',
> 'ignore', and 'replace' are trivially implemented in this scheme,
> while allowing a specific application to extend a codec with custom
> error handling if necessary. 

This was already proposed some months ago. The problem with
this approach is that it seriously breaks binary compatibility
at the C level, since all C APIs declare the argument as
const char *errors.

The call interface would also have to be a little more context
aware, so that the callback actually has a chance of modifying
the current codec process -- simply returning a usable
replacement character isn't enough in the general case, where
you might want to be able to resync with the input stream in case
there's a break in synchronization.
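
To make the point about context concrete, here is a minimal sketch of
what such an interface might look like. Everything in it is an
assumption for illustration only: the toy decode_ascii() decoder, the
handler signature (data, pos) -> (replacement, resume_pos), and the
resync_handler name are hypothetical and not part of any existing API.
It is written in present-day Python 3 syntax. The idea is that the
handler sees the whole input and the error position, and tells the
codec where to resume.

```python
def decode_ascii(data, on_error):
    """Toy ASCII decoder used only to illustrate the callback protocol.

    on_error(data, pos) must return (replacement_text, resume_pos).
    """
    out = []
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte < 0x80:
            out.append(chr(byte))
            pos += 1
        else:
            replacement, resume = on_error(data, pos)
            if resume <= pos:
                # Guard against handlers that would loop forever.
                raise ValueError("handler must advance past the error")
            out.append(replacement)
            pos = resume
    return "".join(out)


def resync_handler(data, pos):
    """Skip the whole run of undecodable bytes and emit one U+FFFD."""
    end = pos
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return "\ufffd", end


result = decode_ascii(b"ab\xf9\xd6cd", resync_handler)
```

A handler that only returned a replacement character could not express
the "skip to here" part, which is why the resume position is included
in the return value.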

If you can come up with a patch which maintains backward
compatibility, e.g. by adding a compatibility layer using
lots of PyUnicode_EncodeEx() APIs, there's a good chance of
getting this into the core.

Still, it's lots of work and I'm not sure whether it wouldn't
be more worthwhile to add these sorts of special error handling
schemes to the codecs in question rather than making them
a generic option for all codecs.
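
As a rough sketch of the codec-specific alternative, the following
wraps the standard "big5" codec with a small table of extra mappings.
The names CP950_EXTRAS and decode_big5_with_extras are hypothetical,
the table contains only the single mapping mentioned in this thread,
and the byte-walking logic assumes a simple layout of one byte below
0x80 or a two-byte pair otherwise. It uses present-day Python 3 syntax.

```python
# Hypothetical table of CP950 additions that plain Big5 lacks.
# Only the one mapping quoted in this thread is listed here.
CP950_EXTRAS = {b"\xf9\xd6": "\u7881"}


def decode_big5_with_extras(data, extras=CP950_EXTRAS):
    """Decode Big5 bytes, consulting `extras` before the base codec."""
    out = []
    pos = 0
    while pos < len(data):
        if data[pos] < 0x80:
            # Single-byte ASCII character.
            out.append(chr(data[pos]))
            pos += 1
            continue
        # Assumed layout: every other character is a two-byte pair.
        pair = data[pos:pos + 2]
        if pair in extras:
            out.append(extras[pair])
        else:
            # Strict decoding: raises UnicodeDecodeError for pairs
            # the base codec cannot map (or a truncated final pair).
            out.append(pair.decode("big5"))
        pos += 2
    return "".join(out)


text = decode_big5_with_extras(b"A\xf9\xd6B")
```

This keeps the extension local to one codec and leaves the generic
errors argument untouched, at the cost of a wrapper per codec.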

> Something along the lines of:
> 
> class CodecError:
>     def __call__(self, bytes):
>         pass
> 
> class CodecError_Replace ( CodecError ):
>     def __call__(self, bytes):
>         return u'\uFFFD'
> 
> class CodecError_Strict ( CodecError ):
>     def __call__(self, bytes):
>         raise UnicodeError, "cannot map byte range to Unicode"
> 
> Why would this be useful? I'm working with text that purports to be in Big
> 5, but in fact it is encoded with CP950. CP950 is identical to Big 5
> except that it has a handful of extra codepoints in the 0xF9 VDA block
> (taken from the Eten extension). When using the current Big 5 codec on
> these files I sometimes blow up because of these extended
> characters. I would love to be able to do something like:
> 
> class CodecError_CP950 ( CodecError_Strict ):
>     def __call__(self, bytes):
>         if bytes == '\xf9\xd6':
>             return u'\u7881'
>         return CodecError_Strict.__call__(self, bytes)
> 
> This effectively allows me to expand upon the repertoire encoded by
> the codec without modifying the tables and rebuilding (as I do now as
> a workaround), generating new tables, or whatever else.
> 
> Food for thought. The above design is off-the-cuff, but I think it is
> close to my thoughts on the matter.
> 
> OK, flame away.
> 
>     -tree
> 
> --
> Tom Emerson                                          Basis Technology Corp.
> Sr. Sinostringologist                              http://www.basistech.com
>   "Beware the lollipop of mediocrity: lick it once and you suck forever"
> 

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/