[I18n-sig] XML and codecs

M.-A. Lemburg mal@lemburg.com
Tue, 05 Jun 2001 11:02:37 +0200

Walter Doerwald wrote:
> On 01.06.01 at 23:23 M.-A. Lemburg wrote:
> > "Martin v. Loewis" wrote:
> > >
> > > As for XML and encodings, having a convenient mechanism to extend
> > > existing codecs to encode unknown characters as character entities is
> > > much more important, IMO, since that is very difficult to achieve with
> > > the existing API.
> >
> > Until we've found a backward compatible way to fix this, how
> > about adding a new error handling scheme which at least gives
> > the caller enough information to do some smart processing on the
> > input and output, e.g.
> >
> > errors="break":
> >
> >       raise a UnicodeBreakError with argument
> >         (reason, error_position_in_input, work_done_so_far)
> >
> > The caller could then use the information returned
> > by the codec to fix the input data and reuse the already
> > encoded/decoded data to avoid duplicate work.
>
> How would UTF-16 be handled? I guess without additional
> code multiple BOMs would be generated for a string that
> contains unencodable characters.

Why? You should know from the context which byte order is
currently in use and can thus pick the appropriate codec, UTF-16-LE
or UTF-16-BE. These don't generate BOMs.
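
As a quick check of the BOM behaviour, here is a small sketch for
present-day Python 3 (an assumption; it is not the interpreter this
thread was written against). The generic "utf-16" codec prepends a BOM
on every encode call, while the byte-order-specific codecs do not:

```python
import codecs

# The generic codec writes a BOM in front of every encoded chunk ...
generic = "ab".encode("utf-16")
assert generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ... so encoding a string piecewise yields one BOM per piece.
pieces = "a".encode("utf-16") + "b".encode("utf-16")
assert len(pieces) == 8   # 2 * (2-byte BOM + 2-byte code unit)

# The explicit byte-order codecs emit no BOM, so pieces concatenate cleanly.
le_pieces = "a".encode("utf-16-le") + "b".encode("utf-16-le")
assert le_pieces == "ab".encode("utf-16-le") == b"a\x00b\x00"
```

Restarting the encoder after an error is therefore only safe with the
-LE/-BE variants, which is the case described above.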

> > This scheme is very simple, but also very effective, since
> > it allows complex error processing to be done in the
> > namespace where the data is being processed (rather than
> > in a callback which wouldn't have access to this namespace).
>
> A callback could be a class instance with a __call__ method
> and so can have as much state information as it needs.

Sure, but it breaks the current API completely. The above
mechanism is different in that the communication in the error
case is done by means of an exception. While this is not as
fast as a callback, it does have some advantages:

* you can write the error handling code in the context using
  the codec

* it enables you to write error handling code at higher levels
  in the calling stack

* it fits in with the current API
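
A minimal sketch of how a caller might use such a scheme, written for
present-day Python 3. UnicodeBreakError and encode_break() are
hypothetical names made up for this illustration: no codec supports
errors="break", so the helper only emulates it on top of the standard
UnicodeEncodeError. The caller, encode_for_xml() (also hypothetical),
patches each unencodable character with an XML character reference and
keeps the output produced so far:

```python
class UnicodeBreakError(Exception):
    """Hypothetical exception carrying (reason, position, work_done)."""

    def __init__(self, reason, position, work_done):
        super().__init__(reason, position, work_done)
        self.reason = reason
        self.position = position      # index of the offending character
        self.work_done = work_done    # bytes encoded before the error


def encode_break(text, encoding):
    """Emulate the proposed errors="break" using strict encoding."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        # Emulation only: re-encode the good prefix. A codec with native
        # support would hand over the output it had already produced.
        done = text[:exc.start].encode(encoding)
        raise UnicodeBreakError(exc.reason, exc.start, done) from None


def encode_for_xml(text, encoding):
    """Caller-side error handling: bad characters become &#N; references."""
    out = []
    pos = 0
    while pos < len(text):
        try:
            out.append(encode_break(text[pos:], encoding))
            break
        except UnicodeBreakError as err:
            out.append(err.work_done)             # reuse the finished part
            bad = text[pos + err.position]
            ref = "&#%d;" % ord(bad)              # fix-up in the caller's scope
            out.append(ref.encode(encoding))
            pos += err.position + 1               # resume after the bad char
    return b"".join(out)
```

With these definitions, encode_for_xml("a\u20acb", "latin-1") should
give the same bytes as the built-in "xmlcharrefreplace" error handler.
Since the error travels as an exception, the except clause could just
as well sit several frames further up the calling stack.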

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/