[I18n-sig] XML entity codec (Re: Output encodings again)

M.-A. Lemburg mal@lemburg.com
Wed, 29 Nov 2000 09:46:24 +0100


uche.ogbuji@fourthought.com wrote:
> 
> MAL and MvL Earlier...
> 
> > > > It's not really all that hard to write codecs for Python 2.0.
> > > >
> > > > You'll have to do two things:
> > > > 1. write the codec by subclassing the base classes in codecs.py
> > > > 2. write a search function which returns the needed constructors
> > > >    and functions.
> > >
> > > So how would I write a codec that converts all characters to Latin-1,
> > > and converts those outside Latin-1 to &#xxx; (instead of the
> > > replacement character)? I'd need knowledge about which characters are in
> > > Latin-1, and I'd need to do conversion on a character-by-character
> > > basis, right?
> >
> > Right.
> >
> > > And I can't possibly use any of the _codecs helper
> > > functions?
> >
> > You could play some tricks with the character mapping codec
> > which is used by all code page codecs.
> >
> > You will achieve better performance with a native codec written
> > in C though.
> >
> > > This is certainly feasible if I want it for a single character set,
> > > but not if I want to do it wholesale for the entire set of character
> > > sets supported by Python 2.0.
> >
> > This is probably not possible since there's no way to have the
> > codecs use e.g. a callback function to handle error situations.
> >
> > But the situation is not all that bad: most codecs rely on the
> > character mapping codec and you could simply implement a new
> > version of it which does the XML escaping instead of raising
> > errors.
> 
> OK.  I began tackling this and gave all the sources a once-over.  I think I
> have a decent idea how to write a codec, but I'm not sure how the character
> map codec fits in.  I've looked at charmap.py, and maybe I'm cross-eyed, but
> inspiration isn't coming to me.
> 
> Might I have any pointers?  Any cheat-sheets?  I'll probably be implementing
> in C.

I think the easiest approach would be to use the C implementation of the
charmap codec as a template and work your way onward from there.

The codec provides all the necessary functionality, except that it
can't handle n-to-1 and 1-to-n mappings (several characters mapping to
one, or one character mapping to several), which you would obviously
need for XML entities. That shouldn't be hard to add, though...
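
To illustrate the 1-to-n case, here is a minimal Python sketch of a
charmap-style encoder that falls back to a numeric character reference
instead of raising an error. It is not the C charmap codec itself; the
names LATIN1_ENCODING_MAP and charmap_encode_xml are hypothetical, and
the decimal &#NNN; form is an assumption about the escape format wanted.

```python
# Hypothetical charmap-style table: Unicode code point -> Latin-1 byte value.
# Latin-1 maps the first 256 code points to the identical byte values.
LATIN1_ENCODING_MAP = {cp: cp for cp in range(256)}


def charmap_encode_xml(text, encoding_map=LATIN1_ENCODING_MAP):
    """Encode text through a mapping table, escaping unmapped characters.

    A character found in the table becomes a single byte (1-to-1).
    A character missing from the table becomes a decimal character
    reference such as b"&#8364;" (1-to-n) rather than an error.
    """
    out = bytearray()
    for ch in text:
        byte = encoding_map.get(ord(ch))
        if byte is not None:
            out.append(byte)
        else:
            out += b"&#" + str(ord(ch)).encode("ascii") + b";"
    return bytes(out)


encoded = charmap_encode_xml("caf\u00e9 \u20ac")
```

Passing a different table as encoding_map would give the same escaping
behaviour for another code page, which is the appeal of doing this at
the charmap level rather than per codec.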

BTW, I think such a codec would make a good addition to the
standard lib's encodings package, so if you are willing to
contribute this, I'd promote adding it to the package.
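
For reference, the two steps quoted earlier (subclass the base classes
in codecs.py, then write a search function) might look roughly like the
sketch below. It is written for current Python 3 rather than 2.0, it
handles only Latin-1, and the codec name "xmlcharref_latin1" and the
class XMLCharrefCodec are made up for illustration.

```python
import codecs

CODEC_NAME = "xmlcharref_latin1"  # hypothetical name, not a standard codec


class XMLCharrefCodec(codecs.Codec):
    """Step 1: a stateless codec built on the codecs.Codec base class."""

    def encode(self, input, errors="strict"):
        out = []
        for ch in input:
            try:
                out.append(ch.encode("latin-1"))
            except UnicodeEncodeError:
                # Not representable in Latin-1: emit a character reference.
                out.append(("&#%d;" % ord(ch)).encode("ascii"))
        return b"".join(out), len(input)

    def decode(self, input, errors="strict"):
        # Plain Latin-1 decoding; expanding the references is left to
        # whatever XML parser reads the data.
        return bytes(input).decode("latin-1", errors), len(input)


def search(name):
    """Step 2: a search function returning the codec's entry points."""
    if name != CODEC_NAME:
        return None
    codec = XMLCharrefCodec()
    return codecs.CodecInfo(name=CODEC_NAME,
                            encode=codec.encode,
                            decode=codec.decode)


codecs.register(search)

data = "caf\u00e9 \u20ac".encode(CODEC_NAME)
```

Stream readers and writers are omitted here; a codec meant for the
encodings package would need those as well.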

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/