[XML-SIG] Re: [4suite] Output encodings again
Tue, 12 Sep 2000 14:30:36 +0200
"Martin v. Loewis" wrote:
> [for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
> so that it comes out as "\251&#x1A9;"]
> > Currently, on output to XML (and HTML), we first convert the UTF-8 that
> > the DOM uses into Martin von Lowis's wchar type.
> It may be the time to slowly retire this type. It is still needed for
> 1.5 installations, but the 1.6/2.0 type has a comparable feature set
> yet an interface that is here to stay; plus it offers quite a few
> additional features.
> Still, I believe it shares this problem with my type.
> > So I'm rather at a loss as to how to efficiently escape such characters
> > for XML output. I know I want to render them as &#???;, but every
> > method I see for doing so is rather wasteful.
> In principle, the approach should be to introduce new encodings. That is,
> you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.
> These encodings are the same as the original ones, except that they
> have different error handling. This approach is possible both with my
> type and with the 2.0 type - however, implementing these encodings is
> quite some effort.
It's not really all that hard to write codecs for Python 2.0.
You'll have to do two things:
1. write the codec by subclassing the base classes in codecs.py
2. write a search function which returns the needed constructors
You will then have to register the search function using the APIs
in codecs.py. After having done that, the codec will be accessible via
the usual 2.0 methods, e.g. .encode() and unicode().
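The two steps and the registration can be sketched roughly as follows.
This is only an illustration, written against the current codecs module
rather than the 2.0 one: today a search function returns a
codecs.CodecInfo object, where the 2.0 API expected a plain tuple of
(encoder, decoder, stream reader, stream writer). The codec name
"latin-1-xml" is taken from the proposal quoted above; the class and
function names are made up for this example.

```python
import codecs


class Latin1XmlCodec(codecs.Codec):
    """Latin-1, except unencodable characters become &#NNN; references."""

    def encode(self, input, errors="strict"):
        out = bytearray()
        for ch in input:
            try:
                out += ch.encode("latin-1")
            except UnicodeEncodeError:
                # Not representable in Latin-1: emit a decimal
                # XML character reference instead of failing.
                out += ("&#%d;" % ord(ch)).encode("ascii")
        return bytes(out), len(input)

    def decode(self, input, errors="strict"):
        # Plain Latin-1 decoding; character references are left as-is
        # for the XML parser to resolve.
        return codecs.latin_1_decode(input, errors)


# Step 1: the stream classes reuse the stateless codec above.
class Latin1XmlStreamWriter(Latin1XmlCodec, codecs.StreamWriter):
    pass


class Latin1XmlStreamReader(Latin1XmlCodec, codecs.StreamReader):
    pass


# Step 2: the search function returns the constructors for our name only.
def search_latin1_xml(name):
    # Lookup names arrive lowercased; hyphen handling differs between
    # Python versions, so normalise before comparing.
    if name.replace("-", "_") != "latin_1_xml":
        return None
    codec = Latin1XmlCodec()
    return codecs.CodecInfo(
        name="latin-1-xml",
        encode=codec.encode,
        decode=codec.decode,
        streamwriter=Latin1XmlStreamWriter,
        streamreader=Latin1XmlStreamReader,
    )


codecs.register(search_latin1_xml)

encoded = "\u00a9\u01a9".encode("latin-1-xml")
```

With the search function registered, the example string from the top of
this thread encodes to the byte 0xA9 followed by a character reference
for U+01A9. The character-by-character loop is chosen for clarity, not
speed.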
Documentation is available in codecs.py itself, the various codecs
in the encodings/ package directory and Misc/unicode.txt.
For a good pure-Python implementation built using these techniques
have a look at the Japanese codecs which were recently announced
on the i18n sig-list.
Python Pages: http://www.lemburg.com/python/