[XML-SIG] HTML<->UTF-8 'codec'?
M.-A. Lemburg
mal@lemburg.com
Fri, 19 Oct 2001 20:08:48 +0200
"Fred L. Drake, Jr." wrote:
>
> Bill Janssen writes:
> > First off, this seems like an obvious thing to do, so has someone
> > already done it? Or is there some obvious flaw in the idea which
> > I just haven't seen?
>
> I haven't seen it, either, but it would be really nice. Most people
> don't want to end up with &#...; character references; they'd rather
> have the general entity references.
I've written one of these for a customer, but I can't release it.
Note that even though humans tend to like named entities a lot,
numeric character references are usually much easier to handle and
parse (just think of the hoops you have to jump through to get named
entities parsed correctly in XML: apart from the five predefined
ones, each has to be declared in a DTD first...).
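
To make the difference concrete, here is a minimal sketch (not the
customer codec mentioned above). It assumes Python 3, where the
"xmlcharrefreplace" error handler and the html.entities module are
available; the helper names to_numeric_refs and to_named_refs are
made up for illustration. Neither helper escapes "&", "<" or ">".

```python
from html.entities import codepoint2name


def to_numeric_refs(text):
    """Replace every non-ASCII character with a decimal &#N; reference."""
    # The error handler does all the work; no lookup table is needed.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_named_refs(text):
    """Use &name; where HTML defines a name, otherwise fall back to &#N;."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # plain ASCII, keep as-is
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])   # e.g. &eacute;
        else:
            out.append("&#%d;" % cp)                  # no HTML name exists
    return "".join(out)
```

The numeric version needs no table and its output can be read by any
XML parser; the named version depends on the HTML entity table both
when writing and when reading.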
> > Secondly, is there any documentation on the _codecs module, which
> > seems full of interesting and useful stuff for this purpose?
>
> No. There is limited documentation on the codecs module, though.
> If you'd like to extend that while you're at it, I'd certainly
> appreciate it!
The _codecs module is basically just a helper that makes the internal
(C-level) codecs available to Python code. All of these are documented
in great detail in the C API reference and the unicodeobject.h header
file.
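
As a rough illustration of how the pieces fit together, the sketch
below goes through the public codecs module rather than _codecs
directly. It assumes Python 3 string semantics; the sample string is
arbitrary.

```python
import codecs

# Look up a codec by name; the result bundles its encode/decode functions.
info = codecs.lookup("utf-8")

# Both functions return a (result, length_consumed) tuple.
encoded, consumed = info.encode("h\u00e9llo")
decoded, used = info.decode(encoded)

# The low-level helper from _codecs is re-exported by codecs under the
# same name and returns the same kind of tuple.
raw = codecs.utf_8_encode("h\u00e9llo")
```

Going through codecs.lookup() is usually the safer route, since the
names inside _codecs are an implementation detail.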
> > Thirdly, what's the equivalent of chr() for Unicode characters?
>
> unichr() is a built-in function which does this; see the docs if you
> need more information.
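
A short sketch of the round trip between code points and characters.
In Python 3, unichr() no longer exists and chr() covers the full
Unicode range, so the fallback below is only an assumption about
which interpreter runs the code.

```python
try:
    unichr              # Python 2: separate built-in for Unicode
except NameError:
    unichr = chr        # Python 3: chr() handles all code points

euro = unichr(0x20AC)   # code point -> one-character string
code = ord(euro)        # character -> code point again
```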
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/