[XML-SIG] Character classes
M.-A. Lemburg
mal@lemburg.com
Fri, 11 Jan 2002 18:52:30 +0100
Andrew Kuchling wrote:
>
> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names. It doesn't look
> like anything in the PyXML package actually implements them, though.
> ...
> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database. Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database? Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!
The Unicode 3.0 database is mostly backward compatible with Unicode 2.0,
except for a few well-documented changes. I don't think we should care
about those...
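
For illustration, here is a rough sketch of what "derive them from the
database" could look like. It assumes a modern Python 3 (whose unicodedata
is much newer than Unicode 2.0 or 3.0), covers only the BMP, and uses just
the general categories that Appendix B lists for name-start characters
(Ll, Lu, Lo, Lt, Nl) and other name characters (Mc, Me, Mn, Lm, Nd). It
ignores the exclusions and special cases that Appendix B also spells out,
so it is an approximation, not the REC's exact classes. All helper names
(ranges_for, class_body, NAME_RE) are made up for this example.

```python
import re
import unicodedata

# General categories that Appendix B names for the two kinds of name characters.
START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
OTHER_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def ranges_for(categories, limit=0x10000):
    """Return (first, last) code point ranges in the BMP whose general
    category is in `categories` (hypothetical helper)."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def class_body(ranges):
    """Format ranges as the inside of a regex character class."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append("\\u%04x" % first)
        else:
            parts.append("\\u%04x-\\u%04x" % (first, last))
    return "".join(parts)


start_body = class_body(ranges_for(START_CATEGORIES))
other_body = class_body(ranges_for(OTHER_CATEGORIES))

# Approximate XML 1.0 Name: a start character (or "_" / ":") followed by
# any number of name characters (plus "-", "." and the start characters).
NAME_RE = re.compile(
    "[_:" + start_body + "]" + "[-._:" + start_body + other_body + "]*"
)

print(bool(NAME_RE.fullmatch("element")))
print(bool(NAME_RE.fullmatch("1abc")))
```

The same range lists could be dumped once into a static table instead of
being recomputed at import time, which sidesteps the question of which
Unicode version the running interpreter ships with.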
> Also, there doesn't seem to be a C-level API for querying the Unicode
> database, which means there's no easy way to fix sgmlop.c.
What kind of API would you need? There are plenty of APIs in
Modules/unicodedata.c which we could expose via a PyCObject
(see mxDateTime for an example of how this is done).
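
To give an idea of the lookups that are already there, here is a small
sketch using the Python-level functions of the unicodedata module. Which
of these would actually be exported at C level, and under what names, is
an open question; the describe() helper is made up for this example.

```python
import unicodedata


def describe(ch):
    """Collect a few per-character properties from the Unicode database
    (hypothetical helper, Python-level only)."""
    return {
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "combining": unicodedata.combining(ch),
        "mirrored": unicodedata.mirrored(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


for ch in ("A", "\u0301", "\u00e9"):
    print("U+%04X" % ord(ch), describe(ch))
```

For sgmlop.c, the category lookup alone would probably be enough to test
name characters.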
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/