[XML-SIG] Character classes

M.-A. Lemburg mal@lemburg.com
Fri, 11 Jan 2002 18:52:30 +0100

Andrew Kuchling wrote:
> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names.  It doesn't look
> like anything in the PyXML package actually implements them, though.
> ...
> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database.  Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database?  Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!

The Unicode 3.0 database is mostly backward compatible with Unicode 2.0,
except for a few well-documented changes. I don't think we should care
about those...

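
As a rough illustration of deriving the classes from the database rather
than writing out the Appendix B ranges by hand, here is a minimal sketch
using the Python-level unicodedata module. The function names are made up
for this example. It only applies the general category rule from
Appendix B (Ll, Lu, Lo, Lt, Nl for name start characters; Mc, Me, Mn, Lm,
Nd for the other name characters) and ignores the listed exceptions, such
as the excluded compatibility characters, so it is an approximation and
will also reflect whatever Unicode version the interpreter ships with:

```python
import unicodedata

# General categories named in Appendix B of the XML REC.
# Assumption: the exceptions listed there are deliberately ignored here.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
NAME_OTHER_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def is_name_start_char(ch):
    """Approximate test for Letter | '_' | ':' (hypothetical helper)."""
    return ch in "_:" or unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch):
    """Approximate test for the NameChar production (hypothetical helper)."""
    return (
        is_name_start_char(ch)
        or ch in ".-"
        or unicodedata.category(ch) in NAME_OTHER_CATEGORIES
    )


def looks_like_xml_name(s):
    """Return True if s approximately matches the XML 1.0 Name production."""
    if not s:
        return False
    return is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])
```

A check like looks_like_xml_name("1abc") is rejected because a digit has
category Nd, which is allowed only after the first character.
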
> Also, there doesn't seem to be a C-level API for querying the Unicode
> database, which means there's no easy way to fix sgmlop.c.

What kind of API would you need? There are plenty of APIs in
Modules/unicodedata.c which we could expose via a PyCObject
(see mxDateTime for an example of how this is done).

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/