[XML-SIG] Character classes

Martin v. Loewis martin@v.loewis.de
Sat, 12 Jan 2002 01:08:26 +0100


> The Unicode 3.0 database is mostly backward compatible w/r to Unicode 2.0;
> except for a few well documented changes. I don't think we should care
> about those...

For strict XML conformance, one may want to worry; see
xml/xmlchargen. Also, it isn't easy to construct the XML character
classes with just the Python Unicode properties. For example, Python's
.isalpha() mostly matches XML's BaseChar class, except for the Roman
numerals, and the ESTIMATED SYMBOL, which got recategorized in 3.0.
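
As a rough illustration of that mismatch, here is a minimal sketch. It
assumes a current Python 3 interpreter, whose Unicode database is newer
than 3.0, and it relies on my reading of the BaseChar production in the
XML 1.0 character class appendix. The list name and the helper are made
up for the example.

```python
import unicodedata

# Assumption: XML 1.0 lists these code points in BaseChar, although the
# Unicode database no longer classifies them as letters:
# U+212E ESTIMATED SYMBOL and the Roman numerals U+2180..U+2182.
XML_BASECHAR_EXCEPTIONS = ["\u212e", "\u2180", "\u2181", "\u2182"]


def describe(ch):
    """Return (code point, general category, isalpha() result) for ch."""
    return ("U+%04X" % ord(ch), unicodedata.category(ch), ch.isalpha())


report = [describe(ch) for ch in XML_BASECHAR_EXCEPTIONS]
for row in report:
    print(row)
```

None of these characters pass .isalpha(), so a BaseChar test built on
.isalpha() alone would have to special-case them.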

For NameChar, the Python Unicode support does not offer anything
close. The regular expression \w is a strict superset: many characters
match \w but are not NameChars (e.g. SUPERSCRIPT TWO).
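
The SUPERSCRIPT TWO case can be checked with a short sketch. The
character class below is only the Latin-1 slice (code points below
U+0100) of NameChar, transcribed by hand from the XML 1.0 appendix, so
treat it as an assumption. The names are invented for the example, and
the exact contents of the result depend on the Unicode database of the
Python 3 interpreter in use.

```python
import re

# Assumption: Latin-1 slice of the XML 1.0 NameChar production
# (letters, digits, '.', '-', '_', ':' and the extender U+00B7).
NAMECHAR_LATIN1 = re.compile(
    "[-.0-9:A-Z_a-z\u00b7\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]"
)
WORD = re.compile(r"\w")  # Unicode-aware for str patterns in Python 3


def is_word(ch):
    """True if the single character ch matches \\w."""
    return WORD.fullmatch(ch) is not None


def is_namechar_latin1(ch):
    """True if ch is in the hand-copied Latin-1 slice of NameChar."""
    return NAMECHAR_LATIN1.fullmatch(ch) is not None


# Latin-1 characters that \w accepts but the NameChar slice rejects.
extras = [
    chr(cp)
    for cp in range(0x100)
    if is_word(chr(cp)) and not is_namechar_latin1(chr(cp))
]
print(["U+%04X" % ord(ch) for ch in extras])
```

SUPERSCRIPT TWO (U+00B2) shows up in the result, so \w cannot serve as
a NameChar test without an additional filter.
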
>  
> What kind of API would you need ? There are plenty APIs in 
> Modules/unicodedata.c which we could expose via a PyCObject
> (see mxDateTime for an example how this is done).

The information that an XML parser needs is simply not available.
For sgmlop, it would be best to copy the approach that pyexpat
uses; see extensions/expat/lib/nametab.h.
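
To show the general idea of a precomputed table, here is a small
sketch of a two-level bitmap lookup. It is written in the spirit of
such a name table, not copied from nametab.h: the layout, the function
names and the ranges (only the ASCII part of NameChar) are illustrative
assumptions.

```python
# Assumption: illustrative ranges only, the ASCII part of NameChar:
# '-' and '.', digits and ':', 'A'-'Z', '_', 'a'-'z'.
RANGES = [(0x2D, 0x2E), (0x30, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A)]


def build_table(ranges):
    """Return (page_index, bitmaps) covering the Basic Multilingual Plane.

    page_index maps the high byte of a code point to a position in
    bitmaps; each bitmap is a 256-bit integer, one bit per low byte.
    """
    bitmaps = [0]            # bitmap 0 is the shared all-zero page
    page_index = [0] * 256   # every page starts out pointing at it
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            page = cp >> 8
            if page_index[page] == 0:
                bitmaps.append(0)
                page_index[page] = len(bitmaps) - 1
            bitmaps[page_index[page]] |= 1 << (cp & 0xFF)
    return page_index, bitmaps


def lookup(table, ch):
    """True if the single character ch is flagged in the table."""
    page_index, bitmaps = table
    cp = ord(ch)
    if cp > 0xFFFF:          # this sketch only covers the BMP
        return False
    return bool((bitmaps[page_index[cp >> 8]] >> (cp & 0xFF)) & 1)


TABLE = build_table(RANGES)
print(lookup(TABLE, "a"), lookup(TABLE, "\u00b2"))
```

A real table would be generated once from the ranges in the XML
specification and compiled into the parser, so that each test costs two
index operations and a bit mask.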

Regards,
Martin