[XML-SIG] Character classes

Martin v. Loewis martin@v.loewis.de
Sat, 12 Jan 2002 00:57:58 +0100

> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names.  It doesn't look
> like anything in the PyXML package actually implements them, though.

Something does: have a look at xml.utils.characters. xmlproc currently
uses the expressions defined there to implement full Unicode support in
XML.

> For example, I've just run into this with 4DOM, where Document.py
> contains:
> #FIXME: should allow combining characters: fix when Python gets Unicode
> g_namePattern = re.compile('[a-zA-Z_:][\w\.\-_:]*\Z')

Yes, that needs to be fixed.
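
A minimal sketch of what a fixed pattern might look like, assuming
Python with Unicode strings. The ranges are only a short excerpt of the
Appendix B productions (the full lists are much longer), and all the
names are hypothetical; this is not the code in xml.utils.characters.

```python
import re

# Deliberately truncated excerpts of the Appendix B character classes.
# The real productions list many more ranges than are shown here.
BASE_CHAR = "\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
IDEOGRAPHIC = "\u4E00-\u9FA5\u3007\u3021-\u3029"
DIGIT = "\u0030-\u0039\u0660-\u0669"
COMBINING_CHAR = "\u0300-\u0345\u0360-\u0361"
EXTENDER = "\u00B7\u02D0\u02D1\u0387\u0640"

LETTER = BASE_CHAR + IDEOGRAPHIC

# Name ::= (Letter | '_' | ':') (NameChar)*
NAME_START = "[" + LETTER + "_:]"
NAME_CHAR = ("[" + LETTER + DIGIT + COMBINING_CHAR + EXTENDER
             + r"._:\-" + "]")

# Hypothetical replacement for a pattern like g_namePattern.
NAME_PATTERN = re.compile(NAME_START + NAME_CHAR + r"*\Z")
```

With this excerpt, a name may contain a combining character but may not
start with one, which is what the FIXME asks for.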

> Document.py would need to be changed, but so would xmlproc and
> doubtless other pieces of code.  

Document.py needs to be changed; xmlproc already has been. Python 1.5
is the hairy issue here, since xml.utils.characters requires Unicode
support, which Python 1.5 lacks.

> Therefore, there should be a separate module containing character
> info that both 4DOM and xmlproc could use.

There is.

> (xml/chars.py?)  But what should chars.py contain?  Strings? (BaseChar
> = "\u0041\u0042...")  Lists of legal characters?  (BaseChar = [0x41,
> 0x42, ...])  Something else?

Both strings and regular expressions.
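
To illustrate how the forms relate, here is a rough sketch of keeping a
list of code point ranges and deriving both a string and a regular
expression from it. The helper ranges_to_class and the constant names
are made up for this example, and the Digit ranges are only an excerpt
of the Appendix B production.

```python
import re

def ranges_to_class(ranges):
    """Turn (first, last) code point pairs into the body of a regex
    character class, e.g. [(0x30, 0x39)] becomes "0-9"."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(re.escape(chr(first)))
        else:
            parts.append(re.escape(chr(first)) + "-" + re.escape(chr(last)))
    return "".join(parts)

# Excerpt of the Digit production from Appendix B (not the full list).
DIGIT_RANGES = [(0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9)]

DIGIT = ranges_to_class(DIGIT_RANGES)       # string form, reusable in larger patterns
DIGIT_RE = re.compile("[" + DIGIT + "]")    # compiled form, for direct matching
```

The string form is the one that composes: a NameChar class can be built
by concatenating several such strings inside one pair of brackets.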

> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database.  Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database?  Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!

For strict conformance, we cannot. See utils/xmlchargen.py for a list
of differences.
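
As a rough illustration of why deriving the classes from the installed
database is not safe, the sketch below compares an excerpt of the
written-out BaseChar ranges with the letter categories that Appendix B
names for name start characters (Ll, Lu, Lo, Lt, Nl), as reported by
whatever Unicode database your Python ships. This is an assumption
about how one might check, not a description of what xmlchargen.py
does; the function names are hypothetical.

```python
import unicodedata

# Categories that Appendix B names for name start characters.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}

# Excerpt of the BaseChar ranges written out in Appendix B.
BASE_CHAR_RANGES = [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
                    (0x00D8, 0x00F6), (0x00F8, 0x00FF)]

def in_ranges(cp, ranges):
    """True if code point cp falls inside any (first, last) pair."""
    return any(first <= cp <= last for first, last in ranges)

def disagreements(ranges, start, stop):
    """Code points in [start, stop] where the installed database's
    letter categories and the written-out ranges disagree."""
    result = []
    for cp in range(start, stop + 1):
        by_database = unicodedata.category(chr(cp)) in NAME_START_CATEGORIES
        if by_database != in_ranges(cp, ranges):
            result.append(cp)
    return result

# Which Unicode version the running interpreter's database reflects.
print(unicodedata.unidata_version)
print([hex(cp) for cp in disagreements(BASE_CHAR_RANGES, 0x0000, 0x00FF)])
```

A flagged code point can come from a change between database versions
or from one of Appendix B's own exclusions (for example, characters
with compatibility decompositions), so a derived table would still need
checking by hand against the REC.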