[XML-SIG] Character classes
Martin v. Loewis
Sat, 12 Jan 2002 00:57:58 +0100
> Appendix B of the XML REC, at
> http://www.w3.org/TR/2000/REC-xml-20001006#CharClasses, specifies the
> Unicode characters that are allowed in element names. It doesn't look
> like anything in the PyXML package actually implements them, though.
Sure. Just have a look at xml.utils.characters. xmlproc currently uses
these expressions to implement full Unicode support in XML.
> For example, I've just run into this with 4DOM, where Document.py contains:
> #FIXME: should allow combining characters: fix when Python gets Unicode
> g_namePattern = re.compile('[a-zA-Z_:][\w\.\-_:]*\Z')
Yes, that needs to be fixed.
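As a rough illustration of what a fixed pattern might look like, here is a minimal sketch that builds the Name production from character classes. It is not the 4DOM or xmlproc code: the ranges are only the first few entries of each Appendix B class, and the names (BASE_CHAR, is_name, and so on) are made up for this example. It assumes a Python with Unicode strings.

```python
import re

# Partial excerpts of the Appendix B classes. These are NOT the complete
# lists; each production in the XML REC has many more ranges.
BASE_CHAR = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
IDEOGRAPHIC = "\u4E00-\u9FA5\u3007\u3021-\u3029"
DIGIT = "0-9\u0660-\u0669"
COMBINING_CHAR = "\u0300-\u0345\u0360-\u0361"
EXTENDER = "\u00B7\u02D0\u02D1"

# Letter ::= BaseChar | Ideographic
LETTER = BASE_CHAR + IDEOGRAPHIC

# Name ::= (Letter | '_' | ':') (NameChar)*
NAME_START = "[" + LETTER + "_:]"
# NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
NAME_CHAR = "[" + LETTER + DIGIT + COMBINING_CHAR + EXTENDER + r"._:\-" + "]"

name_pattern = re.compile(NAME_START + NAME_CHAR + r"*\Z")

def is_name(s):
    """Return True if s matches the (partial) Name production above."""
    return name_pattern.match(s) is not None
```

With this sketch, a name containing a combining character, such as "e" followed by U+0301, is accepted, while a name that starts with a digit or a combining character is rejected.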
> Document.py would need to be changed, but so would xmlproc and
> doubtless other pieces of code.
Document.py needs to be changed; xmlproc already has been. Python 1.5 is
the hairy issue here, since xml.utils.characters requires Unicode support,
which Python 1.5 lacks.
> Therefore, there should be a separate module containing character
> info that both 4DOM and xmlproc could use.
> (xml/chars.py?) But what should chars.py contain? Strings? (BaseChar
> = "\u0041\u0042...") Lists of legal characters? (BaseChar = [0x41,
> 0x42, ...]) Something else?
Both strings and regular expressions.
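One way to offer both forms without writing the data twice is to keep a single table of code point ranges and derive the string and the regular expression from it. The following is only a sketch under that assumption: the helper names are hypothetical, and the table holds just the first three BaseChar ranges from Appendix B rather than the full production.

```python
import re

# Hypothetical excerpt: the real BaseChar production has many more ranges.
BASE_CHAR_RANGES = [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6)]

def ranges_to_string(ranges):
    """Return a string containing every character in the given ranges."""
    return "".join(chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1))

def ranges_to_class(ranges):
    """Return a regex character class such as '[A-Za-z...]' for the ranges."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return "[" + "".join(parts) + "]"

# The two forms a shared module could export for each character class.
BASE_CHAR = ranges_to_string(BASE_CHAR_RANGES)
BASE_CHAR_RE = re.compile(ranges_to_class(BASE_CHAR_RANGES))
```

Callers that want a membership test can use `ch in BASE_CHAR`, and callers that assemble larger patterns can reuse the character class text from `ranges_to_class`.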
> Appendix B of the XML REC derives the character classes from the
> Unicode 2.0 character database. Should we just write out all the
> expressions from Appendix B as regex patterns, or derive them from the
> database? Note that Python comes with Unicode 3.0, so maybe we can't
> use the database at all!
For strict conformance, we cannot. See utils/xmlchargen.py for a list