[XML-SIG] Character classes
M.-A. Lemburg
mal@lemburg.com
Sat, 12 Jan 2002 14:33:08 +0100
Martin v. Loewis wrote:
>>The Unicode 3.0 database is mostly backward compatible with Unicode 2.0,
>>except for a few well-documented changes. I don't think we should care
>>about those...
>>
>
> For strict XML conformance, one may want to worry; see
> xml/xmlchargen.
The XML spec doesn't single out one specific Unicode
version; Unicode 3 is referenced in the spec as well:
http://www.w3.org/TR/REC-xml
OTOH, Letter is defined explicitly without reference to the
Unicode database:
http://www.w3.org/TR/REC-xml#NT-Letter
> Also, it isn't easy to construct the XML character
> classes with just the Python Unicode properties. For example, Python's
> .isalpha() mostly matches XML's BaseChar class, except for the Roman
> numerals, and the ESTIMATED SYMBOL, which got recategorized in 3.0.
>
> For NameChar, the Python Unicode support does not offer anything
> close. The regular expression \w is a strict superset, but it also
> matches many characters that are not NameChars (e.g. SUPERSCRIPT TWO).
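To make the mismatch concrete, here is a quick check one could run. It is
only a sketch, written for a current Python 3 rather than the interpreter
discussed in this thread, and the helper name describe() is made up for
illustration. It looks at one Roman numeral (U+2180) and ESTIMATED SYMBOL
(U+212E), both of which the XML spec lists under BaseChar, plus U+00B2
SUPERSCRIPT TWO as the \w example:

```python
import re
import unicodedata

# Code points taken from the discussion: a Roman numeral, ESTIMATED SYMBOL,
# and SUPERSCRIPT TWO.
SAMPLES = ["\u2180", "\u212e", "\u00b2"]


def describe(ch):
    """Report how Python's own Unicode support classifies one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
        "matches_w": re.fullmatch(r"\w", ch) is not None,
    }


report = {ch: describe(ch) for ch in SAMPLES}

for ch, info in report.items():
    print("U+%04X" % ord(ch), info)
```

On a recent interpreter neither U+2180 nor U+212E counts as alphabetic,
although XML treats both as BaseChar, while \w accepts U+00B2, which is
not a NameChar.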
In that case, I suppose you ought to simply create a database
similar to the one used by unicodectype.c, taking the explicit
character ranges defined in the XML spec as its reference and
providing an API for querying isLetter(), isBaseChar(), etc.
Tools/unicode/makeunicodedata.py has the needed tools to
generate such a table, so this shouldn't be too complicated.
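As a rough idea of what such a table-driven API could look like, here is
a pure-Python sketch. All names in it (BASECHAR_EXCERPT,
IDEOGRAPHIC_RANGES, make_predicate, is_base_char, is_ideographic,
is_letter) are hypothetical, and the BaseChar table is deliberately only
a small excerpt of the ranges in the spec; a real table would be
generated from the full productions, and a C version would more likely
use a compressed lookup table than a binary search:

```python
from bisect import bisect_right

# EXCERPT ONLY: a handful of the BaseChar ranges from the XML 1.0 spec,
# including the Roman numerals and ESTIMATED SYMBOL discussed above.
BASECHAR_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x00FF),
    (0x212E, 0x212E), (0x2180, 0x2182),
]

# The Ideographic production of the XML 1.0 spec.
IDEOGRAPHIC_RANGES = [
    (0x3007, 0x3007), (0x3021, 0x3029), (0x4E00, 0x9FA5),
]


def make_predicate(ranges):
    """Build a membership test over inclusive (low, high) code point ranges."""
    ranges = sorted(ranges)
    starts = [low for low, _ in ranges]

    def predicate(ch):
        cp = ord(ch)
        # Find the last range whose start is <= cp, then check its end.
        i = bisect_right(starts, cp) - 1
        return i >= 0 and cp <= ranges[i][1]

    return predicate


is_base_char = make_predicate(BASECHAR_EXCERPT)
is_ideographic = make_predicate(IDEOGRAPHIC_RANGES)


def is_letter(ch):
    """Letter ::= BaseChar | Ideographic, as in the XML spec."""
    return is_base_char(ch) or is_ideographic(ch)
```

With this, is_base_char(u"\u2180") is true even though .isalpha() is not,
because the answer comes from the spec's ranges rather than from the
Unicode database.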
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/