[XML-SIG] Character classes

Martin v. Loewis martin@v.loewis.de
Sat, 12 Jan 2002 21:20:35 +0100


> The XML spec doesn't mention a specific Unicode version.

It does; see below.

> OTOH, Letter is defined explicitly without reference to the
> Unicode database:
> 
> 	http://www.w3.org/TR/REC-xml#NT-Letter

The text below these productions refers explicitly to a particular
Unicode version and to the Unicode character database:

# The character classes defined here can be derived from the Unicode
# 2.0 character database as follows:
#
# * Name start characters must have one of the categories Ll, Lu, Lo,
#   Lt, Nl.
#
# * Name characters other than Name-start characters must have one of
#   the categories Mc, Me, Mn, Lm, or Nd.
# ...
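
As a rough illustration of the two rules quoted above, the category
test could be sketched in Python as follows. This is only a sketch
under stated assumptions: the function names are hypothetical, the
unicodedata module reflects whatever Unicode version the running
Python ships with (not Unicode 2.0), and the further rules elided
above are not modelled, so the result is not the exact XML character
classes.

```python
import unicodedata

# Categories taken from the two rules quoted from the XML spec.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
OTHER_NAME_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def is_name_start_candidate(ch):
    """True if ch has one of the name-start categories (hypothetical helper)."""
    return unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char_candidate(ch):
    """True if ch has a name-start category or one of the additional
    name-character categories (hypothetical helper)."""
    cat = unicodedata.category(ch)
    return cat in NAME_START_CATEGORIES or cat in OTHER_NAME_CATEGORIES
```

For example, is_name_start_candidate("a") holds, while a digit such as
"1" (category Nd) only passes is_name_char_candidate.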

> In that case, I suppose you ought to simply create a database
> similar to that used by unicodectype.c which uses the explicit
> character ranges defined in the XML spec as reference and
> provides API for querying isLetter(), isBaseChar() etc.

Python 2.2 was specifically enhanced to process large character
classes in regular expressions efficiently, so PyXML uses regular
expressions for this.
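
To show the general idea of a regular-expression character class built
from explicit ranges, here is a minimal sketch in current Python. It is
not PyXML's actual code: the names are hypothetical, and SAMPLE_RANGES
holds only the first few BaseChar ranges from the XML spec rather than
the complete table.

```python
import re

# Illustrative subset only: the first few BaseChar ranges from the
# XML 1.0 spec. A real table would list every range.
SAMPLE_RANGES = [
    (0x0041, 0x005A),
    (0x0061, 0x007A),
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x00F8, 0x00FF),
]


def build_class(ranges):
    """Turn (low, high) code point pairs into a regex character class.

    Assumes all code points are in the Basic Multilingual Plane, so the
    four-digit \\uXXXX escape is sufficient.
    """
    parts = ["\\u%04X-\\u%04X" % (lo, hi) for lo, hi in ranges]
    return "[" + "".join(parts) + "]"


SAMPLE_LETTER_RE = re.compile(build_class(SAMPLE_RANGES))


def is_sample_letter(ch):
    """True if the single character ch falls in one of SAMPLE_RANGES."""
    return SAMPLE_LETTER_RE.fullmatch(ch) is not None
```

With the complete range table from the spec substituted for
SAMPLE_RANGES, the same pattern would cover the whole class in a single
compiled expression.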

Regards,
Martin