[XML-SIG] Character classes
Martin v. Loewis
martin@v.loewis.de
Sat, 12 Jan 2002 21:20:35 +0100
> The XML spec doesn't mention a specific Unicode version.
It does; see below.
> OTOH, Letter is defined explicitly without reference to the
> Unicode database:
>
> http://www.w3.org/TR/REC-xml#NT-Letter
The text below these productions refers explicitly to a particular
Unicode version and to the Unicode database:
# The character classes defined here can be derived from the Unicode
# 2.0 character database as follows:
#
# * Name start characters must have one of the categories Ll, Lu, Lo,
# Lt, Nl.
#
# * Name characters other than Name-start characters must have one of
# the categories Mc, Me, Mn, Lm, or Nd.
# ...
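As a rough illustration of the two category rules quoted above, here is
a minimal sketch using Python's standard unicodedata module. The function
and constant names are made up for this example. Note two assumptions:
unicodedata reflects whatever Unicode version the interpreter ships
(see unicodedata.unidata_version), not Unicode 2.0, and the spec text
continues past the two rules quoted here, so the results are only
candidates and will not match the XML productions exactly.

```python
import unicodedata

# Categories taken from the two rules quoted from the XML spec.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
NAME_OTHER_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def is_name_start_candidate(ch):
    """True if ch falls in a category allowed for name start characters.

    Hypothetical helper: it uses the interpreter's Unicode database,
    not Unicode 2.0, and ignores any further rules in the spec.
    """
    return unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char_candidate(ch):
    """True if ch falls in any category allowed for name characters."""
    category = unicodedata.category(ch)
    return (category in NAME_START_CATEGORIES
            or category in NAME_OTHER_CATEGORIES)


if __name__ == "__main__":
    for ch in ["a", "1", "\u0301", " "]:
        print(repr(ch), unicodedata.category(ch),
              is_name_start_candidate(ch), is_name_char_candidate(ch))
```

A digit such as "1" (category Nd) is accepted as a name character but
not as a name start character, which matches the split between the two
rules.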
> In that case, I suppose you ought to simply create a database
> similar to that used by unicodectype.c which uses the explicit
> character ranges defined in the XML spec as reference and
> provides API for querying isLetter(), isBaseChar() etc.
Python 2.2 was specifically enhanced to process large character classes
in regular expressions efficiently, so that is what PyXML uses.
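To show the general idea of a regular expression built from explicit
character ranges, here is a small sketch. It is not PyXML's actual code:
the names are invented, it is written for current Python 3 rather than
Python 2.2, and it includes only the first few BaseChar ranges plus the
Ideographic ranges from the XML 1.0 spec, so it is deliberately
incomplete.

```python
import re

# Deliberately truncated: only the first five BaseChar ranges from the
# XML 1.0 spec. The real production lists many more ranges.
BASE_CHAR_SUBSET = (
    "\u0041-\u005A"
    "\u0061-\u007A"
    "\u00C0-\u00D6"
    "\u00D8-\u00F6"
    "\u00F8-\u00FF"
)

# Ideographic ranges from the XML 1.0 spec.
IDEOGRAPHIC = "\u4E00-\u9FA5\u3007\u3021-\u3029"

# One character class holding every range; compiled once and reused.
LETTER_SUBSET_RE = re.compile("[" + BASE_CHAR_SUBSET + IDEOGRAPHIC + "]")


def is_letter_subset(ch):
    """True if the single character ch is in the truncated Letter class.

    Hypothetical helper for illustration only.
    """
    return LETTER_SUBSET_RE.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["A", "\u00E9", "\u00D7", "\u4E2D", "1"]:
        print(repr(ch), is_letter_subset(ch))
```

The multiplication sign U+00D7 is rejected because it sits in the gap
between the ranges U+00C0-U+00D6 and U+00D8-U+00F6.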
Regards,
Martin