[Python-Dev] Regular expressions, Unicode etc.

Nick Maclaren nmm1 at cus.cam.ac.uk
Wed Aug 8 11:28:16 CEST 2007


I have needed to push my stack to teach REs (don't ask), and am
taking a look at the RE code.  I may be able to extend it to support
RFE 694374 and (more importantly) atomic groups and possessive
quantifiers.  While I regard such things as revolting beyond belief,
they make a HELL of a difference to the efficiency of recognising
things like HTML tags in a morass of mixed text.
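
To make the efficiency point concrete, here is a minimal sketch of what an atomic group buys, emulated with a construct the re module already has. It assumes the usual lookahead-plus-backreference trick as a stand-in; the names ATOMIC_RUN, gives_back, atomic_then_a and atomic_then_b are hypothetical, and this is not a proposal for how SRE would implement the feature.

```python
import re

# Emulated atomic group for a run of 'a': the lookahead captures the
# longest run, the backreference consumes exactly that run, and the
# engine cannot re-enter the lookahead to try a shorter one.
ATOMIC_RUN = r"(?=(a+))\1"

# Ordinary greedy quantifier: the engine may give back one 'a' so that
# the trailing literal 'a' can still match.
gives_back = re.compile(r"a+a")

# Atomic run followed by 'a': nothing is left for the literal, and no
# backtracking into the run is allowed, so this fails immediately.
atomic_then_a = re.compile(ATOMIC_RUN + "a")

# Atomic run followed by 'b': succeeds whenever the run is followed by 'b'.
atomic_then_b = re.compile(ATOMIC_RUN + "b")


def matches(pattern, text):
    """Return True if pattern matches at the start of text."""
    return pattern.match(text) is not None
```

The failure case is where the saving comes from: the atomic form gives up at once instead of retrying every shorter run.  Python releases from 3.11 onward accept (?>...) and a++ directly.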

The other approach, which is to stick to true regular expressions,
and wholly or partially convert to DFAs, has already been rendered
impossible by even the limited Perl/PCRE extensions that Python
has adopted.
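
As an illustration of why a pure DFA conversion is ruled out, backreferences are one extension that takes patterns outside the regular languages.  That choice of example is an assumption on my part, and the names same_run and balanced are hypothetical.

```python
import re

# (a+)b\1 demands the same number of 'a's on both sides of the 'b',
# i.e. the language a^n b a^n, which no finite automaton can recognise.
same_run = re.compile(r"(a+)b\1")


def balanced(text):
    """Return True if text is some run of 'a', a 'b', then the same run."""
    return same_run.fullmatch(text) is not None
```

For instance, balanced("aabaa") holds while balanced("aabaaa") does not.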

My first question is whether this would clash with any ongoing
work, including being superseded by any changes in Python 3000.

Note that I am NOT proposing to do a fixed task, but will produce
a proper proposal only when I know what I can achieve for a small
amount of work.  If the SRE engine turns out to be unsuitable to
extend in these ways, I shall quietly abandon the project.



My second question is about Unicode.  I really, but REALLY regard it
as a serious defect that there is no escape for printing characters.
Any code that checks arbitrary text is likely to need one - yes,
I know why Perl and hence PCRE don't have one, but let's skip
that.  It is easy to add, though choosing a letter is tricky.
Currently I am using \c and \C, for 'character' (I would prefer
'text' or 'printable', but \t is obviously insane and \P is asking
for incompatibility with Perl and Java).
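
For comparison, the kind of check such an escape would replace can be sketched outside the RE engine.  This assumes that "printing character" means whatever str.isprintable() accepts in Python 3 (it rejects the Unicode "Other" and "Separator" categories, except the ASCII space); the escape itself might well be defined differently, and the function name is hypothetical.

```python
def nonprinting_positions(text):
    """Return the indices of characters that str.isprintable() rejects."""
    return [i for i, ch in enumerate(text) if not ch.isprintable()]
```

Called on "a\x00b\tc" this reports positions 1 and 3, the NUL and the tab.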

But attempting to rebuild the Unicode database hasn't worked.
Tools/unicode is, er, a trifle incomplete and out of date.  The
only file I need to change is Objects/unicodetype_db.h, but my
initial attempts to run Tools/unicode/makeunicodedata.py have not
been successful.

I may be able to reverse engineer the mechanism enough to get
the files off the Unicode site and run it, but I don't want to
spend forever on it.  Any clues?
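
One small sanity check that does not depend on the build tooling: the unicodedata module reports which version of the Unicode database an interpreter was built with, so a successful rebuild can at least be confirmed afterwards.  A minimal sketch, with a hypothetical function name and an arbitrary sample character:

```python
import unicodedata


def database_summary(sample="\u00e9"):
    """Return the Unicode database version and the category of sample."""
    return unicodedata.unidata_version, unicodedata.category(sample)
```

The sample character is a lower-case letter, so its category should come back as "Ll".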


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679

