Python regular expressions just ain't PCRE

John Machin sjmachin at lexicon.net
Sat May 5 17:44:59 EDT 2007


On May 6, 1:52 am, Wiseman <Wiseman1... at gmail.com> wrote:
> On May 5, 5:12 am, "Terry Reedy" <tjre... at udel.edu> wrote:
>
> > I believe the current Python re module was written to replace the Python
> > wrapping of pcre in order to support unicode.
>
> I don't know how PCRE was back then, but right now it supports UTF-8
> Unicode patterns and strings, and Unicode character properties. Maybe
> it could be reintroduced into Python?

"UTF-8 Unicode" is meaningless. Python has internal unicode string
objects, with comprehensive support for converting to/from str (8-bit)
string objects. The re module supports unicode patterns and strings.
PCRE "supports" patterns and strings which are encoded in UTF-8. This
is quite different, a kludge, incomparable. Operations which inspect/
modify UTF-8-encoded data are of interest only to folk who are
constrained to use a language which has nothing resembling a proper
unicode datatype.
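To make the distinction concrete, here is a minimal sketch, assuming Python 3, where str is the unicode type and bytes is the 8-bit type (in the Python 2 of this thread the types would be unicode and str). The variable names are only for illustration. It runs the same two patterns over a unicode string and over its UTF-8 encoding:

```python
import re

text = "caf\u00e9"              # a unicode string: four characters, ending in e-acute
data = text.encode("utf-8")     # the same text as UTF-8: five bytes (e-acute is b'\xc3\xa9')

# On a unicode string, '.' consumes one character and \w knows that
# e-acute is a letter, so the whole word matches.
str_chars = re.findall(r".", text)
str_word = re.fullmatch(r"\w+", text)

# On UTF-8-encoded bytes, '.' consumes one byte and \w is ASCII-only,
# so the two bytes of e-acute are split and neither counts as a word character.
byte_units = re.findall(rb".", data)
byte_word = re.fullmatch(rb"\w+", data)

print(len(str_chars), str_word is not None)    # 4 True
print(len(byte_units), byte_word is not None)  # 5 False
```

The pattern has no idea that the byte string is UTF-8; it just sees five 8-bit values.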

>
> At least today, PCRE supports recursion and recursion check,
> possessive quantifiers and once-only subpatterns (disables
> backtracking in a subpattern), callouts (user functions to call at
> given points), and other interesting, powerful features.

The more features are put into a regular expression module, the more
difficult it is to maintain and the more the patterns look like line
noise.
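For anyone who hasn't met the "once-only subpattern" mentioned in the quoted text, here is a rough sketch of what it does, assuming Python 3. It uses the well-known workaround of a lookahead plus a backreference rather than any PCRE syntax, so it should run on versions of re that lack the feature (Python 3.11 and later also accept (?>...) directly). It happens to illustrate the line-noise point as well:

```python
import re

subject = "aaa"

# Ordinary greedy quantifier: 'a+' first takes all three letters, then
# backtracks to give one back so that the final 'a' can match.
backtracking = re.search(r"a+a", subject)

# Emulated once-only subpattern: the lookahead captures the longest run
# of 'a' into group 1, and \1 then consumes exactly that text. The engine
# does not go back into the lookahead to try a shorter run, so the final
# 'a' has nothing left to match at any starting position.
once_only = re.search(r"(?=(a+))\1a", subject)

print(backtracking.group())  # aaa
print(once_only)             # None
```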

There's also the YAGNI factor; most folk would restrict using regular
expressions to simple grep-like functionality and data validation --
e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
recognise yet another little language tend to reach for parsers, using
regular expressions only in the lexing phase.
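A minimal sketch of that kind of validation, assuming Python 3.4 or later for fullmatch; the function name is_valid_idno and the sample values are made up for illustration. One thing to watch with the match(...$) form is that "$" also matches just before a trailing newline, so fullmatch is the stricter check:

```python
import re

# The pattern from the example above: one or two capital letters,
# six digits, then a final character that is a digit or 'A'.
ID_PATTERN = re.compile(r"[A-Z][A-Z]?[0-9]{6}[0-9A]")


def is_valid_idno(idno):
    """Return True only if the entire string matches the ID pattern."""
    return ID_PATTERN.fullmatch(idno) is not None


# Hypothetical sample inputs, not real ID numbers.
for candidate in ["A1234567", "AB123456A", "a1234567", "A12345", "A1234567\n"]:
    print(repr(candidate), is_valid_idno(candidate))

# The match(...$) form lets a trailing newline through, because '$'
# matches just before a newline at the end of the string.
dollar_form = re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", "A1234567\n")
print(dollar_form is not None)  # True
```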

If you really want to have PCRE functionality in Python, you have a
few options:
(1) create a wrapper for PCRE using e.g. SWIG or Pyrex or hand-crafting
(2) write a PEP, get it agreed, and add the functionality to the re
module
(3) wait until someone does (1) or (2) for free
(4) fund someone to do (1) or (2)

HTH,
John



