(mostly-)POSIX regular expressions

Sébastien Boisgérault Sebastien.Boisgerault at gmail.com
Mon May 29 14:26:54 CEST 2006

John Machin wrote:
> On 29/05/2006 7:46 AM, Sébastien Boisgérault wrote:
> > Paddy wrote:
> >
> >> maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?
> >
> > Well, in the end, it doesn't fit. What I need is a "longest match"
> > policy in patterns like "(a)|(b)|(c)" and NOT a "left-to-right"
> > policy. Additionally, I need to be able to obtain the matched
> > ("captured") substring, and PCRE does not allow this in DFA mode.
> >
> Perhaps you might like to be somewhat more precise with your
> requirements.

Sure. More on this below.

> "POSIX-compliant" made me think of yuckies like [:fubar:]
> in character classes :-)

Yep. I do not need POSIX *syntax* for regular expressions but POSIX
*semantics*, at least the "leftmost-longest" part (in contrast to the
"first then longest" used in Python, Perl, .NET, etc.)
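
To make the difference concrete, here is a minimal sketch using only the
standard `re` module. The patterns "a|ab" and "ab|a" are illustrative
choices of mine, not patterns from the lexer:

```python
import re

# Python tries the alternatives from left to right and keeps the first
# one that succeeds, even if a later alternative would match more text.
left_to_right = re.match(r"a|ab", "ab").group(0)

# Reordering the alternatives by hand changes the result; a
# leftmost-longest engine would return "ab" for both orderings.
reordered = re.match(r"ab|a", "ab").group(0)

print(left_to_right, reordered)
```

With POSIX semantics the order of the alternatives would not matter here.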

> The operands of | are such that the length is not fixed and so you can't
> write them in descending length order? Care to tell us some more detail
> about those operands?

Basically, I'd like to use the (excellent) Python module SPARK
by John Aycock to build an (extended) C lexer. To do so, I need
to specify the patterns that match my tokens as well as a priority
between them. SPARK then builds one big alternation of patterns
that begins with the high-priority patterns and ends with the
low-priority patterns, and runs a match.

The problem with this approach is that you have to be very careful
and specify the priorities explicitly to get the desired results:
"<=" must have a higher priority than "<", decimal stuff higher than
integer, etc., when most of the time what you really want is to match
the longest pattern ...
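
A small sketch of the ordering pitfall, written with plain `re` rather
than SPARK itself. The group names LT and LE and the helper
`first_token` are hypothetical, chosen only for this illustration:

```python
import re

# Two hand-built alternations that differ only in priority order.
careless = re.compile(r"(?P<LT><)|(?P<LE><=)")   # "<" listed first
careful = re.compile(r"(?P<LE><=)|(?P<LT><)")    # "<=" listed first

def first_token(pattern, text):
    """Return (token name, matched text) for the token at the start of text."""
    m = pattern.match(text)
    return m.lastgroup, m.group(0)

careless_result = first_token(careless, "<= 1")
careful_result = first_token(careful, "<= 1")
print(careless_result, careful_result)
```

With "<" listed first, the input "<= 1" yields the token "<" and leaves
"=" behind; only the hand-ordered version recognizes "<=".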

Worse, the priority work-around does not work well when you
compare keywords and (other) identifiers. To match "fortune"
as an identifier, you would need to give identifiers a higher
priority than keywords, and that is a problem: "for" would then
be matched as an identifier when it is a keyword.
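
The dilemma can be reproduced with plain `re`. This is a sketch under
assumptions: the keyword list and the names KEYWORD, IDENT and `classify`
are made up for the example and are not SPARK's API:

```python
import re

KEYWORD = r"(?P<KEYWORD>for|if|while)"
IDENT = r"(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"

# The same two patterns, combined with opposite priorities.
keyword_first = re.compile(KEYWORD + "|" + IDENT)
ident_first = re.compile(IDENT + "|" + KEYWORD)

def classify(pattern, text):
    """Return (token name, matched text) for the token at the start of text."""
    m = pattern.match(text)
    return m.lastgroup, m.group(0)

# Keyword priority cuts the identifier "fortune" short ...
fortune_kw_first = classify(keyword_first, "fortune")
# ... and identifier priority swallows the keyword "for".
for_id_first = classify(ident_first, "for")
print(fortune_kw_first, for_id_first)
```

Neither ordering gets both "for" and "fortune" right.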

I can come up with possible work-arounds for the "id vs
keyword" issue, but nothing that really makes me happy ...
Therefore, I was studying the possible replacement of the
Python native regular expression engine with a "POSIX
semantics" regular expression engine that would give the
longest match and save me a lot of extra work ...
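
For reference, one of the work-arounds I have in mind can be sketched
with the standard `re` module alone: try every token pattern at the
current position and keep the longest match. This is not a POSIX engine
and not SPARK's API; the token table and the function `longest_token` are
hypothetical, and breaking ties by table order (as lex does) is an
assumption of the sketch:

```python
import re

# Hypothetical token table. Order matters only when two patterns
# match text of the same length: the earlier entry wins.
TOKENS = [
    ("KEYWORD", re.compile(r"for|if|while")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("LT", re.compile(r"<")),
    ("LE", re.compile(r"<=")),
]

def longest_token(text, pos=0):
    """Return (name, matched text) of the longest match at pos, or None."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        # Strict comparison keeps the earlier entry on equal length.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

print(longest_token("fortune"), longest_token("for ("), longest_token("<= 1"))
```

The matched substring is available and the order of "<" and "<=" no
longer matters, but the cost is one match attempt per pattern at every
position, and each individual pattern still follows Python's
left-to-right rule internally.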

I hope it's clearer now :)

Any advice?


