Python and Regular Expressions

Patrick Maupin pmaupin at gmail.com
Sat Apr 10 12:21:39 EDT 2010


On Apr 8, 5:13 am, Nobody <nob... at nowhere.com> wrote:
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
> >> Regular expressions != Parsers
>
> > True, but lots of parsers *use* regular expressions in their
> > tokenizers.  In fact, if you have a pure Python parser, you can often
> > get huge performance gains by rearranging your code slightly so that
> > you can use regular expressions in your tokenizer, because that
> > effectively gives you access to a fast, specialized C library that is
> > built into practically every Python interpreter on the planet.

> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.

Actually, there is some not-very-well-documented code in the re module
that will let you do exactly that.  But even without that code, using
a first cut of re.split() or re.finditer() to break the string apart
into tokens (without yet classifying them) is usually a huge
performance win over anything else in the standard library (or
anything you could write in pure Python) for this task.
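
(For the curious, the undocumented hook is the re.Scanner class.  Here
is a quick sketch; the token names and patterns are made up purely for
illustration:)

    import re

    # re.Scanner takes a list of (pattern, action) pairs, matches the
    # input against all of them at once, and calls the action for
    # whichever pattern matched.  An action of None skips the token.
    scanner = re.Scanner([
        (r'[a-zA-Z_]\w*', lambda s, tok: ('NAME', tok)),
        (r'\d+',          lambda s, tok: ('NUMBER', tok)),
        (r'\s+',          None),
    ])

    tokens, remainder = scanner.scan('foo 42 bar')
    # tokens    -> [('NAME', 'foo'), ('NUMBER', '42'), ('NAME', 'bar')]
    # remainder -> whatever trailing input did not match (here, '')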

> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
>
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
>
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps. You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N. But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).

Trust me, I already knew that.  But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse
nested structures using regular expressions, Jamie Zawinski kills a
puppy," which is what I was responding to.  And right after
regurgitating that inside joke, Chris Rebert went on to say "Try
using an *actual* parser, such as Pyparsing."  Which is all well and
good, except that the OP will then download pyparsing, take a look,
realize that it uses regexps under the hood, and possibly be very
confused.

> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
> sequentially.

It's not that bad if you do it right.  You can first rip the string
apart with a single regexp, then use a dict-based scheme to categorize
the pieces, so you never have to test a long list of regexps in
sequence.
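
Something like this (the keyword table and token classes are
hypothetical, just to show the shape of it):

    import re

    # Hypothetical keyword table; a real tokenizer would have more entries.
    KEYWORDS = {'if': 'IF', 'else': 'ELSE', 'while': 'WHILE'}

    def tokenize(text):
        # One regexp rips the string apart; dict lookups and a couple
        # of cheap string tests classify the pieces afterwards.
        for tok in re.findall(r'\w+|[^\w\s]', text):
            if tok in KEYWORDS:
                yield KEYWORDS[tok], tok
            elif tok.isdigit():
                yield 'NUMBER', tok
            elif tok[0].isalpha() or tok[0] == '_':
                yield 'NAME', tok
            else:
                yield 'OP', tok

    # list(tokenize('if x else 42')) ->
    # [('IF', 'if'), ('NAME', 'x'), ('ELSE', 'else'), ('NUMBER', '42')]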

> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.

There is no doubt that you can get better performance with C than with
Python.  But, for a lot of tasks, the Python performance is
acceptable, and, as always, algorithm, algorithm, algorithm...

A case in point: on a real-world dataset of JSON data, my pure Python
RSON parser is faster on my computer than the json decoder that ships
with Python 2.6, *even with* the json decoder's C speedups enabled.

Having said that, the pure Python parser in current simplejson
subversion is slightly faster than my RSON parser, and the C
reimplementation of the parser in current simplejson subversion
completely blows the doors off it.

So, a naive translation to C, even by an experienced programmer, may
not do as much for you as an algorithm rework.

Regards,
Pat


