Python and Regular Expressions
pmaupin at gmail.com
Sat Apr 10 18:21:39 CEST 2010
On Apr 8, 5:13 am, Nobody <nob... at nowhere.com> wrote:
> On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
> >> Regular expressions != Parsers
> > True, but lots of parsers *use* regular expressions in their
> > tokenizers. In fact, if you have a pure Python parser, you can often
> > get huge performance gains by rearranging your code slightly so that
> > you can use regular expressions in your tokenizer, because that
> > effectively gives you access to a fast, specialized C library that is
> > built into practically every Python interpreter on the planet.
> Unfortunately, a typical regexp library (including Python's) doesn't allow
> you to match against a set of regexps, returning the index of which one
> matched. Which is what you really want for a tokeniser.
Actually, there is some not-very-well-documented code in the re module
that will let you do exactly that. But even without using that code,
a first cut with re.split() or re.finditer() to break the string
apart into tokens (without yet classifying them) is usually a huge
performance win over anything else in the standard library (or anything
you could write in pure Python) for this task.
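For the curious: re.Scanner is one such piece of under-documented machinery
in the re module — you hand it a list of (pattern, action) pairs, and it
tells you which one matched by calling the corresponding action. A quick
sketch (the patterns and token names here are just for illustration):

```python
import re

# re.Scanner exists in CPython's re module but is not in the docs.
# Each entry pairs a regexp with an action; a None action skips the match.
scanner = re.Scanner([
    (r'\d+',          lambda s, tok: ('NUMBER', tok)),
    (r'[a-zA-Z_]\w*', lambda s, tok: ('NAME', tok)),
    (r'[+\-*/=]',     lambda s, tok: ('OP', tok)),
    (r'\s+',          None),  # ignore whitespace
])

# scan() returns the list of action results plus any unmatched remainder.
tokens, remainder = scanner.scan('x = 3 + 42')
# tokens -> [('NAME', 'x'), ('OP', '='), ('NUMBER', '3'),
#            ('OP', '+'), ('NUMBER', '42')]
```

Under the hood it builds one combined alternation, so the whole token
stream is produced by the C regexp engine in a single pass.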
> >> Every time someone tries to parse nested structures using regular
> >> expressions, Jamie Zawinski kills a puppy.
> > And yet, if you are parsing stuff in Python, and your parser doesn't
> > use some specialized C code for tokenization (which will probably be
> > regular expressions unless you are using mxtexttools or some other
> > specialized C tokenizer code), your nested structure parser will be
> > dog slow.
> The point is that you *cannot* match arbitrarily-nested expressions using
> regexps. You could, in theory, write a regexp which will match any valid
> syntax up to N levels of nesting, for any finite N. But in practice, the
> regexp is going to look horrible (and is probably going to be quite
> inefficient if the regexp library uses backtracking rather than a DFA).
Trust me, I already knew that. But what you just wrote is a much more
useful thing to tell the OP than "Every time someone tries to parse
nested structures using regular expressions, Jamie Zawinski kills a
puppy" which is what I was responding to. And right after
regurgitating that inside joke, Chris Rebert then went on to say "Try
using an *actual* parser, such as Pyparsing". Which is all well and
good, except then the OP will download pyparsing, take a look, realize
that it uses regexps under the hood, and possibly be very confused.
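To make the bounded-nesting point above concrete, here's a sketch (the
helper name and toy parenthesis grammar are mine, purely for illustration)
of building a regexp that matches parentheses nested at most N deep — note
how the pattern grows with every extra level:

```python
import re

def nested_paren_re(n):
    # Build a regexp matching text with balanced parens up to n levels
    # deep, by wrapping the pattern in itself n times.
    pattern = r'[^()]*'
    for _ in range(n):
        pattern = r'[^()]*(?:\(%s\)[^()]*)*' % pattern
    return re.compile(r'^%s$' % pattern)

rx = nested_paren_re(2)
rx.match('a (b (c) d) e')   # two levels deep: matches
rx.match('a (b (c (d)))')   # three levels deep: no match
```

Workable for small N, but the pattern size grows with the depth limit,
which is exactly why "up to N levels" regexps get horrible in practice.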
> Even tokenising with Python's regexp interface is inefficient if the
> number of token types is large, as you have to test against each regexp
It's not that bad if you do it right. You can first rip things apart,
then use a dict-based scheme to categorize them.
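A quick illustrative sketch of what I mean (the token patterns and
categories are invented for the example): one regexp rips the source into
candidate tokens, then cheap dict/set lookups — not more regexp tests —
classify each one.

```python
import re

# One pass of the C regexp engine pulls out all the candidate tokens.
token_pat = re.compile(r'\d+|\w+|[^\w\s]')

# Classification is then plain dict lookups and string tests, which are
# fast and don't scale with the number of token types.
OPERATORS = {'+': 'OP', '-': 'OP', '*': 'OP', '=': 'ASSIGN'}

def classify(tok):
    if tok in OPERATORS:
        return OPERATORS[tok]
    if tok[0].isdigit():
        return 'NUMBER'
    return 'NAME'

tokens = [(classify(t), t) for t in token_pat.findall('x = y + 10')]
# tokens -> [('NAME', 'x'), ('ASSIGN', '='), ('NAME', 'y'),
#            ('OP', '+'), ('NUMBER', '10')]
```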
> Ultimately, if you want an efficient parser, you need something with a C
> component, e.g. Plex.
There is no doubt that you can get better performance with C than with
Python. But, for a lot of tasks, the Python performance is
acceptable, and, as always, algorithm, algorithm, algorithm...
A case in point: on my computer, my pure Python RSON parser is faster on
a real-world dataset of JSON data than the json decoder that ships with
Python 2.6, *even with* the json decoder's C speedups enabled.
Having said that, the current subversion pure Python simplejson parser
is slightly faster than my RSON parser, and the C reimplementation of
the parser in current subversion simplejson completely blows the doors
off my RSON parser.
So, a naive translation to C, even by an experienced programmer, may
not do as much for you as an algorithm rework.