Python and Regular Expressions

Nobody nobody at nowhere.com
Thu Apr 8 06:13:00 EDT 2010


On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:

>> Regular expressions != Parsers
> 
> True, but lots of parsers *use* regular expressions in their
> tokenizers.  In fact, if you have a pure Python parser, you can often
> get huge performance gains by rearranging your code slightly so that
> you can use regular expressions in your tokenizer, because that
> effectively gives you access to a fast, specialized C library that is
> built into practically every Python interpreter on the planet.

Unfortunately, a typical regexp library (including Python's) doesn't allow
you to match against a set of regexps and report which one of them
matched, which is what you really want for a tokeniser.
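
A common workaround is to OR the token regexps together as named
alternatives in a single pattern and ask the match object, via its
lastgroup attribute, which alternative fired. A minimal sketch, with
made-up token types:

    import re

    # One combined pattern; the group name records which token matched.
    # The token types here are illustrative, not from the original post.
    TOKEN_RE = re.compile(r"""
          (?P<NUMBER>\d+)
        | (?P<NAME>[A-Za-z_]\w*)
        | (?P<OP>[-+*/=()])
        | (?P<WS>\s+)
    """, re.VERBOSE)

    def tokenise(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError("unexpected character %r" % text[pos])
            if m.lastgroup != "WS":       # drop whitespace tokens
                yield m.lastgroup, m.group()
            pos = m.end()

That gives you the *name* of the winning regexp rather than an index,
which is close enough for a tokeniser, though the alternatives are
still tried left to right inside the engine.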

>> Every time someone tries to parse nested structures using regular
>> expressions, Jamie Zawinski kills a puppy.
> 
> And yet, if you are parsing stuff in Python, and your parser doesn't
> use some specialized C code for tokenization (which will probably be
> regular expressions unless you are using mxTextTools or some other
> specialized C tokenizer code), your nested structure parser will be
> dog slow.

The point is that you *cannot* match arbitrarily-nested expressions using
regexps: a regular language cannot count nesting to unbounded depth. You
could, in theory, write a regexp which will match any valid syntax up to N
levels of nesting, for any finite N. But in practice, the regexp is going
to look horrible (and is probably going to be quite inefficient if the
regexp library uses backtracking rather than a DFA).
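
To make that concrete, here is one mechanical way to build such a regexp
for parentheses nested at most N deep (illustrative code, not from the
original post):

    import re

    def parens_up_to(n):
        # Each wrap allows one more level of balanced parentheses.
        pattern = r"[^()]*"
        for _ in range(n):
            pattern = r"[^()]*(?:\(%s\)[^()]*)*" % pattern
        return re.compile(pattern + r"\Z")  # whole string must match

    p = parens_up_to(2)
    bool(p.match("a(b(c)d)e"))    # True:  two levels of nesting
    bool(p.match("a(b(c(d))e)"))  # False: three levels

The pattern grows with every extra level, N has to be fixed in advance,
and a backtracking engine can still take a long time to reject
near-misses.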

Even tokenising with Python's regexp interface is inefficient if the
number of token types is large, as you have to test against each regexp
sequentially.
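
The straightforward version of that sequential test looks like this
(again with invented token types):

    import re

    # One compiled regexp per token type, tried in order at each
    # position: with k token types that is up to k Python-level match
    # calls per token.
    SPECS = [("NUMBER", re.compile(r"\d+")),
             ("NAME",   re.compile(r"[A-Za-z_]\w*")),
             ("WS",     re.compile(r"\s+"))]

    def tokenise_slow(text):
        pos = 0
        while pos < len(text):
            for name, rx in SPECS:
                m = rx.match(text, pos)
                if m:
                    if name != "WS":
                        yield name, m.group()
                    pos = m.end()
                    break
            else:
                raise SyntaxError("unexpected character %r" % text[pos])

Folding everything into one alternation (as sketched earlier) moves that
loop into C, but the engine still tries the alternatives one by one.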

Ultimately, if you want an efficient parser, you need something with a C
component, e.g. Plex.



