[Python-Dev] Re: Automatic flex interface for Python?

François Pinard pinard@iro.umontreal.ca
21 Aug 2002 17:14:42 -0400


[Skip Montanaro]

> The SpamAssassin folks are starting to look at Flex for much faster
> regular expression matching in situations where large numbers of static
> re's must be matched.

This problem was also vivid in `procmail'-based spam filtering, as I observed
it many years ago, and I remember having looked longingly at Flex at the
time.  I finally solved my own problem by writing a Python program able to
sustain thousands of specialised rules rather speedily, proving once again
that a good algorithm often matters more than a fast language :-).

> I wonder if using something like SciPy's weave tool would make it
> (relatively) painless to incorporate fairly high-speed scanners into
> Python programs.

For a pure Python solution, PLEX could also be an avenue.  It compiles a fast
automaton, similar to what Flex does, from a grammar describing all tokens.
I tried PLEX recently and was satisfied, even if not astonished by the speed.
Also wanting fast parsing, I first used various Python heuristics, but the
result was not solid enough for my taste.  So I finally opted for Bison,
and once on that road, it was just natural to rewrite the PLEX part in Flex.
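
For readers who never saw it, the flavour of such a scanner loop is roughly
as below.  This is only a sketch using the standard `re' module as a
stand-in for the compiled automaton; the names are mine, not PLEX's actual
interface.  A real PLEX lexicon compiles all the token patterns into a
single DFA instead of a backtracking alternation.

    import re

    # One named group per token kind.
    TOKEN_RE = re.compile(r"""
        (?P<number>\d+)
      | (?P<name>[A-Za-z_]\w*)
      | (?P<space>\s+)
    """, re.VERBOSE)

    def tokens(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise SyntaxError("no token at position %d" % pos)
            if m.lastgroup != "space":   # skip white space
                yield m.lastgroup, m.group()
            pos = m.end()

    print(list(tokens("x1 42 foo")))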

[Guido van Rossum]

> I think you're exaggerating the problem, or at least underestimating
> the re module.  The re module is pretty fast!

There are limits to what a backtracking matcher can do, speed-wise,
when there are many hundreds of alternated patterns.
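
To make this concrete, the usual trick is to glue all the patterns into a
single alternation, as in the sketch below (the patterns are invented for
illustration).  A backtracking engine tries the branches more or less one
after another at each position, while a Flex-built DFA inspects each input
character only once, however many patterns were compiled together.

    import re

    # In real life these would be hundreds of SpamAssassin-style rules.
    patterns = [r"viagra", r"fr[e3]{2}\s+m[o0]ney", r"click\s+here"]

    # Wrap each pattern so the alternation cannot change its meaning.
    combined = re.compile("|".join("(?:%s)" % p for p in patterns),
                          re.IGNORECASE)

    for m in combined.finditer("Please CLICK  HERE for fr33 money!"):
        print(m.start(), m.group())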

[Eric S. Raymond]

> It's pretty simple, actually.  Lexing *is* tokenizing; it's breaking the
> input stream into appropriate lexical units.  [...]  Parsing, on the
> other hand, consists of attempting to match your input to a grammar.
> The result of a parse is typically either "this matches" or to throw
> an error.  There are two kinds of parsers -- event generators and
> structure builders.

Maybe lexing matches a grammar at a point, generates an event according to
the match, advances the cursor, and repeats until end-of-file is met.
Typically, parsing matches a grammar at a point once and does not repeat.

In some less usual applications, there may be successive lexing stages,
each taking its input from the output of the previous one.  Parsing may
be driven in stages too.  I guess that we use the word `lexer' when the
output structure is more flat, and `parser' when the output structure is
more sexy!  There are cases where the distinction is rather fuzzy.
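
Chained generators give a cheap way to picture such staged lexing.  A tiny
sketch, with the stages and token kinds invented for illustration:

    def words(text):
        # First stage: characters in, raw words out.
        for word in text.split():
            yield word

    def classified(stream):
        # Second stage: raw words in, (kind, word) pairs out.
        for word in stream:
            kind = "number" if word.isdigit() else "name"
            yield kind, word

    print(list(classified(words("foo 42 bar"))))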

[Skip Montanaro]

>     Guido> I haven't given up on the re module for fast scanners (see
>     Guido> Tim's note on the speed of tokenizing 20,000 messages in
>     Guido> mere minutes).  Note that the Bayes approach doesn't *need*
>     Guido> a trick to apply many regexes in parallel to the text.

> Right.  I'm thinking of it in situations where you do need such tricks.
> SpamAssassin is one such place.  I think Eric has an application (quickly
> tokenizing the data produced by an external program, where the data can
> run into several hundreds of thousands of lines) where this might be
> beneficial as well.

Heap queues could be useful here as well.  Suppose you have hundreds of
regexps to match in parallel on a big text.  When a regexp is not too
complex, it is likely that a fast searcher exists for it.  Let's build one
searcher per regexp.  Beginning from the start of the text, find the spot
where each regexp first matches.  Build a heap queue using the position as
key, and both the regexp and the match data as value.

The top of the heap is the spot of your first match: process it, remove it
from the heap, search forward from just past that spot for the same regexp,
and add any new match back to the heap.  Simply repeat until the heap is
empty.
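
In Python, the scheme might look like the following rough sketch, built on
the standard `heapq' and `re' modules (the function and variable names are
mine, made up for illustration):

    import heapq
    import re

    def parallel_search(patterns, text):
        # Yield matches of many regexps over `text', in text order,
        # keeping one pending match per regexp in a heap keyed on
        # its position.
        searchers = [re.compile(p) for p in patterns]
        heap = []
        # Prime the heap with the first match of each regexp, if any.
        for index, searcher in enumerate(searchers):
            m = searcher.search(text)
            if m is not None:
                heapq.heappush(heap, (m.start(), index, m))
        while heap:
            position, index, m = heapq.heappop(heap)
            yield patterns[index], m
            # Resume just past this match; stepping one character
            # ahead guards against looping on zero-width matches.
            resume = m.end() if m.end() > position else position + 1
            nxt = searchers[index].search(text, resume)
            if nxt is not None:
                heapq.heappush(heap, (nxt.start(), index, nxt))

    for pattern, m in parallel_search([r"\d+", r"spam+", r"ham"],
                                      "spam 12 ham 345 spammm"):
        print(pattern, m.start(), m.group())

Each searcher still scans the whole text once over the course of the run,
but simple regexps can be searched quite quickly, and the matches come out
sorted by position for free.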

I'm not fully sure, but I have the intuition that the above could be
advantageous.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard