
[Skip Montanaro]
The SpamAssassin folks are starting to look at Flex for much faster regular expression matching in situations where large numbers of static re's must be matched.
This problem was also acute in `procmail'-based SPAM filtering, as I observed it many years ago, and I remember having been quite tempted by Flex at the time. I finally solved my own problem by writing a Python program able to sustain thousands of specialised rules rather speedily, proving once again that a good algorithm often buys more than the raw speed of the language :-).
I wonder if using something like SciPy's weave tool would make it (relatively) painless to incorporate fairly high-speed scanners into Python programs.
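For what it's worth, a rough sketch of what that might look like with scipy.weave from that era of SciPy; the helper name and the C fragment are invented, and the example leans on weave's default conversion of Python strings to C++ std::string, so treat it as a shape, not a recipe:

    from scipy import weave

    def count_occurrences(text, token):
        # Made-up helper: count non-overlapping occurrences of `token' in
        # `text' with a small piece of inline C, just to show how a bit of
        # compiled scanning code can be dropped into a Python loop.
        if not token:
            return 0
        code = r"""
            int count = 0;
            const char *p = text.c_str();
            const char *t = token.c_str();
            int step = token.length();
            while ((p = strstr(p, t)) != NULL) {
                count++;
                p += step;
            }
            return_val = count;
        """
        return weave.inline(code, ['text', 'token'], headers=['<string.h>'])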
For a pure Python solution, PLEX could also be an avenue. It compiles a fast automaton, similar to what Flex does, from a grammar describing all tokens. I tried PLEX recently and was satisfied, even if not astonished, by the speed. Also wanting fast parsing, I first used various Python heuristics, but the result was not solid enough for my taste. So I finally opted for Bison, and once on that road, it was just natural to rewrite the PLEX part in Flex.
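For readers who have not seen PLEX, a scanner is built by compiling a Lexicon of (pattern, action) pairs and then pulling tokens from a Scanner; the token names and input file below are invented, but the interface is roughly the one from PLEX's own documentation:

    from Plex import Lexicon, Scanner, Str, Any, Range, Rep, Rep1, IGNORE

    letter = Range("AZaz") | Str("_")
    digit = Range("09")
    name = letter + Rep(letter | digit)
    number = Rep1(digit)
    space = Rep1(Any(" \t\n"))

    # Each entry pairs a pattern with an action: a token label to return,
    # or IGNORE to skip the matched text.  PLEX compiles the whole table
    # into a single automaton, much as Flex would.
    lexicon = Lexicon([
        (name,   'ident'),
        (number, 'int'),
        (space,  IGNORE),
    ])

    f = open("input.txt", "r")
    scanner = Scanner(lexicon, f, "input.txt")
    while 1:
        token, text = scanner.read()    # returns (None, '') at end of file
        if token is None:
            break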
[Guido van Rossum]
I think you're exaggerating the problem, or at least underestimating the re module. The re module is pretty fast!
There are limits to what a backtracking matcher could do, speed-wise, when there are many hundreds of alternated patterns.
[Eric S. Raymond]
It's pretty simple, actually. Lexing *is* tokenizing; it's breaking the input stream into appropriate lexical units. [...] Parsing, on the other hand, consists of attempting to match your input to a grammar. The result of a parse is typically either "this matches" or an error. There are two kinds of parsers -- event generators and structure builders.
Maybe lexing matches a grammar at a point, generates an event according to the match, advances the cursor, and repeats until the end of file is reached. Typically, parsing matches a grammar at a point and does not repeat. In some less usual applications, there may be successive lexing stages, each taking its input from the output of the previous one. Parsing may be driven in stages too. I guess that we use the word `lexer' when the output structure is more flat, and `parser' when the output structure is more sexy! There are cases where the distinction is rather fuzzy.
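To make that "match at a point, emit an event, advance, repeat" loop concrete, here is a small hand-rolled tokenizer along those lines; the token table and the handler are invented for the example:

    import re

    # Invented token table: each entry is (event name, compiled pattern).
    token_res = [
        ('NUMBER', re.compile(r'\d+')),
        ('NAME',   re.compile(r'[A-Za-z_]\w*')),
        ('SPACE',  re.compile(r'\s+')),
    ]

    def tokenize(text, handle):
        pos = 0
        while pos < len(text):
            for kind, rx in token_res:
                m = rx.match(text, pos)          # match the grammar at the cursor
                if m:
                    if kind != 'SPACE':
                        handle(kind, m.group())  # generate an event
                    pos = m.end()                # advance the cursor
                    break
            else:
                raise ValueError("no token matches at position %d" % pos)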
[Skip Montanaro]
Guido> I haven't given up on the re module for fast scanners (see Tim's
Guido> note on the speed of tokenizing 20,000 messages in mere minutes).
Guido> Note that the Bayes approach doesn't *need* a trick to apply many
Guido> regexes in parallel to the text.
Right. I'm thinking of it in situations where you do need such tricks. SpamAssassin is one such place. I think Eric has an application (quickly tokenizing the data produced by an external program, where the data can run into several hundred thousand lines) where this might be beneficial as well.
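One such trick with the re module is to fold the static patterns into a single alternation of named groups, so one compiled regex scans the text in a single pass; the rule table below is invented, just to show the shape of it:

    import re

    # Invented rule table: (name, pattern) pairs for the static patterns.
    rules = [
        ('VIAGRA',  r'\bviagra\b'),
        ('CLICK',   r'click here'),
        ('DOLLARS', r'\$\d[\d,]*'),
    ]

    big = re.compile('|'.join(['(?P<%s>%s)' % (name, pat)
                               for name, pat in rules]), re.IGNORECASE)

    def hits(text):
        # One pass over the text; lastgroup names the rule that fired.
        return [(m.lastgroup, m.group()) for m in big.finditer(text)]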
Heap queues could be useful here as well. Suppose you have hundreds of regexps to match in parallel over a big text. When a regexp is not too complex, it is easier and more likely that a fast searcher exists for it. Build one searcher per regexp. Starting from the beginning of the text, find the spot where each regexp first matches, and build a heap queue using the position as key and both the regexp and the match data as value. The top of the heap is the spot of your first match: process it, and while removing it from the heap, search forward from that spot with the same regexp, adding any result back to the heap. Repeat until the heap is empty. I'm not fully sure, but I have the intuition the above could be advantageous.

-- François Pinard   http://www.iro.umontreal.ca/~pinard
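A sketch of that heap-queue merge, using the standard heapq module with one re searcher per pattern; the pattern list and the process() callback are placeholders:

    import heapq
    import re

    def scan(text, patterns, process):
        searchers = [re.compile(p) for p in patterns]
        heap = []
        # Prime the heap with the first match of every searcher; the index
        # i breaks ties so match objects are never compared.
        for i, rx in enumerate(searchers):
            m = rx.search(text, 0)
            if m:
                heap.append((m.start(), i, m))
        heapq.heapify(heap)
        # Repeatedly take the earliest match, hand it to the caller, then
        # push that searcher's next match, found past the one just used.
        while heap:
            pos, i, m = heapq.heappop(heap)
            process(patterns[i], m)
            nxt = searchers[i].search(text, m.end())
            if nxt:
                heapq.heappush(heap, (nxt.start(), i, nxt))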