Does my RE rock, or suck?
mwh at python.net
Fri Jul 9 14:12:00 CEST 2004
Bryan Olson <fakeaddress at nowhere.org> writes:
> Thanks Michael. I still have more investigating to do.
> Michael Hudson wrote:
> > Well, the re module is an NFA engine, not a DFA engine, so I'd expect
> > this re to run in time proportional to the number of triggers.
> I guess I didn't point this out except for my example: I tried
> to make the RE efficient even for a search-and-backtrack engine.
> Not only are there no loop-like things, there are never multiple
> choices that begin with the same character. Any time it
> backtracks, it will fail immediately at every choice point.
Hmm, you sound like you know more about this sort of thing than I do...
> In my example I compile the RE:
> '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'
> If the RE compiler is smart, searching should involve scanning
> the longest prefix that could still match, plus looking at just
> one character for each branch-point along the way. (If it uses
> a jump-table at decision points, it should be truly linear.)
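An alternation with this shared-prefix shape can be generated mechanically from the trigger list. Here is a minimal sketch (the `build_re` helper is mine, not from the thread; it sorts branches, so the output ordering differs from the RE quoted above):

```python
import re

def build_re(triggers):
    """Turn a list of triggers into a prefix-sharing alternation.

    Triggers with a common prefix are merged, so no two branches at a
    choice point begin with the same character -- the property relied
    on above for fail-fast backtracking.
    """
    def node(words):
        groups = {}
        empty = False
        for w in words:
            if w:
                groups.setdefault(w[0], []).append(w[1:])
            else:
                empty = True
        branches = [re.escape(ch) + node(rest)
                    for ch, rest in sorted(groups.items())]
        if not branches:
            return ''
        if len(branches) == 1 and not empty:
            return branches[0]
        return '(' + '|'.join(branches) + ('|' if empty else '') + ')'
    return '^' + node(list(triggers))

rx = re.compile(build_re(
    ['Mary', 'John', 'John Smith', 'John Smith Sr', 'John Smith Jr', 'Bob']))
```

With those triggers this yields a pattern equivalent to the one quoted, and the longest matching trigger wins because each longer branch is tried before its empty alternative.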
Well, I don't know much about the re compiler. It's written in
Python, though, so it shouldn't be *that* hard to check.
> > "Dozens to hundreds" sounds like a fairly small number in context,
> I agree, though hundreds would make the longest RE string I've
> ever used.
> > but: time it! Only you know what fast enough is.
> It seems fast. Testing doesn't allay my fears of a bad worst-
> case that an adversary might discover.
> > Here's a cute approach that should have running time independent
> > of the number of triggers:
> I've seen that kind of structure called a "trie" (Knuth cites
> the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
> my RE, I sort of use the same trie structure.
Heh, I'd sort of heard of tries but didn't realize I was implementing
one. I just thought it was a simple minded DFA engine...
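For reference, that kind of trie can be sketched as a nested dict with a terminal marker, plus a lookup that examines each input character once (my own minimal reconstruction, not the code from the earlier post):

```python
def make_trie(triggers):
    """Build a nested-dict trie; the None key marks the end of a trigger."""
    root = {}
    for word in triggers:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word  # terminal marker stores the completed trigger
    return root

def longest_match(trie, text):
    """Return the longest trigger that is a prefix of text, or None.

    Time is bounded by the length of the longest trigger,
    independent of how many triggers the trie holds.
    """
    node, best = trie, None
    for ch in text:
        if None in node:
            best = node[None]
        node = node.get(ch)
        if node is None:
            return best
    if None in node:
        best = node[None]
    return best
```

This is where the worst-case guarantee comes from: the search never revisits a character, so an adversarial input can't trigger backtracking.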
> My (trivial) tests so far indicate that the RE form is faster in
> typical cases. Your trie-searcher is nice in that I have much more
> confidence in the worst-case run-time.
Well, the re engine is written in C, so you would hope that it beats
my Python engine...
The reason I originally wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.
> > Here's a less cute but probably saner approach:
> > this should run in time proportional to the number of different
> > lengths of trigger (and use rather less memory than my first version).
> I'm not convinced. The time to hash a string is proportional to
> the length, so I think searches can take time proportional to
> the sum of all the trigger lengths.
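The length-based approach under discussion (reconstructed here from the description; the names are mine) keeps the triggers in a set and probes one prefix per distinct trigger length. Each probe hashes a prefix of that length, which is exactly where the sum-of-trigger-lengths worst case comes from:

```python
def make_matcher(triggers):
    """Match by probing one prefix per distinct trigger length."""
    trigger_set = set(triggers)
    lengths = sorted({len(t) for t in triggers}, reverse=True)

    def match(text):
        # Longest length first: one set lookup per distinct length,
        # and each lookup hashes a prefix of that length.
        for n in lengths:
            if text[:n] in trigger_set:
                return text[:n]
        return None
    return match
```

So with triggers of many different lengths, a single search can indeed hash prefixes whose lengths sum to roughly the total trigger length, even though the number of lookups stays small.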
I guess, but the string hashing is done in C and is "probably" fast
enough that it doesn't matter in practice.
> [I, Bryan asked:]
> >> Am I likely to exceed recursion depth?
> > Not in 2.4 :-) And I doubt it before that, actually.
> But Python is only up to ... oh, the smiley. I get it.
Well, my point was there is no recursion limit in Python 2.4's sre
engine. But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.
Need to Know is usually an interesting UK digest of things that
happened last week or might happen next week. [...] This week,
nothing happened, and we don't care.
-- NTK Now, 2000-12-29, http://www.ntk.net/