Most efficient method to search text?
Michael Hudson
mwh at python.net
Tue Oct 22 07:24:23 EDT 2002
Tim Peters <tim.one at comcast.net> writes:
> [Michael Hudson, searching for words]
> > ...
> > Indeed, my code canes re when the wordlist gets long. Here's some
> > numbers ...
> > ...
> > So roughly linear behaviour from re, n-independent behaviour from my
> > code. Isn't it nice when things do what you expect?
>
> When you're young, it can seem that way. At my age, it mostly makes one
> nervous, waiting for the other and always larger and much smellier foot to
> fall <wink>.
Well, as I'm only doing this for fun, when it gets squashed by a heavy
foot I'll just give up and do something else :)
> > On the other hand, the "compiler" code I posted yesterday consumes
> > vast gobs of memory -- trying to compile a 10000 long list of random
> > words got the OOM killer into action. A version using lists seems
> > slightly better in this respect, though slower in execution.
>
> Which is, of course, the traditional tradeoff between NFA and DFA
> approaches -- in bad cases you can pay in memory and preprocessing time due
> to exponential state explosion, or you can pay at runtime.
I don't think I'm suffering from state explosion, in as much as I have
a version that uses a similar algorithm but different data structures
(flat lists) which goes *very* much quicker -- faster than the sre
compiler (tho' given that that does more and is written in Python, I
guess that's not a major acheivement).
I don't know where the other compiler is spending it's time, and all
trying to use hotshot to tell did was get me worried about what I've
done to its overhead.
> I see Bengt discovered how to use dicts for this, and that's probably what
> you'll use in the future too <wink>.
Nah, far too boring.
> For the more general case of lexing, you're searching for patterns
> (like r'[A-Za-z_]\w*'), and twisting a dict gets at best obscure.
I'd bet.
> There's a large literature now on compromise approaches, building
> only the parts of the DFA that the target string actually needs to
> explore, in a dynamic way. This sounds horribly expensive at first
> glance (don't think about that mixing eyes with ears doesn't make
> sense), but in some practical cases there are exceedingly clever
> ways to "fake it" with a handful of machine-level instructions per
> character, using integers as bit vectors representing the *set* of
> NFA nodes currently active.
Hmm, I'm basically implementing a (limited) regular expression engine
-- FOR FUN! I *really* need to get out more...
> Gonzo speeding of all arguably common cases of pattern-matching is a
> lifetime job. Although there's nothing unique about pattern-matching in
> that ...
>
> learning-to-live-with-fast-enough-leaves-more-time-for-life-ly y'rs - tim
Shh!
Cheers,
M.
--
31. Simplicity does not precede complexity, but follows it.
-- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
More information about the Python-list
mailing list