Most efficient method to search text?

Tue Oct 22 07:24:23 EDT 2002

Tim Peters <tim.one at comcast.net> writes:

> [Michael Hudson, searching for words]
> > ...
> > Indeed, my code canes re when the wordlist gets long.  Here's some
> > numbers ...
> > ...
> > So roughly linear behaviour from re, n-independent behaviour from my
> > code.  Isn't it nice when things do what you expect?
> 
> When you're young, it can seem that way.  At my age, it mostly makes one
> nervous, waiting for the other and always larger and much smellier foot to
> fall <wink>.

Well, as I'm only doing this for fun, when it gets squashed by a heavy
foot I'll just give up and do something else :)

> > On the other hand, the "compiler" code I posted yesterday consumes
> > vast gobs of memory -- trying to compile a 10000 long list of random
> > words got the OOM killer into action.  A version using lists seems
> > slightly better in this respect, though slower in execution.
> 
> Which is, of course, the traditional tradeoff between NFA and DFA
> approaches -- in bad cases you can pay in memory and preprocessing time due
> to exponential state explosion, or you can pay at runtime.

I don't think I'm suffering from state explosion, in as much as I have
a version that uses a similar algorithm but different data structures
(flat lists) which goes *very* much quicker -- faster than the sre
compiler (tho' given that that does more and is written in Python, I
guess that's not a major acheivement).

I don't know where the other compiler is spending it's time, and all
trying to use hotshot to tell did was get me worried about what I've
done to its overhead.

> I see Bengt discovered how to use dicts for this, and that's probably what
> you'll use in the future too <wink>.

Nah, far too boring.

> For the more general case of lexing, you're searching for patterns
> (like r'[A-Za-z_]\w*'), and twisting a dict gets at best obscure.

I'd bet.

> There's a large literature now on compromise approaches, building
> only the parts of the DFA that the target string actually needs to
> explore, in a dynamic way.  This sounds horribly expensive at first
> glance (don't think about that mixing eyes with ears doesn't make
> sense), but in some practical cases there are exceedingly clever
> ways to "fake it" with a handful of machine-level instructions per
> character, using integers as bit vectors representing the *set* of
> NFA nodes currently active.

Hmm, I'm basically implementing a (limited) regular expression engine
-- FOR FUN!  I *really* need to get out more...

> Gonzo speeding of all arguably common cases of pattern-matching is a
> lifetime job.  Although there's nothing unique about pattern-matching in
> that ...
> 
> learning-to-live-with-fast-enough-leaves-more-time-for-life-ly y'rs  - tim

Shh!

Cheers,
M.

-- 
31. Simplicity does not precede complexity, but follows it.
  -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html